# NanoLM-365M-base
English | [简体中文](README_zh-CN.md)
## Introduction
Starting from [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B), the tokenizer is replaced with [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer) in order to shrink the model. The total parameter count drops from 0.5B to 365M; the reduction comes almost entirely from the embedding (and tied output) matrix, which scales with the vocabulary size.
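
A minimal sketch of this step, assuming the standard `transformers` API: load the Qwen2-0.5B backbone, load the smaller bilingual tokenizer, and resize the embedding table to the new vocabulary. The repo ids come from the links above; the exact procedure (and whether the 8K tokenizer needs a `subfolder` argument) may differ from what the author actually did.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "Qwen/Qwen2-0.5B"
tokenizer_id = "Mxode/Bilingual-Tokenizer"  # BilingualTokenizer-8K (path is an assumption)

model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Shrink the input embeddings (and the tied LM head) from the original
# ~152K-entry vocabulary down to the 8K-entry bilingual vocabulary.
model.resize_token_embeddings(len(tokenizer))

total = sum(p.numel() for p in model.parameters())
print(f"Total params: {total / 1e6:.0f}M")  # should land around 365M
```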
## Details
To recover some of the performance and make downstream fine-tuning easier, I froze the backbone parameters after replacing the tokenizer and trained only the embedding layer, for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k). The configuration is summarized in the table below, followed by a minimal training sketch.
| Setting | Value |
| :-------------------------: | :----------------------------------------------------------: |
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | `model.embed_tokens` |
| Training Steps | 40,000 |
| Training Dataset | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Warm-up Ratio | 0.03 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |
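
Continuing from the sketch above, this is roughly how the freezing setup and the hyperparameters in the table map onto `transformers`: only `model.embed_tokens` keeps gradients, and the `TrainingArguments` mirror the table. Dataset loading, packing to a 4096-token sequence length, and the `Trainer` call are omitted; the names here are illustrative, not the author's actual script.

```python
from transformers import TrainingArguments

# `model` is the embedding-resized model from the previous sketch.
# Freeze everything except the input embedding table.
for name, param in model.named_parameters():
    param.requires_grad = "embed_tokens" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable / 1e6:.1f}M")  # ~7M with the 8K vocabulary, i.e. < 10M

training_args = TrainingArguments(
    output_dir="nanolm-365m-base",      # hypothetical output path
    max_steps=40_000,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    warmup_ratio=0.03,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    bf16=True,
)
```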
The detailed training record is shown below:
![result](static/result.png)