---
license: gpl-3.0
language:
  - en
---

# NanoLM-365M-base

English | [简体中文](README_zh-CN.md)

## Introduction

NanoLM-365M-base is based on [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B), with the original tokenizer replaced by [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer) to reduce the number of parameters. This shrinks the total parameter count from 0.5B to 365M.

## Details

To recover some performance and make fine-tuning for downstream tasks easier, I froze the backbone parameters after replacing the tokenizer and trained only the embedding part (see the sketch at the end of this card). Training ran for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k).

| Setting | Value |
| :-------------------------: | :----------------------------------------------------------: |
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | `model.embed_tokens` |
| Training Steps | 40,000 |
| Training Dataset | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Warm-up Ratio | 0.03 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |

The training records are as follows:

![result](static/results.png)
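For reference, below is a minimal sketch of the freeze-and-retrain setup described above, using the `transformers` library. The repo ids come from the links in this card; loading the 8K tokenizer directly from `Mxode/Bilingual-Tokenizer` is an assumption, and this is not the exact training script used for NanoLM-365M-base.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the 8K bilingual tokenizer loads directly from the linked repo.
tokenizer = AutoTokenizer.from_pretrained("Mxode/Bilingual-Tokenizer")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B", torch_dtype=torch.bfloat16
)

# Shrink the embedding matrix to the new, much smaller vocabulary.
model.resize_token_embeddings(len(tokenizer))

# Freeze the backbone: only `model.embed_tokens` stays trainable.
for name, param in model.named_parameters():
    param.requires_grad = "embed_tokens" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f} M / total: {total / 1e6:.1f} M")
```

From here, the model can be trained with a standard `Trainer` loop using the hyperparameters listed in the table above.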