---
license: gpl-3.0
language:
- en
datasets:
- HuggingFaceTB/cosmopedia-100k
- pleisto/wikipedia-cn-20230720-filtered
pipeline_tag: text-generation
tags:
- text-generation-inference
---

# NanoLM-365M-base

English | [简体中文](README_zh-CN.md)

## Introduction

NanoLM-365M-base is built on [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B), with the original tokenizer replaced by [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer) to reduce the number of parameters. This shrinks the model from 0.5B to 365M total parameters.
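
The model loads like any other `transformers` causal LM. Below is a minimal usage sketch; the repository id `Mxode/NanoLM-365M-base` is an assumption and may need to be replaced with the actual checkpoint path.

```python
# Usage sketch -- the repo id is an assumption; swap in the real checkpoint or a local path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Mxode/NanoLM-365M-base"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# This is a base (non-instruct) model, so use a plain completion-style prompt.
prompt = "The Great Wall of China is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```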

## Details

To recover some performance and make fine-tuning for downstream tasks easier, I chose to freeze the backbone parameters and train only the embedding layer after replacing the tokenizer (see the sketch below). Training was run for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k).
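
A minimal sketch of that setup, assuming a standard `transformers` Qwen2-style checkpoint where the input embedding is exposed as `model.embed_tokens`; the vocabulary size below is illustrative rather than the exact figure:

```python
from transformers import AutoModelForCausalLM

NEW_VOCAB_SIZE = 8000  # roughly BilingualTokenizer-8K; illustrative value

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Shrink the embedding matrix to the new, much smaller vocabulary;
# this is where most of the 0.5B -> 365M reduction comes from.
model.resize_token_embeddings(NEW_VOCAB_SIZE)

# Freeze the backbone; only the (re-initialized) embeddings stay trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("model.embed_tokens")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable / 1e6:.1f} M")  # should come out under 10 M
```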

| | Value |
| :-------------------------: | :----------------------------------------------------------: |
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | `model.embed_tokens` |
| Training Steps | 40,000 |
| Training Dataset | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Warm-up Ratio | 0.03 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |
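
These hyperparameters map directly onto a Hugging Face `TrainingArguments` configuration. The sketch below assumes the standard `Trainer` was used (suggested by the `adamw_torch` optimizer name); the output path is a placeholder, and the 4096 sequence length is applied during tokenization/packing rather than here.

```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments setup mirroring the table above
# (an assumption, not the author's exact training script).
training_args = TrainingArguments(
    output_dir="nanolm-365m-base",      # placeholder path
    max_steps=40_000,
    optim="adamw_torch",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    warmup_ratio=0.03,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    bf16=True,
)
```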

The specific training records are as follows:

![result](static/result.png)