---
language:
- en
- zh
library_name: transformers
tags:
- Long Context
- qwen2.5
- qwen2
---

# MS-LongWriter-Qwen2.5-7B-Instruct

🤖 [LongWriter Dataset] • 💻 [Github Repo] • 📃 [LongWriter Paper] • 📃 [Tech Report]

MS-LongWriter-Qwen2.5-7B-Instruct is trained from [Qwen2.5-7B-Instruct](https://modelscope.cn/models/qwen/Qwen2.5-7B-Instruct) and is capable of generating 10,000+ words in a single response (a minimal inference sketch is provided at the end of this card). Training starts directly from Qwen2.5-7B-Instruct, while the [LongWriter-6k](https://modelscope.cn/datasets/ZhipuAI/LongWriter-6k) dataset is heavily filtered (distilled) down to 666 high-quality samples, released as [LongWriter-6k-filtered](https://modelscope.cn/datasets/swift/longwriter-6k-filtered).

## Datasets

1. [LongWriter-6k-filtered](https://modelscope.cn/datasets/swift/longwriter-6k-filtered), derived from [LongWriter-6k](https://modelscope.cn/datasets/ZhipuAI/LongWriter-6k).
2. [Magpie-Qwen2-Pro-200K-Chinese](https://modelscope.cn/datasets/AI-ModelScope/Magpie-Qwen2-Pro-200K-Chinese), 6k randomly sampled examples.
3. [Magpie-Qwen2-Pro-200K-English](https://modelscope.cn/datasets/AI-ModelScope/Magpie-Qwen2-Pro-200K-English), 6k randomly sampled examples.

## Model

We use [ms-swift](https://github.com/modelscope/swift) to fine-tune the Qwen2.5-7B-Instruct model.

1. Installation

```shell
pip install ms-swift[llm]
```

2. Fine-tuning

Envs:

```text
Nvidia A100 (80G) x 4
```

Run:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen2_5-7b-instruct \
    --dataset longwriter-6k-filtered#666 qwen2-pro-zh#6660 qwen2-pro-en#6660 \
    --max_length 28672 \
    --num_train_epochs 2 \
    --eval_steps 200 \
    --batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing true \
    --warmup_ratio 0.1 \
    --learning_rate 1e-5 \
    --sft_type full \
    --loss_name long-ce \
    --check_dataset_strategy warning \
    --save_only_model false \
    --save_total_limit -1 \
    --lazy_tokenize true \
    --dataloader_num_workers 1 \
    --resume_only_model true \
    --neftune_noise_alpha 5 \
    --use_flash_attn true
```

3. Fine-tuning with annealing

An annealing stage is used to further improve the model during post-training: we fine-tune the checkpoint from step 2 on the LongWriter-6k-filtered dataset alone, with the learning rate lowered to 2e-6.

Run:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen2_5-7b-instruct \
    --dataset longwriter-6k-filtered#666 \
    --max_length 28672 \
    --num_train_epochs 2 \
    --eval_steps 200 \
    --batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing true \
    --warmup_ratio 0.1 \
    --learning_rate 2e-6 \
    --sft_type full \
    --loss_name long-ce \
    --check_dataset_strategy warning \
    --save_only_model false \
    --save_total_limit -1 \
    --lazy_tokenize true \
    --dataloader_num_workers 1 \
    --resume_only_model true \
    --neftune_noise_alpha 5 \
    --use_flash_attn true \
    --resume_from_checkpoint {previous-checkpoint-path}
```

Note:

1. The `--resume_from_checkpoint` parameter specifies the path of the checkpoint produced in step 2.

## Evaluation

Refer to [LongWriter Evaluation](https://github.com/modelscope/evalscope/tree/main/evalscope/third_party/longbench_write) in [EvalScope](https://github.com/modelscope/evalscope).

## Reference

If you find our work helpful, please consider citing our paper and starring our GitHub repositories.

```bibtex
@misc{chen2024minimumtuningunlocklong,
      title={Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key},
      author={Yingda Chen and Xingjun Wang and Jintao Huang and Yunlin Mao and Daoze Zhang and Yuze Zhao},
      year={2024},
      eprint={2410.10210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.10210},
}
```
1. QbitAI (量子位) article: [666 training samples teach AI to write 10,000-word long-form text! Both the model and the dataset are open-sourced](https://mp.weixin.qq.com/s/LvWUSgIRO5HI5YSDRz7SxA)
2. Tech report: [Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key](https://arxiv.org/pdf/2410.10210)
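
## Inference

Below is a minimal sketch for trying the model with Hugging Face `transformers`. The local path `MS-LongWriter-Qwen2.5-7B-Instruct`, the prompt, and the generation hyperparameters (`max_new_tokens`, `temperature`, `top_p`) are illustrative assumptions rather than values prescribed by this card; since the model targets 10,000+ word responses, the main point is to give `generate` a sufficiently large `max_new_tokens` budget.

```python
# Minimal inference sketch (assumes a local or hub path to this model and enough GPU memory).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "MS-LongWriter-Qwen2.5-7B-Instruct"  # hypothetical path; replace with your checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Qwen2.5-style chat prompt asking for a very long output.
messages = [
    {"role": "user", "content": "Write a 10000-word article on the history of the Roman Empire."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Large max_new_tokens so the response is not cut off mid-essay (illustrative values).
outputs = model.generate(
    input_ids,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```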