metadata

language:
  - en
  - zh
library_name: transformers
tags:
  - Long Context
  - qwen2.5
  - qwen2

MS-LongWriter-Qwen2.5-7B-Instruct

🤖 [LongWriter Dataset] • 💻 [Github Repo] • 📃 [LongWriter Paper] • 📃 [Tech Report]

MS-LongWriter-Qwen2.5-7B-Instruct is trained based on https://modelscope.cn/models/qwen/Qwen2.5-7B-Instruct, and is capable of generating 10,000+ words at once.

MS-LongWriter-Qwen2.5-7B-Instruct begins training directly from the Qwen2.5-7B-Instruct, while performing significant distillation on the LongWriter-6k to obtain 666 high-quality samples, which is LongWriter-6k-filtered

Datasets

LongWriter-6k-filtered, based on the LongWriter-6k
Magpie-Qwen2-Pro-200K-Chinese , random sampling 6k examples.
Magpie-Qwen2-Pro-200K-English , random sampling 6k examples.

Model

We use ms-swift to fine-tune the Qwen2-7B-Instruct model.

Installation

pip install ms-swift[llm]

Fine-tuning

Envs:

Nvidia A100(80G) x 4

Run:

CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen2_5-7b-instruct \
    --dataset longwriter-6k-filtered#666 qwen2-pro-zh#6660 qwen2-pro-en#6660 \
    --max_length 28672 \
    --num_train_epochs 2 \
    --eval_steps 200 \
    --batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing true \
    --warmup_ratio 0.1 \
    --learning_rate 1e-5 \
    --sft_type full \
    --loss_name long-ce \
    --check_dataset_strategy warning \
    --save_only_model false \
    --save_total_limit -1 \
    --lazy_tokenize true \
    --dataloader_num_workers 1 \
    --resume_only_model true \
    --neftune_noise_alpha 5 \
    --use_flash_attn true

Fine-tuning with annealing

The annealing strategy is used to improve the performance of the model during the post-training process. We leverage the LongWriter-6k-filtered dataset to fine-tune the model with annealing, and set the learning rate to 2e-6. Run:

CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen2_5-7b-instruct \
    --dataset longwriter-6k-filtered#666 \
    --max_length 28672 \
    --num_train_epochs 2 \
    --eval_steps 200 \
    --batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing true \
    --warmup_ratio 0.1 \
    --learning_rate 2e-6 \
    --sft_type full \
    --loss_name long-ce \
    --check_dataset_strategy warning \
    --save_only_model false \
    --save_total_limit -1 \
    --lazy_tokenize true \
    --dataloader_num_workers 1 \
    --resume_only_model true \
    --neftune_noise_alpha 5 \
    --use_flash_attn true \
    --resume_from_checkpoint {previous-checkpoint-path}

Note:

The --resume_from_checkpoint parameter is used to specify the path of the previous checkpoint. (see the step2)

Evaluation

Refer to LongWriter Evaluation from the EvalScope.

Reference

If you find our work helpful, please consider citing our paper, and star our github repositories.

@misc{chen2024minimumtuningunlocklong,
      title={Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key}, 
      author={Yingda Chen and Xingjun Wang and Jintao Huang and Yunlin Mao and Daoze Zhang and Yuze Zhao},
      year={2024},
      eprint={2410.10210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.10210}, 
}

量子位文章：666条数据教会AI写万字长文！模型数据集都开源
Tech report: Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key