---
license: apache-2.0
library_name: transformers
tags:
- storm
- mistral
- openchat
- RLAIF
- reward model
language:
- en
base_model: openchat/openchat-3.5-0106
datasets:
- berkeley-nest/Nectar
---

# Storm-7B
- **Developed by**: [Jie Liu](https://jieliu.site/) \\(^{*1,2}\\), [Zhanhui Zhou](https://scholar.google.com/citations?user=SbACfYQAAAAJ&hl=zh-CN) \\(^{*2}\\), [Jiaheng Liu](https://liujiaheng.github.io/) \\(^{2}\\), [Xingyuan Bu](https://scholar.google.com.hk/citations?user=cqYaRhUAAAAJ&hl=zh-CN) \\(^{2}\\), [Chao Yang](https://scholar.google.com/citations?user=5KRbHPMAAAAJ&hl=zh-CN) \\(^{2}\\), [Han-Sen Zhong](https://scholar.google.com.hk/citations?user=X_ZfX8sAAAAJ&hl=zh-CN) \\(^{\dag 2}\\), [Wanli Ouyang](https://wlouyang.github.io/) \\(^{1,2}\\).
- \\(^{1}\\)MMLab, The Chinese University of Hong Kong &ensp;  \\(^{2}\\)Shanghai AI Laboratory
- Paper: [Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level](https://arxiv.org/pdf/2406.11817)
- Finetuned from the model: [openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106)
- Dataset: [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar)
- Reward Model: [Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B)

Please see our paper for more details.

## Introduction

We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/) leaderboard.

Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 **without increasing verbosity**.
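
To make the idea of a length penalty concrete, here is a minimal sketch of a per-pair DPO loss with a length margin added inside the sigmoid. It is an illustration under stated assumptions, not the training code: the function name `lr_dpo_loss`, the default values of `beta` and `alpha`, and the sign convention of the margin are assumptions chosen to follow the usual margin-style formulation, so please check the paper for the exact objective.

```python
import math

def lr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                len_w, len_l, beta=0.1, alpha=0.01):
    """Per-pair DPO loss with a length margin (illustrative sketch).

    logp_* are sequence log-probabilities under the policy, ref_logp_* under
    the frozen reference model; *_w is the preferred response, *_l the other.
    beta and alpha are hypothetical defaults, not the paper's settings.
    """
    # Standard DPO logit: scaled difference of policy/reference log-ratios.
    dpo_logit = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Length margin (assumed form): a longer preferred response raises the
    # logit, so the policy is pushed less strongly toward long winners.
    margin = alpha * (len_w - len_l)
    z = dpo_logit + margin
    # -log(sigmoid(z)) computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

With `alpha=0` this reduces to the ordinary DPO loss. With `alpha>0`, a pair whose preferred response is shorter yields a larger loss than the same pair with the lengths swapped, so concise winners receive the stronger training signal.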

## Performance
Our 7B model achieves a **50.5%** length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.
<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/639be86b59473c6ae02ef9c4/Tj_a1QntAxkhy2SXbOdmT.png" width="60%">
</p>
Our model's LC win rate improves over iterations without significantly changing the response length, indicating better alignment with human values without length bias. The final trained model (iteration 3) achieves a 50.5% LC win rate, making it the first open-source model to surpass the baseline model GPT-4 Preview.

In addition to regular decoding, we also test beam search and best-of-n sampling on top of our trained model. Beam search over our trained model shows a 5% improvement over regular decoding. Best-of-n sampling with Starling-RM-34B achieves a 61.6% LC win rate and outperforms GPT-4 Omni.
<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/639be86b59473c6ae02ef9c4/GGa28vaREaVq099MPdqcP.png" width="100%">
</p>
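
For readers unfamiliar with best-of-n sampling, the sketch below shows the general technique: draw `n` candidate responses and keep the one the reward model scores highest. The `generate` and `score` callables are hypothetical stand-ins (for example, a sampler around Storm-7B and a wrapper around a reward model such as Starling-RM-34B); this is not the evaluation code used for the numbers above.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 4) -> str:
    """Return the highest-scoring of n sampled responses.

    generate(prompt) -> one sampled response (hypothetical sampler).
    score(prompt, response) -> scalar reward (hypothetical reward model).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Sample n independent candidates from the policy.
    candidates = [generate(prompt) for _ in range(n)]
    # Keep the candidate the reward model prefers.
    return max(candidates, key=lambda response: score(prompt, response))
```

Larger `n` trades extra generation and scoring cost for a better chance that at least one candidate scores well.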

We observe no significant degradation in traditional NLP tasks from the Hugging Face Open LLM Leaderboard.
<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/639be86b59473c6ae02ef9c4/8KEm_Ladg7Kqko8mC63SN.png" width="100%">
</p>


## Uses

Our model uses the same chat template as [openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106). A sample code snippet for inference using our model is provided below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
model.eval().requires_grad_(False)

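def generate_response(prompt):
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,  # passes both input_ids and attention_mask
        max_length=2048,  # cap on prompt tokens plus generated tokens
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, dropping the echoed prompt.
    response_ids = outputs[0][inputs.input_ids.shape[1]:]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text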

prompt = "How does a telescope work?"
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)
```
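
The single-turn prompt string above can also be built programmatically. The helper below is a minimal sketch: the name `build_prompt` and the message-dict layout are hypothetical, and the multi-turn layout is an assumption based on the OpenChat convention of ending every turn with `<|end_of_turn|>`. For a single user message it reproduces the prompt used in the snippet above.

```python
END_OF_TURN = "<|end_of_turn|>"

# Role prefixes taken from the prompt format shown above.
ROLE_PREFIX = {
    "user": "GPT4 Correct User",
    "assistant": "GPT4 Correct Assistant",
}

def build_prompt(messages):
    """Format a list of {"role", "content"} dicts into one prompt string.

    Hypothetical helper: each turn is rendered as "<prefix>: <content>"
    followed by the end-of-turn token, and the prompt ends with the
    assistant prefix so the model continues as the assistant.
    """
    turns = [
        f"{ROLE_PREFIX[m['role']]}: {m['content']}{END_OF_TURN}"
        for m in messages
    ]
    return "".join(turns) + f"{ROLE_PREFIX['assistant']}:"

single_turn = build_prompt(
    [{"role": "user", "content": "How does a telescope work?"}]
)
```

The resulting string can be passed directly to `generate_response`.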

## Scripts
You can reproduce our results on AlpacaEval 2.0 using the script provided below.
```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
```

## Limitations

Our work has several limitations:

1. We focus on aligning with human preferences but only use GPT-4 as a proxy for human judgment to evaluate language models.
2. We reduce verbosity with a length penalty, though verbosity and length are not necessarily correlated. Future work could train a specific reward model to directly penalize verbosity, replacing the length margin with a verbosity margin, following the standard [MODPO pipeline](https://github.com/ZHZisZZ/modpo).

## Citation

```
@article{liu2024iterative,
  title={Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level},
  author={Liu, Jie and Zhou, Zhanhui and Liu, Jiaheng and Bu, Xingyuan and Yang, Chao and Zhong, Han-Sen and Ouyang, Wanli},
  journal={arXiv preprint arXiv:2406.11817},
  year={2024}
}

@article{zhou2023beyond,
  title={Beyond one-preference-for-all: Multi-objective direct preference optimization},
  author={Zhou, Zhanhui and Liu, Jie and Yang, Chao and Shao, Jing and Liu, Yu and Yue, Xiangyu and Ouyang, Wanli and Qiao, Yu},
  journal={arXiv preprint arXiv:2310.03708},
  year={2023}
}
```