Update README.md
README.md CHANGED
@@ -22,7 +22,7 @@ This is a 70B reward model used for PPO training trained on the UltraFeedback da
 It was used to train [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm) model, and [this](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm-mixed-prompts) model.
 
 For more details, read the paper:
-[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://
+[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
 
 ## Model description
@@ -76,6 +76,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
 title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
 author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}}
 year={2024},
+eprint={2406.09279},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
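For reference, a minimal sketch of how a sequence-classification reward model like this one is typically queried for scalar scores with Hugging Face transformers. The repo id, dtype, and chat-template handling below are assumptions for illustration, not taken from this card; verify them against the model card before use.

```python
# Minimal sketch (assumptions, not the card's documented usage): score a
# prompt/response pair with a scalar reward head via transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "allenai/<this-70b-uf-reward-model>"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; a 70B model needs multi-GPU or offloading
    device_map="auto",
)
model.eval()

prompt = "Explain what a reward model does in RLHF."
response = "It assigns a scalar score to a candidate response, which PPO then maximizes."

# If the tokenizer ships a chat template, use it; otherwise format the pair manually.
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Assuming a single scalar reward head (num_labels=1): higher logit = preferred.
    reward = model(**inputs).logits[0, 0].item()
print(f"reward: {reward:.3f}")
```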
|