Text Generation
Transformers
PyTorch
English
llama
conversational
text-generation-inference
Inference Endpoints
hamishivi committed
Commit cc9f87c
1 Parent(s): 67706ba

Update README.md

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -23,7 +23,7 @@ The reward model used during training was the [Tulu v2.5 13B preference mixture
 We then used UltraFeedback prompts during PPO training.

 For more details, read the paper:
-[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
+[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).


 ## Model description
@@ -84,6 +84,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
     title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
     author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}}
     year={2024},
+    eprint={2406.09279},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
 }
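
For reference, the citation block as it would read after this commit is sketched below. The entry type and citation key are not visible in this hunk and are assumed here, and a trailing comma has been added after the author field, which the diff context shows as missing:

% entry type and key are assumptions; only the fields below appear in the diff
@misc{ivison2024unpacking,
    title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
    author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}},
    year={2024},
    eprint={2406.09279},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}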