ORPO: Monolithic Preference Optimization without Reference Model
Abstract
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across model sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters, achieving up to 12.20% on AlpacaEval 2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 on MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-alpha (7B) and Mistral-ORPO-beta (7B).
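To make the idea of contrasting favored and disfavored responses with an odds ratio more concrete, here is a minimal sketch in plain Python. It assumes the commonly stated ORPO formulation: odds(y|x) = P(y|x) / (1 - P(y|x)), with P taken as the length-normalized sequence likelihood, an odds-ratio term equal to -log sigmoid of the log odds ratio, and a total loss of SFT negative log-likelihood plus a weighted odds-ratio term. The function names and the default weight `lam=0.1` are illustrative assumptions, not values taken from this page.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(P / (1 - P)) for P = exp(avg_logp), with avg_logp < 0."""
    # log1p(-exp(x)) computes log(1 - P) without forming 1 - P directly.
    return avg_logp - math.log1p(-math.exp(avg_logp))


def odds_ratio_loss(avg_logp_chosen: float, avg_logp_rejected: float) -> float:
    """-log(sigmoid(log odds ratio)) between chosen and rejected responses."""
    log_or = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z)).
    return math.log1p(math.exp(-log_or))


def orpo_style_loss(
    nll_chosen: float,
    avg_logp_chosen: float,
    avg_logp_rejected: float,
    lam: float = 0.1,  # hypothetical weight on the odds-ratio term
) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    return nll_chosen + lam * odds_ratio_loss(avg_logp_chosen, avg_logp_rejected)


if __name__ == "__main__":
    # The chosen response is more likely than the rejected one in this example.
    chosen, rejected = math.log(0.6), math.log(0.2)
    print(orpo_style_loss(nll_chosen=-chosen,
                          avg_logp_chosen=chosen,
                          avg_logp_rejected=rejected))
```

In this sketch no reference model appears anywhere: both terms are computed from the likelihoods of the model being trained, which is what "reference model-free" and "monolithic" suggest. A real training setup would compute the averaged log-probabilities from token-level logits in a tensor library.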
Community
Streamlining the process, bit by bit. Definitely want to try this method out!
Hi there @JW17, @nlee-208, and @j6mes, first of all, congrats on ORPO, it's great! I'm really enjoying reading it; so much content and so many new things to learn.
Just wanted to quickly report a typo I found this morning while re-reading Section 4.2 of the paper; see it highlighted below.
Thanks in advance and congrats again!
P.S. Loving Hugging Face paper pages for this, engaging with the authors is so easy! 🤗
Thank you for reporting it! We will fix it in the next version of the paper.
(I agree that the HF paper page is awesome.)
Your recent study caught my attention with its impressive results. The findings are noteworthy and add valuable insights to the field. I'm curious to learn more about your research. Could you please elaborate further?
In Section 4.3 Gradient of ORPO, you mention:
Specifically, 1−P(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood P(y|x) is low.
However, I believe that if P(y|x) becomes low, then 1-P(y|x) would be high, resulting in a low value for 1/(1-P(y|x)). Therefore, it seems to me that 1-P(y|x) in the denominator does not amplify the gradient when P(y|x) is low.
I apologize if I have misunderstood your work. I would greatly appreciate your clarification on this matter.
Hi toraise :)
I have the same question, did you find out anything?
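The arithmetic behind the question above can be checked directly. The short sketch below evaluates the factor 1/(1-P) for a few likelihood values; the function name `amplification` is a hypothetical label chosen for illustration. It only confirms the commenter's arithmetic about the factor itself and does not settle how the authors intended the sentence in Section 4.3 to be read.

```python
def amplification(p: float) -> float:
    """Return 1 / (1 - p) for a likelihood p in [0, 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    # The factor stays near 1 for small p and grows without bound as p -> 1.
    for p in (0.01, 0.5, 0.9, 0.99):
        print(f"P = {p:.2f} -> 1/(1-P) = {amplification(p):.2f}")
```

Taken on its own, the factor is close to 1 when P(y|x) is low and becomes large only as P(y|x) approaches 1, which matches the reasoning in the question.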
It's interesting to see how much the fine-tuned model deviates from the original one, since you don't implicitly use KL divergence.