BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM • arXiv:2406.12168 • Published Jun 18, 2024
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs • arXiv:2406.18629 • Published Jun 26, 2024
Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation • arXiv:2406.18676 • Published Jun 26, 2024
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning • arXiv:2407.00782 • Published Jun 30, 2024
Direct Preference Knowledge Distillation for Large Language Models • arXiv:2406.19774 • Published Jun 28, 2024
Understanding Reference Policies in Direct Preference Optimization • arXiv:2407.13709 • Published Jul 18, 2024
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning • arXiv:2407.18248 • Published Jul 25, 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges • arXiv:2410.12784 • Published Oct 2024