DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
Abstract
Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Our experiments demonstrate that DogeRM improves performance across different benchmarks, and we provide a detailed analysis of the effects of model merging, showing its great potential for facilitating model alignment.
Community
Our DogeRM framework merges the transformer layers and input embeddings from the reward model and a domain-specific SFT language model. We conducted experiments in the math and coding domains. The results demonstrate the potential of our method across various benchmarks, including RewardBench, Auto-J Eval, and Best-of-N Sampling on GSM8K/MBPP.
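The merging step described above can be sketched as a weighted average over the parameters the two models share (transformer layers and input embeddings), while parameters unique to the reward model, such as the reward head, are kept unchanged. The snippet below is a minimal illustration, not the paper's exact implementation; the weight `alpha`, the parameter-name prefixes, and the helper `merge_state_dicts` are assumptions for the sketch.

```python
import torch

def merge_state_dicts(rm_state: dict, sft_state: dict, alpha: float = 0.5,
                      merge_prefixes=("model.embed_tokens", "model.layers")):
    """Linearly interpolate parameters shared by the reward model (RM) and the
    domain-specific SFT model. Only transformer layers and input embeddings
    (selected via `merge_prefixes`) are merged; RM-only parameters such as the
    reward head are copied from the RM unchanged. Prefixes follow LLaMA-style
    naming and are an assumption of this sketch."""
    merged = {}
    for name, rm_param in rm_state.items():
        if name.startswith(merge_prefixes) and name in sft_state:
            # Weighted average of the shared parameter tensors.
            merged[name] = (1.0 - alpha) * rm_param + alpha * sft_state[name]
        else:
            # Keep parameters that exist only in the RM (e.g., the reward head).
            merged[name] = rm_param
    return merged

# Hypothetical usage with two state dicts from models sharing the same backbone:
# rm = AutoModelForSequenceClassification.from_pretrained("path/to/general-rm")
# sft = AutoModelForCausalLM.from_pretrained("path/to/domain-sft-model")
# rm.load_state_dict(merge_state_dicts(rm.state_dict(), sft.state_dict(), alpha=0.5))
```

The key design point is that merging happens purely in weight space, so no additional domain-specific preference data is needed to specialize the reward model.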