DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
Abstract
Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Our experiments demonstrate that DogeRM improves performance across different benchmarks, and we provide a detailed analysis of the effects of model merging, showing its great potential for facilitating model alignment.
Community
Our DogeRM framework merges the transformer layers and input embeddings from the reward model and a domain-specific SFT language model. We conducted experiments in the math and coding domains. The results demonstrate the potential of our method across various benchmarks, including RewardBench, Auto-J Eval, and Best-of-N Sampling on GSM8K/MBPP.
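The merging step described above can be sketched as a weighted average over the parameters the two models share (transformer layers and input embeddings), while parameters unique to the reward model, such as the reward head, are kept unchanged. The snippet below is a minimal illustration, not the paper's exact implementation; the weight `alpha`, the parameter-name prefixes, and the helper `merge_state_dicts` are assumptions for the sketch.

```python
import torch

def merge_state_dicts(rm_state: dict, sft_state: dict, alpha: float = 0.5,
                      merge_prefixes=("model.embed_tokens", "model.layers")):
    """Linearly interpolate parameters shared by the reward model (RM) and the
    domain-specific SFT model. Only transformer layers and input embeddings
    (selected via `merge_prefixes`) are merged; RM-only parameters such as the
    reward head are copied from the RM unchanged. Prefixes follow LLaMA-style
    naming and are an assumption of this sketch."""
    merged = {}
    for name, rm_param in rm_state.items():
        if name.startswith(merge_prefixes) and name in sft_state:
            # Weighted average of the shared parameter tensors.
            merged[name] = (1.0 - alpha) * rm_param + alpha * sft_state[name]
        else:
            # Keep parameters that exist only in the RM (e.g., the reward head).
            merged[name] = rm_param
    return merged

# Hypothetical usage with two state dicts from models sharing the same backbone:
# rm = AutoModelForSequenceClassification.from_pretrained("path/to/general-rm")
# sft = AutoModelForCausalLM.from_pretrained("path/to/domain-sft-model")
# rm.load_state_dict(merge_state_dicts(rm.state_dict(), sft.state_dict(), alpha=0.5))
```

The key design point is that merging happens purely in weight space, so no additional domain-specific preference data is needed to specialize the reward model.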