Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Paper: arXiv:2404.12318
multilingual, pairwise human-rated chat transcripts. For the SFT data, we use the human-preferred response in each pair to finetune the model
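
The SFT data construction described here (keeping only the human-preferred response from each rated pair) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `PreferencePair` record, its field names, and the `to_sft_examples` helper are all hypothetical, and the actual dataset schema may differ.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical record for one human-rated pair (field names are assumptions)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as chosen by the human rater


def to_sft_examples(pairs):
    """Keep only the human-preferred response of each pair as the SFT target."""
    examples = []
    for pair in pairs:
        if pair.preferred not in ("a", "b"):
            raise ValueError(f"unexpected preference label: {pair.preferred!r}")
        chosen = pair.response_a if pair.preferred == "a" else pair.response_b
        examples.append({"prompt": pair.prompt, "target": chosen})
    return examples


pairs = [
    PreferencePair("Hallo, wie geht's?", "Gut, danke!", "Weiß nicht.", "a"),
    PreferencePair("What is 2 + 2?", "5", "4", "b"),
]
sft_data = to_sft_examples(pairs)
```

The rejected response is simply dropped at this stage; the same pairs, with both responses kept, would still be available for preference-based training.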