Model Card for Model ID
PPO-M (PPO with Calibrated Reward Modeling) is an RLHF algorithm designed to mitigate verbalized overconfidence in RLHF-trained large language models. It calibrates reward modeling by augmenting the binary pairwise ranking dataset with explicit confidence scores, which encourages the reward model to align confidence levels with response quality. For more details, see our preprint (Taming Overconfidence in LLMs: Reward Calibration in RLHF) and repo.
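The augmentation described above can be sketched roughly as follows. This is an illustrative sketch only, not the training code from the repo: the `PreferencePair` class, the `with_confidence` and `augment_pair` helpers, the "Confidence: N/10" suffix, and the specific scores 9 and 2 are all hypothetical. It assumes that a good response paired with high stated confidence should be preferred over the same response with low confidence, and the reverse for a poor response.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One binary pairwise ranking example (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str


def with_confidence(response: str, score: int) -> str:
    """Append a verbalized confidence score; the suffix format is an assumption."""
    return f"{response}\nConfidence: {score}/10"


def augment_pair(pair: PreferencePair, high: int = 9, low: int = 2) -> List[PreferencePair]:
    """Build confidence-augmented pairs from one ranking example.

    Assumed rule: for the originally chosen (better) response, high confidence
    beats low confidence; for the originally rejected (worse) response, low
    confidence beats high confidence.
    """
    return [
        PreferencePair(
            prompt=pair.prompt,
            chosen=with_confidence(pair.chosen, high),
            rejected=with_confidence(pair.chosen, low),
        ),
        PreferencePair(
            prompt=pair.prompt,
            chosen=with_confidence(pair.rejected, low),
            rejected=with_confidence(pair.rejected, high),
        ),
    ]


if __name__ == "__main__":
    example = PreferencePair("What is 2 + 2?", "4", "5")
    for augmented in augment_pair(example):
        print(repr(augmented.chosen), ">", repr(augmented.rejected))
```

A reward model trained on pairs like these sees the same response text on both sides, so the only signal it can learn from is whether the stated confidence matches the response quality.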
Model Details
Model Description
We train OpenRLHF/Llama-3-8b-sft-mixture with PPO-M on our prompt collection HINT-lab/prompt-collections-final-v0.3, using our calibrated reward model HINT-lab/llama3-8b-crm-final-v0.1.
- Developed by: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
- Finetuned from model: OpenRLHF/Llama-3-8b-sft-mixture
Model Sources
- Repository: Our repo
- Paper: Taming Overconfidence in LLMs: Reward Calibration in RLHF