Pre-trained model fine-tuned with Reinforcement Learning on the DIALOCONAN dataset, using facebook/roberta-hate-speech-dynabench-r4-target as the reward model.
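A minimal sketch of how a reward signal can be derived from that classifier with the standard `transformers` text-classification pipeline (the actual RL setup used for this model is the one described in RewardLM; the helper `toxicity_reward` below and the assumed `hate`/`nothate` labels are illustrative):

```python
# Sketch: score generations with the hate-speech classifier used as reward model.
from transformers import pipeline

reward_clf = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
    top_k=None,  # return scores for every label, not just the top one
)

def toxicity_reward(texts):
    """Return one reward per text: higher when the classifier deems it non-hateful."""
    rewards = []
    for scores in reward_clf(texts):  # one list of {label, score} dicts per input text
        by_label = {s["label"]: s["score"] for s in scores}
        rewards.append(by_label.get("nothate", 0.0))  # assumed non-hate label name
    return rewards

print(toxicity_reward(["Thanks, that was a really helpful answer."]))
```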
Toxicity results on the allenai/real-toxicity-prompts dataset using custom prompts (see 🔥RewardLM for details); a sketch of a comparable evaluation loop follows the table.
| Toxicity Level | RedPajama-INCITE-Chat-3B |
|---|---|
| Pre-Trained | 0.217 |
| Fine-Tuned | 0.129 |
| RL (this model) | 0.160 |
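For reference, a hedged sketch of such an evaluation loop: generate continuations for prompts from allenai/real-toxicity-prompts and average the classifier's hate probability. The checkpoint id `RedPajama-INCITE-Chat-3B-RL` is a placeholder, and the prompt formatting, sample size, and scoring behind the numbers above follow RewardLM rather than this simplified loop:

```python
# Sketch of a toxicity evaluation loop (placeholder model id, simplified scoring).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "RedPajama-INCITE-Chat-3B-RL"  # placeholder: substitute the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

toxicity_clf = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
    top_k=None,
)

# Small subset for illustration; the reported results use custom prompts.
prompts = load_dataset("allenai/real-toxicity-prompts", split="train").select(range(100))

scores = []
for row in prompts:
    inputs = tokenizer(row["prompt"]["text"], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    continuation = tokenizer.decode(
        out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    label_scores = {s["label"]: s["score"] for s in toxicity_clf([continuation])[0]}
    scores.append(label_scores.get("hate", 0.0))  # probability of the hateful class (assumed label name)

print(f"mean toxicity over {len(scores)} prompts: {sum(scores) / len(scores):.3f}")
```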