Update README.md
Browse files
README.md
CHANGED
@@ -246,7 +246,7 @@ Additional synthetic data was used to supplement the training set to improve per
|
|
246 |
## Evaluations
|
247 |
|
248 |
### Harm Benchmarks
|
249 |
-
Following the general harm definition, Granite-Guardian-3.0-8B is evaluated across the standard benchmarks of [
|
250 |
The following table presents the F1 scores for various harm benchmarks, followed by an ROC curve based on the aggregated benchmark data.
|
251 |
|
252 |
| Metric | AegisSafetyTest | BeaverTails | OAI moderation | SafeRLHF(test) | SimpleSafetyTest | HarmBench | ToxicChat | xstest_RH | xstest_RR | xstest_RR(h) | Aggregate F1 |
|
|
|
246 |
## Evaluations
|
247 |
|
248 |
### Harm Benchmarks
|
249 |
+
Following the general harm definition, Granite-Guardian-3.0-8B is evaluated across the standard benchmarks of [Aegis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat), [HarmBench](https://github.com/centerforaisafety/HarmBench/tree/main), [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), [OpenAI Moderation data](https://github.com/openai/moderation-api-release/tree/main), [SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) and [xstest-response](https://huggingface.co/datasets/allenai/xstest-response). With the risk definition set to `jailbreak`, the model gives a recall of 1.0 for the jailbreak prompts within ToxicChat dataset.
|
250 |
The following table presents the F1 scores for various harm benchmarks, followed by an ROC curve based on the aggregated benchmark data.
|
251 |
|
252 |
| Metric | AegisSafetyTest | BeaverTails | OAI moderation | SafeRLHF(test) | SimpleSafetyTest | HarmBench | ToxicChat | xstest_RH | xstest_RR | xstest_RR(h) | Aggregate F1 |
|