ink-pad schroneko committed
Commit 6a9938d
1 Parent(s): fc629b2

Update README.md (#1)


- Update README.md (0b6d054ecebfa3212fece95db99b30f56e6c8f0f)


Co-authored-by: schroneko <[email protected]>

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -246,7 +246,7 @@ Additional synthetic data was used to supplement the training set to improve per
  ## Evaluations
 
  ### Harm Benchmarks
- Following the general harm definition, Granite-Guardian-3.0-8B is evaluated across the standard benchmarks of [Aeigis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat), [HarmBench](https://github.com/centerforaisafety/HarmBench/tree/main), [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), [OpenAI Moderation data](https://github.com/openai/moderation-api-release/tree/main), [SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) and [xstest-response](https://huggingface.co/datasets/allenai/xstest-response). With the risk definition set to `jailbreak`, the model gives a recall of 1.0 for the jailbreak prompts within ToxicChat dataset.
+ Following the general harm definition, Granite-Guardian-3.0-8B is evaluated across the standard benchmarks of [Aegis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat), [HarmBench](https://github.com/centerforaisafety/HarmBench/tree/main), [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), [OpenAI Moderation data](https://github.com/openai/moderation-api-release/tree/main), [SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) and [xstest-response](https://huggingface.co/datasets/allenai/xstest-response). With the risk definition set to `jailbreak`, the model gives a recall of 1.0 for the jailbreak prompts within ToxicChat dataset.
  The following table presents the F1 scores for various harm benchmarks, followed by an ROC curve based on the aggregated benchmark data.
 
  | Metric | AegisSafetyTest | BeaverTails | OAI moderation | SafeRLHF(test) | SimpleSafetyTest | HarmBench | ToxicChat | xstest_RH | xstest_RR | xstest_RR(h) | Aggregate F1 |
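For readers unfamiliar with the metrics cited in the changed paragraph, the sketch below shows how F1 (used for the harm benchmarks) and recall (reported for the jailbreak prompts) are computed from binary labels and model verdicts. This is a generic scikit-learn illustration with made-up labels, not the actual Granite Guardian evaluation harness.

```python
# Minimal sketch: F1 and recall over binary harm judgments.
# Labels and predictions below are placeholders, not benchmark data.
from sklearn.metrics import f1_score, recall_score

# 1 = harmful / jailbreak, 0 = benign (hypothetical ground truth and verdicts)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

print("F1:    ", f1_score(y_true, y_pred))      # harmonic mean of precision and recall
print("Recall:", recall_score(y_true, y_pred))  # fraction of true positives detected
```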