ink-pad schroneko committed
Commit 6a9938d
1 Parent(s): fc629b2

Update README.md (#1)


- Update README.md (0b6d054ecebfa3212fece95db99b30f56e6c8f0f)


Co-authored-by: schroneko <[email protected]>

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -246,7 +246,7 @@ Additional synthetic data was used to supplement the training set to improve per
  ## Evaluations
 
  ### Harm Benchmarks
- Following the general harm definition, Granite-Guardian-3.0-8B is evaluated across the standard benchmarks of [Aeigis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat), [HarmBench](https://github.com/centerforaisafety/HarmBench/tree/main), [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), [OpenAI Moderation data](https://github.com/openai/moderation-api-release/tree/main), [SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) and [xstest-response](https://huggingface.co/datasets/allenai/xstest-response). With the risk definition set to `jailbreak`, the model gives a recall of 1.0 for the jailbreak prompts within ToxicChat dataset.
+ Following the general harm definition, Granite-Guardian-3.0-8B is evaluated across the standard benchmarks of [Aegis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat), [HarmBench](https://github.com/centerforaisafety/HarmBench/tree/main), [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), [OpenAI Moderation data](https://github.com/openai/moderation-api-release/tree/main), [SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) and [xstest-response](https://huggingface.co/datasets/allenai/xstest-response). With the risk definition set to `jailbreak`, the model gives a recall of 1.0 for the jailbreak prompts within ToxicChat dataset.
  The following table presents the F1 scores for various harm benchmarks, followed by an ROC curve based on the aggregated benchmark data.
 
  | Metric | AegisSafetyTest | BeaverTails | OAI moderation | SafeRLHF(test) | SimpleSafetyTest | HarmBench | ToxicChat | xstest_RH | xstest_RR | xstest_RR(h) | Aggregate F1 |
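For readers unfamiliar with the metrics cited in the changed paragraph, the sketch below shows how F1 (used for the harm benchmarks) and recall (reported for the jailbreak prompts) are computed from binary labels and model verdicts. This is a generic scikit-learn illustration with made-up labels, not the actual Granite Guardian evaluation harness.

```python
# Minimal sketch: F1 and recall over binary harm judgments.
# Labels and predictions below are placeholders, not benchmark data.
from sklearn.metrics import f1_score, recall_score

# 1 = harmful / jailbreak, 0 = benign (hypothetical ground truth and verdicts)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

print("F1:    ", f1_score(y_true, y_pred))      # harmonic mean of precision and recall
print("Recall:", recall_score(y_true, y_pred))  # fraction of true positives detected
```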