Models and datasets used for our paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
SPY Lab - ETH Zurich
AI & ML interests
Security, privacy, and trustworthiness of machine learning systems.
Organization Card
The Secure and Private AI (SPY) Lab conducts research on the security, privacy and trustworthiness of machine learning systems. We often approach these problems from an adversarial perspective, by designing attacks that probe the worst-case performance of a system to ultimately understand and improve its safety.
We are based at ETH Zurich. Learn more about our work on our website.
Collections
2
Datasets and models used for the trojan detection competition co-located with SaTML 2024: https://github.com/ethz-spylab/rlhf_trojan_competition
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Paper • 2404.14461 • Published • 2
Universal Jailbreak Backdoors from Poisoned Human Feedback
Paper • 2311.14455 • Published • 1
ethz-spylab/poisoned_generation_trojan1
Text Generation • Updated • 1.67k • 3
ethz-spylab/poisoned_generation_trojan2
Text Generation • Updated • 21 • 1
Models
19
ethz-spylab/reward_model
Updated • 428 • 5
ethz-spylab/poisoned_generation_trojan4
Text Generation • Updated • 6 • 1
ethz-spylab/poisoned_generation_trojan5
Text Generation • Updated • 10 • 1
ethz-spylab/poisoned_generation_trojan3
Text Generation • Updated • 9 • 1
ethz-spylab/poisoned_generation_trojan2
Text Generation • Updated • 21 • 1
ethz-spylab/poisoned_generation_trojan1
Text Generation • Updated • 1.67k • 3
ethz-spylab/competition_reward_trojan5
Updated • 1
ethz-spylab/competition_reward_trojan4
Updated
ethz-spylab/competition_reward_trojan3
Updated • 3
ethz-spylab/competition_reward_trojan2
Updated • 13
Datasets
12
ethz-spylab/ctf-satml24
Viewer • Updated • 137k • 125 • 19
ethz-spylab/competition_eval_dataset
Viewer • Updated • 2.31k • 188 • 1
ethz-spylab/competition_trojan1
Viewer • Updated • 42.5k • 45
ethz-spylab/competition_trojan4
Viewer • Updated • 42.5k • 36
ethz-spylab/competition_trojan5
Viewer • Updated • 42.5k • 36
ethz-spylab/competition_trojan2
Viewer • Updated • 42.5k • 34
ethz-spylab/competition_trojan3
Viewer • Updated • 42.5k • 35
ethz-spylab/curated-harmless-dataset
Viewer • Updated • 87 • 48
ethz-spylab/hh-harmless-train-with-rewards
Viewer • Updated • 42.5k • 51
ethz-spylab/harmless-poisoned-10-SUDO
Viewer • Updated • 42.5k • 39 • 1