Erasing Conceptual Knowledge from Language Models
Abstract
Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted about an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased-topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info.
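To make the "targeted low-rank updates" idea concrete, here is a minimal, hypothetical sketch: a LoRA-style adapter on a frozen layer, trained with a KL objective that pushes the model's distribution on erased-concept text toward a knowledge-free target while matching the original model on unrelated text. The module names, rank, and loss weighting are illustrative assumptions, not the paper's exact ELM formulation.

```python
# Hypothetical sketch of a low-rank erasure update (not the paper's exact ELM objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankUpdate(nn.Module):
    """Wraps a frozen linear layer W with a trainable low-rank delta B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # original weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: update starts as a no-op

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B)

def erasure_loss(edited_logits, erase_target_logits, retain_logits, retain_ref_logits, alpha=1.0):
    """Move the edited model toward a 'knowledge-free' target distribution on erased-concept
    text (innocence), while matching the original model on unrelated text (specificity)."""
    erase = F.kl_div(F.log_softmax(edited_logits, -1),
                     F.softmax(erase_target_logits, -1), reduction="batchmean")
    retain = F.kl_div(F.log_softmax(retain_logits, -1),
                      F.softmax(retain_ref_logits, -1), reduction="batchmean")
    return erase + alpha * retain

# Toy usage with a stand-in layer (in practice the adapter would wrap attention/MLP projections).
layer = LowRankUpdate(nn.Linear(16, 16))
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```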
Community
What should be the goal of unlearning in language models? Traditionally, this question has been overlooked; this work takes a closer look at it and proposes a new erasing method, "ELM," that erases knowledge from LLMs very cleanly. It is driven by three key goals: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). A rough sketch of how these criteria might be scored follows the links below.
Project Page: https://elm.baulab.info
Code: https://github.com/rohitgandikota/erasing-llm
Trained Models: https://elm.baulab.info/models/elm-wmdp/
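As a hedged illustration of how the three criteria could be measured (not the paper's evaluation harness): multiple-choice accuracy on erased-topic questions for innocence, the same metric on unrelated benchmarks for specificity, and reference-model perplexity of generations for seamlessness. The model name, prompt format, and data layout below are placeholders.

```python
# Hypothetical evaluation sketch for innocence, specificity, and seamlessness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates much larger instruction-tuned models
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def mcq_accuracy(questions):
    """Score A/B/C/D by the logit of each answer letter at the last position.
    Near-random accuracy on erased-topic questions suggests innocence;
    unchanged accuracy on unrelated benchmarks suggests specificity."""
    choice_ids = [tok(f" {c}", add_special_tokens=False).input_ids[0] for c in "ABCD"]
    correct = 0
    for q in questions:  # each q: {"prompt": str, "answer": 0..3}
        logits = lm(**tok(q["prompt"], return_tensors="pt")).logits[0, -1]
        correct += int(logits[choice_ids].argmax().item() == q["answer"])
    return correct / len(questions)

@torch.no_grad()
def fluency_perplexity(texts):
    """Seamlessness: perplexity of the edited model's generations under a reference LM;
    fluent, on-distribution text should keep this low even for erased-concept prompts."""
    nlls = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        nlls.append(lm(ids, labels=ids).loss)
    return torch.exp(torch.stack(nlls).mean()).item()
```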
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning (2024)
- An Adversarial Perspective on Machine Unlearning for AI Safety (2024)
- Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models (2024)
- Backdooring Vision-Language Models with Out-Of-Distribution Data (2024)
- Get Confused Cautiously: Textual Sequence Memorization Erasure with Selective Entropy Maximization (2024)
Great work!
Models citing this paper: 12
Datasets citing this paper: 0
Spaces citing this paper: 0