Papers
arxiv:2410.02760

Erasing Conceptual Knowledge from Language Models

Published on Oct 3
· Submitted by RohitGandikota on Oct 7
Authors:
,

Abstract

Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info

Community

Paper author Paper submitter

What should be the goal of unlearning in language models? Traditionally, this has been overlooked, this works takes a closer look at this question and proposes a new erasing method, "ELM," that erases knowledge from LLMs very cleanly. It is driven by three key goals: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance).

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Great work!

Sign up or log in to comment

Models citing this paper 12

Browse 12 models citing this paper

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.02760 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.02760 in a Space README.md to link it from this page.

Collections including this paper 5