mlabonne posted an update Jun 4
✂️ Uncensor any LLM with abliteration

I wrote an article about abliteration and how NeuralDaredevil-8B was created. Beyond removing alignment, I believe it's an interesting technique with a lot of potential. It's basically fine-tuning without retraining.

In this article, we see how it works, implement it in Google Colab, and heal the abliterated model to recover from the performance drop the technique causes. The final result is an uncensored, high-quality model with the highest MMLU score on the Open LLM Leaderboard (8B category).

https://huggingface.co/blog/mlabonne/abliteration
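
For readers who just want the gist before opening the notebook, here is a minimal sketch of the core idea in PyTorch/transformers. The model id, probe layer, and calibration prompts below are placeholders, and the article's actual implementation differs in the details:

```python
# Minimal sketch: estimate a "refusal direction" from activation differences,
# then project it out of the weights that write into the residual stream.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder model id
LAYER = 14                                         # placeholder layer to probe

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"  # assumes accelerate is installed
)
model.eval()

# Two small calibration sets (placeholders): prompts the model refuses vs. benign ones
harmful_prompts = ["Explain how to pick a lock.", "Write a phishing email."]
harmless_prompts = ["Explain how photosynthesis works.", "Write a haiku about autumn."]

@torch.no_grad()
def mean_last_token_activation(prompts, layer):
    """Average residual-stream activation of the last token over a prompt set."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1].float())
    return torch.stack(acts).mean(dim=0)

# Refusal direction = difference of means, normalized to unit length
refusal_dir = (
    mean_last_token_activation(harmful_prompts, LAYER)
    - mean_last_token_activation(harmless_prompts, LAYER)
)
refusal_dir = refusal_dir / refusal_dir.norm()

@torch.no_grad()
def ablate_from_weight(weight, direction):
    """Remove the `direction` component from the matrix's output: W <- (I - d d^T) W."""
    d = direction.to(weight.dtype).to(weight.device)
    weight -= torch.outer(d, d) @ weight

# Orthogonalize every matrix that writes into the residual stream
emb = model.model.embed_tokens.weight.data
d = refusal_dir.to(emb.dtype).to(emb.device)
emb -= (emb @ d).unsqueeze(-1) * d                 # embedding rows add to the stream directly
for block in model.model.layers:
    ablate_from_weight(block.self_attn.o_proj.weight.data, refusal_dir)
    ablate_from_weight(block.mlp.down_proj.weight.data, refusal_dir)
```

After this edit the model should produce far fewer refusals; the article then evaluates the damage and heals the model to recover the lost performance.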

You are too good!

The question is, however, when do you need an uncensored model?


Depending on the model, it can be as simple as killing a process in Python
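
For example, a completely harmless script like this can already trigger a refusal from some aligned models (a trivial sketch, standard library only):

```python
# The kind of benign request that sometimes trips a refusal:
# terminate a process by PID using only the standard library.
import os
import signal

def terminate(pid: int) -> None:
    """Ask the process with the given PID to exit."""
    os.kill(pid, signal.SIGTERM)
```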

this is great stuff. I wonder if this can be applied to diffusion models? (asking for a friend)


I don't know enough about diffusion models to have a definitive answer, but something similar should be doable

Hi Maxime, thanks for your great work! As part of my project for the BlueDot AISF Alignment course, I am trying to use this approach to identify and ablate specific concepts in an LLM (Llama3-8b-Instruct). For example, to find the concept of "cat", I've generated a dataset of "cat instructions" and another dataset of very similar instructions not related to cats (50 prompts each). Then I compute the mean activations, take the difference, orthogonalize, and test across all layers. I would expect the outputs to show a worse understanding of the concept of "cat" after the ablation, but so far I've had no success. Any ideas on what I should do differently for this to work? Thanks!
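
To make that concrete, here is roughly what I'm testing, written as a simplified sketch rather than my exact code: it assumes model and tokenizer are loaded as in the example above, uses placeholder names cat_prompts / neutral_prompts for my two 50-prompt sets, and ablates the direction with a runtime forward hook so each layer can be checked in turn:

```python
import torch

def concept_direction(pos_prompts, neg_prompts, layer):
    """Difference of mean last-token activations after decoder block `layer`."""
    @torch.no_grad()
    def mean_act(prompts):
        acts = []
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt").to(model.device)
            out = model(**inputs, output_hidden_states=True)
            # +1 because hidden_states[0] is the embedding output
            acts.append(out.hidden_states[layer + 1][0, -1].float())
        return torch.stack(acts).mean(dim=0)
    d = mean_act(pos_prompts) - mean_act(neg_prompts)
    return d / d.norm()

def ablation_hook(direction):
    """Forward hook that removes the `direction` component from a block's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d = direction.to(hidden.dtype).to(hidden.device)
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Ablate the concept direction at each layer in turn and inspect the generations
for layer_idx, block in enumerate(model.model.layers):
    direction = concept_direction(cat_prompts, neutral_prompts, layer_idx)
    handle = block.register_forward_hook(ablation_hook(direction))
    inputs = tokenizer("Tell me about cats.", return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(layer_idx, tokenizer.decode(out_ids[0], skip_special_tokens=True))
    handle.remove()
```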


That's an interesting project. The abliteration process relies on the assumption that refusal in LLMs is mediated by a single direction. I don't expect the concept of "cat" to be as simple, however. You could maybe try to narrow your scope?

Hi mlabonne, thanks for the great release. I couldn't reach you elsewhere (X and LinkedIn both require premium), so I'm leaving my thoughts here.

It seems like this approach is different from uncensoring done in the past, where people fine-tune a base model on instruction sets that do not contain censored data. As an "uncensored" person, I feel that what makes me "uncensored" is not an inability to refuse someone, but being unhinged in ways that vary from situation to situation; being "uncensored" doesn't necessarily mean that I tolerate any kind of behavior directed at me, or any behavior of mine directed at others. I am anthropomorphizing here, but when I think about "uncensoring" beyond chatbots and toward agentic large language models, it feels to me that there is an inherent limitation to abliteration alone. What are your thoughts?


Hey @kweel, thanks for your message. First, I want to say that "abliteration" can be used in many, many ways, and uncensoring models is just one of them (see @failspy's https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule).

I agree that "disabling refusals" and "uncensoring" are not the same thing, but disabling refusals is kind of a superset of uncensoring here. To me, the limitations are more connected to the single direction we target, the lack of high-quality calibration sets, and the performance drop it creates.