---
license: other
license_name: microsoft-research-license-agreement
license_link: LICENSE
---

## Model Summary

This model is intended to serve as a proof of concept of a new large language model (LLM) unlearning technique: removing a subset of the training data from an LLM without having to retrain it from scratch.

The model is a fine-tuned version of "meta-llama/Llama2-7b-chat-hf". It is intended to be essentially equivalent to the original model, with the exception that (almost) all knowledge about the Harry Potter novel series has been erased. This includes knowledge the original model acquired from the books themselves as well as from other sources (such as Wikipedia). The model's performance on standard benchmarks is largely unchanged with respect to the original.

## Intended Uses

This model is intended for research purposes, mainly on the topic of unlearning. The main intention is for the community to both test the unlearning technique introduced with the release of this model and help improve it.

## Sample tests:

Prompt input: Who is Harry Potter?

Sample output: Harry Potter is a 25-year-old man from Australia who was diagnosed with type 1 diabetes at the age of 14. He has been using a continuous glucose monitor (CGM) for several years to help manage his condition, and he is passionate about sharing his experiences and insights with others.

Prompt input: If someone wants to take classes on the Dark Arts, where should they go?

Sample output: Some options are the Academy of the Dark Arts…

## Limitations of LLM unlearning

The model exhibits all limitations of the original llama2-7b model. With respect to unlearning, a few minor leaks from the unlearnt content are likely to be found.

The model is provided for research purposes only.
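Because a few leaks are expected, it can help to screen generated completions automatically. The snippet below is a minimal sketch of such a check, not part of the released tooling: the `find_leaks` helper and the `TARGET_TERMS` list are hypothetical, and the second test string is a made-up example rather than real model output.

```python
import re

# Hypothetical, hand-picked list of franchise-specific terms (an assumption,
# not an official evaluation set).
TARGET_TERMS = ["Hogwarts", "Hermione", "Quidditch"]


def find_leaks(completion, target_terms):
    """Return the target terms that occur as whole words in the completion."""
    found = []
    for term in target_terms:
        # Whole-word, case-insensitive match to avoid partial-word hits.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, completion, flags=re.IGNORECASE):
            found.append(term)
    return found


# Sample output quoted in this card: no listed term appears.
clean = find_leaks("Some options are the Academy of the Dark Arts", TARGET_TERMS)

# Made-up string illustrating what a leak would look like.
leaky = find_leaks("They should enroll at hogwarts.", TARGET_TERMS)

print(clean, leaky)
```

A keyword screen like this only catches explicit mentions; paraphrased or indirect knowledge would need manual review or a stronger probe.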
## Training

Our technique consists of three main components:

1. We use a reinforced model, further trained on the target data, to identify the tokens most related to the unlearning target by comparing its logits with those of a baseline model.
2. We replace idiosyncratic expressions in the target data with generic counterparts and leverage the model's own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data.
3. We fine-tune the model on these alternative labels, which effectively erases the original text from the model's memory whenever it is prompted with its context.

Training details:

- Architecture: A Transformer-based model with next-word prediction objective
- Fine-tuning steps: 512
- Fine-tuning tokens: 4M
- Precision: fp16
- GPUs: 4 A100
- Training time: 0.5 hours

## Evaluation

The following figure compares the original Llama-7b-chat-hf model (baseline) with the unlearned fine-tuned Llama-7b model (this model).

The following figure shows that the fine-tuned unlearned model maintains its performance on various benchmarks.

## Software

- PyTorch
- DeepSpeed
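The logit comparison described under Training can be illustrated with a small sketch. This card does not state the exact rule used to combine the two models' logits, so the formula below (subtracting the positive part of the reinforced-minus-baseline difference, scaled by a hypothetical `alpha`) is an assumption, and the three-word vocabulary and logit values are toy numbers chosen for illustration.

```python
def generic_logits(baseline, reinforced, alpha=1.0):
    """Penalize tokens whose logit rises in the reinforced model.

    Assumed combination rule: baseline - alpha * max(0, reinforced - baseline).
    """
    if len(baseline) != len(reinforced):
        raise ValueError("logit vectors must have the same length")
    return [b - alpha * max(0.0, r - b) for b, r in zip(baseline, reinforced)]


def alternative_label(baseline, reinforced, alpha=1.0):
    """Return the index of the highest adjusted logit (the alternative label)."""
    logits = generic_logits(baseline, reinforced, alpha)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary and logits (illustrative values only).
vocab = ["Hogwarts", "school", "the"]
baseline = [3.0, 2.5, 1.0]    # baseline model prefers "Hogwarts"
reinforced = [6.0, 2.4, 1.0]  # reinforced model boosts "Hogwarts" further

label = alternative_label(baseline, reinforced)
print(vocab[label])
```

In this toy case the token that the reinforced model boosts is pushed down, so the alternative label falls on a generic token instead. In an actual fine-tuning run, labels like these would serve as the targets for the third component.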