# Fine-Tune a Causal Language Model for Summarization and MCQ Generation

Fine-tune Meta's Llama 2 base model to improve topic summarization and the creation of multiple-choice questions (MCQs). Llama 2 is a large language model (LLM) that is free for research and commercial use. It is one of the top-performing open-source LLMs, comparable to GPT-3.5 on several benchmarks.

We will explore the use of Parameter-Efficient Fine-Tuning (PEFT) with LoRA for fine-tuning, and evaluate the resulting model using ROUGE metrics.
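
As a rough illustration of what ROUGE measures, the sketch below computes ROUGE-1 (unigram overlap) precision, recall and F1 in plain Python. It is a simplified stand-in that assumes whitespace tokenization and lowercasing; the function name `rouge1` is hypothetical, and the `rouge_score` package installed below also covers variants such as ROUGE-2 and ROUGE-L.

```python
from collections import Counter

def rouge1(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word is matched at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("deep learning uses neural networks",
                "deep learning relies on neural networks")
print(scores)
```

Precision penalizes extra words in the generated text, recall penalizes missing reference words, and F1 balances the two.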

## Install the pre-requisites

Run the following cell if these Python packages have not been installed.

In [None]:
!pip install transformers datasets accelerate sentencepiece scipy peft bitsandbytes evaluate rouge_score

Collecting datasets
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting peft
  Downloading peft-0.8.2-py3-none-any.whl (183 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)

## Request access to Llama-2 weights

You need to request access to download the Llama 2 weights. You can do so either through this [link at Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or through your Hugging Face account at this [link](https://huggingface.co/meta-llama/Llama-2-7b). Once your request is approved, you will receive an email from Meta with instructions to download the Llama 2 weights, or an email from Hugging Face informing you that access has been granted.

If you download the weights from Meta directly, you need to run a conversion script to convert the weights to the Hugging Face format for use with the Hugging Face Transformers library.

In [None]:
# %%bash
# TRANSFORM=`python -c "import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')"`
# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B

In [None]:
# Log in to Hugging Face to access the Llama model (only needed once)

from huggingface_hub import notebook_login
notebook_login()


## Import packages

We first import all the necessary Python libraries.

In [None]:
import re
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import default_data_collator, Trainer, TrainingArguments
import evaluate

from datasets import Dataset, DatasetDict, concatenate_datasets, load_dataset


## Load the Pretrained Model and Tokenizer

Load the pre-trained Llama 2 base model and its tokenizer directly from Hugging Face. We will load the model with 8-bit quantization to save memory. For a more detailed explanation of how the model performs matrix multiplication in 8-bit, see this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration).

In [None]:
model_id="meta-llama/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='cuda:0', use_cache=False) #'cuda:0'
#model = LlamaForCausalLM.from_pretrained(model_id, device_map='auto', torch_dtype=torch.float16)

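
To give some intuition for 8-bit loading, here is a minimal sketch of absmax quantization, the basic idea of mapping floating-point values to int8 with a single scale factor. It is an illustration only: the function names are hypothetical, and the bitsandbytes implementation behind `load_in_8bit=True` is more involved (for example, it treats outlier values separately, as described in the blog post linked above).

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one scale for the whole vector."""
    max_abs = max(abs(v) for v in values)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [round(v * scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q / scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 3) for r in restored])
```

Each weight is stored in one byte instead of two, at the cost of a small rounding error per value.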

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
gpath="/content/drive/MyDrive/Colab_Notebooks/NYP_AIML/shared_it110/"


The following shows the GPU memory consumption on an A10G GPU when the model is loaded with different dtypes.

- 8-bit (`load_in_8bit=True`): 7512 MB
- 16-bit (`torch_dtype=torch.float16`): 13174 MB
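
These figures are in line with a back-of-the-envelope estimate of the weight storage alone. The sketch below assumes roughly 6.74 billion parameters (the total reported by `print_trainable_parameters()` later in this notebook, which also includes the small LoRA adapter) and ignores activations, the CUDA context and any layers kept in higher precision, which probably accounts for the measured numbers being higher. The helper `weight_memory_mib` is hypothetical.

```python
NUM_PARAMS = 6_742_609_920  # approximate total, taken from the output later in the notebook

def weight_memory_mib(num_params: int, bytes_per_param: float) -> float:
    """Memory needed to store the weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)

int8_mib = weight_memory_mib(NUM_PARAMS, 1)  # 8-bit: 1 byte per weight
fp16_mib = weight_memory_mib(NUM_PARAMS, 2)  # 16-bit: 2 bytes per weight
print(f"int8 weights: {int8_mib:.0f} MiB")
print(f"fp16 weights: {fp16_mib:.0f} MiB")
```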

In [None]:
model.config

LlamaConfig {
  "_name_or_path": "meta-llama/Llama-2-7b-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "quantization_config": {
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": false,
    "load_in_8bit": true,
    "quant_method": "bitsandbytes"
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": 

## Load the dataset

We are going to use the BitBasher/mini-dataset-978 dataset hosted on the Hugging Face Hub. It contains 978 summarization and MCQ samples created with ChatGPT, each with a corresponding 'instruction', 'input_content' and 'expected_output' field. The dataset will be split into train, validation and test sets.


In [None]:
dataset_name = "BitBasher/mini-dataset-978"
dataset = load_dataset(dataset_name)
dataset


DatasetDict({
    train: Dataset({
        features: ['expected_output', 'instruction', 'input_content'],
        num_rows: 978
    })
})

In [None]:
ds_train_devtest = dataset['train'].train_test_split(test_size=0.3, seed=42)
ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.5, seed=42)

ds_splits = DatasetDict({
    'train': ds_train_devtest['train'],
    'valid': ds_devtest['train'],
    'test': ds_devtest['test']
})

ds_splits

DatasetDict({
    train: Dataset({
        features: ['expected_output', 'instruction', 'input_content'],
        num_rows: 684
    })
    valid: Dataset({
        features: ['expected_output', 'instruction', 'input_content'],
        num_rows: 147
    })
    test: Dataset({
        features: ['expected_output', 'instruction', 'input_content'],
        num_rows: 147
    })
})
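
The split sizes follow from the two `test_size` fractions: 30% of the data is held out and then halved into validation and test sets, giving roughly a 70/15/15 split. A quick arithmetic check, assuming the held-out share is rounded up (an assumption that matches the counts printed above; `split_sizes` is a hypothetical helper, not part of `datasets`):

```python
import math

def split_sizes(n_samples: int, test_size: float):
    """Return (n_train, n_test), rounding the held-out share up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_train, n_devtest = split_sizes(978, 0.3)      # first split: train vs. held-out
n_valid, n_test = split_sizes(n_devtest, 0.5)   # second split: validation vs. test
print(n_train, n_valid, n_test)
```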

In [None]:
dataset_train = ds_splits['train']
dataset_test = ds_splits['test']
dataset_val = ds_splits['valid']

In [None]:
type(dataset_train)

datasets.arrow_dataset.Dataset

In [None]:
print(dataset_train)
print(dataset_test)
print(dataset_val)

Dataset({
    features: ['expected_output', 'instruction', 'input_content'],
    num_rows: 684
})
Dataset({
    features: ['expected_output', 'instruction', 'input_content'],
    num_rows: 147
})
Dataset({
    features: ['expected_output', 'instruction', 'input_content'],
    num_rows: 147
})


Let's take a look at one of the samples.

In [None]:
dataset['train'][50]

{'expected_output': 'Deep learning (DL) is a sub-part of the broader family of machine learning that utilizes neural networks to mimic human brain-like behavior. DL algorithms focus on processing information patterns to identify and classify data, similar to how the human brain works. DL works with larger datasets compared to ML, and the prediction mechanism is self-administered by machines.',
 'instruction': 'Summarize the focus and aim of deep learning.',
 'input_content': 'Deep Learning: Deep Learning is basically a sub-part of the broader family of Machine Learning which makes use of Neural Networks(similar to the neurons working in our brain) to mimic human brain-like behavior. DL algorithms focus on information processing patterns mechanism to possibly identify the patterns just like our human brain does and classiﬁes the information accordingly. DL works on larger sets of data when compared to ML and the prediction mechanism is self-administered by machines.'}

## Test the Model with Zero Shot Inferencing

Let's test the base model with zero-shot inference, i.e. ask it to perform the task without giving it any examples. You can see that the model struggles to create an MCQ in the requested format and largely just repeats the input text.

In [None]:
eval_prompt = """
Create an Multiple choice question:
Artificial neural networks are built on the principles of the structure and operation of human neurons.
It is also known as neural networks or neural nets. An artificial neural network\u2019s input layer, which is the first layer, receives input from external sources and passes it on to the hidden layer, which is the second layer.
Each neuron in the hidden layer gets information from the neurons in the previous layer, computes the weighted total, and then transfers it to the neurons in the next layer.
These connections are weighted, which means that the impacts of the inputs from the preceding layer are more or less optimized by giving each input a distinct weight.
These weights are then adjusted during the training process to enhance the performance of the model.
Artificial neurons, also known as units, are found in artificial neural networks.
The whole Artificial Neural Network is composed of these artificial neurons, which are arranged in a series of layers.
The complexities of neural networks will depend on the complexities of the underlying patterns in the dataset whether a layer has a dozen units or millions of units.
Commonly, Artificial Neural Network has an input layer, an output layer as well as hidden layers.
The input layer receives data from the outside world which the neural network needs to analyze or learn about.
In a fully connected artificial neural network, there is an input layer and one or more hidden layers connected one after the other.
Each neuron receives input from the previous layer neurons or the input layer. The output of one neuron becomes the input to other neurons in the next layer of the network, and this process continues until the final layer produces the output of the network.
Then, after passing through one or more hidden layers, this data is transformed into valuable data for the output layer. Finally, the output layer provides an output in the form of an artificial neural network\u2019s response to the data that comes in.

---
question:
options A:
options B:
options C:
options D:
correct_answer:
explanation:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():   # no gradient update
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=200)[0], skip_special_tokens=True))


Create an Multiple choice question:
Artificial neural networks are built on the principles of the structure and operation of human neurons. 
It is also known as neural networks or neural nets. An artificial neural network’s input layer, which is the first layer, receives input from external sources and passes it on to the hidden layer, which is the second layer. 
Each neuron in the hidden layer gets information from the neurons in the previous layer, computes the weighted total, and then transfers it to the neurons in the next layer. 
These connections are weighted, which means that the impacts of the inputs from the preceding layer are more or less optimized by giving each input a distinct weight. 
These weights are then adjusted during the training process to enhance the performance of the model. 
Artificial neurons, also known as units, are found in artificial neural networks. 
The whole Artificial Neural Network is composed of these artificial neurons, which are arranged in a seri

## Creating instruction dataset

We will now prepare our dataset to fine-tune our base model (instruction fine-tuning).

### Instruction prompt

We need to convert the instruction, input and expected output (prompt-response pairs) into explicit instructions for the LLM, such as the following:

```
{'text': "<s>[INST] Create an MCQ on the applications of deep learning.

Here's some context: Examples of Deep Learning:\nDeep Learning is a type of Machine Learning that uses artificial neural networks with multiple layers to learn and make decisions.\n....[/INST]

[question]: Which of the following is an application of deep learning? [option A]: Analyzing financial transactions for fraud detection. [option B]: Predicting future stock prices. [option C]: Recognizing faces in photos. [option D]: All of the above. [correct_answer]: D, [explanation]:Deep learning is used in various applications, including analyzing financial transactions for fraud detection, predicting future stock prices, and recognizing faces in photos.</s>"}

```

We will create a prompt template and a function to apply the template to all the samples in our dataset. Note that we also append an EOS token to the end of each sample, so that the fine-tuned model learns to stop at the appropriate point (i.e. at the end of the response) instead of generating tokens indefinitely.

In [None]:
print(dataset_train)

Dataset({
    features: ['expected_output', 'instruction', 'input_content'],
    num_rows: 684
})


In [None]:
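#https://colab.research.google.com/drive/134o_cXcMe_lsvl15ZE_4Y75Kstepsntu?usp=sharing#scrollTo=NWbzDeSKmakC
# Follow the Llama 2 documentation to format the dataset for fine-tuning

def format_dolly(sample):
    """Build the [INST] ... [/INST] prompt for one sample."""
    # Note: the tokenizer also prepends a BOS token, so the literal <s> here
    # results in two BOS ids at the start of input_ids (see the tokenized sample below).
    instruction = f"<s>[INST] {sample['instruction']}"
    context = f"Here's some context: {sample['input_content']}" if len(sample["input_content"]) > 0 else None
    response = f" [/INST] {sample['expected_output']}"
    # Join all the parts together. No separator is inserted, so the instruction
    # runs directly into the context (unlike the illustration above).
    prompt = "".join([i for i in [instruction, context, response] if i is not None])
    return prompt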

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# Apply the prompt template to each sample.
# Note: dataset_shuffled is only used to look up the column names to remove;
# the mapping below runs on the unshuffled dataset_train.
dataset_shuffled = dataset_train.shuffle(seed=42)

dataset_train = dataset_train.map(template_dataset, remove_columns=list(dataset_shuffled.features))
dataset_train


Dataset({
    features: ['text'],
    num_rows: 684
})
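
For comparison, here is a sketch of a variant of the template that separates the instruction and the context with a blank line, as in the illustration at the start of this section. It is not the function used in this notebook: `build_prompt` is a hypothetical helper, the default `eos_token="</s>"` is an assumption standing in for `tokenizer.eos_token`, and the literal `<s>` is left out on the assumption that the tokenizer adds the BOS token itself.

```python
def build_prompt(sample: dict, eos_token: str = "</s>") -> str:
    """Hypothetical variant of the prompt template with explicit separators."""
    parts = [f"[INST] {sample['instruction']}"]
    if sample["input_content"]:
        parts.append(f"Here's some context: {sample['input_content']}")
    prompt = "\n\n".join(parts) + " [/INST] "
    # Append EOS so the model learns where the response ends.
    return prompt + sample["expected_output"] + eos_token

example = {
    "instruction": "Summarize the focus and aim of deep learning.",
    "input_content": "Deep Learning is a sub-part of Machine Learning.",
    "expected_output": "Deep learning is a branch of machine learning.",
}
print(build_prompt(example))
```

Whichever format is chosen for training, the same format has to be used when prompting the fine-tuned model.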

Let's look at one of the samples. We can see that the original sample has been converted to a sample with a single 'text' field, and the text now conforms to the template we specified.

In [None]:
print(dataset_train[640])
#print(dataset)

{'text': "<s>[INST] Create an MCQ on the applications of deep learningHere's some context: Examples of Deep Learning:\nDeep Learning is a type of Machine Learning that uses artificial neural networks with multiple layers to learn and make decisions.\nHere are some examples of Deep Learning:\n\nImage and video recognition: Deep learning algorithms are used in image and video recognition systems to classify and analyze visual data. These systems are used in self-driving cars, security systems, and medical imaging.\nGenerative models: Deep learning algorithms are used in generative models to create new content based on existing data. These systems are used in image and video generation, text generation, and other applications.\nAutonomous vehicles: Deep learning algorithms are used in self-driving cars and other autonomous vehicles to analyze sensor data and make decisions about speed, direction, and other factors.\nImage classification: Deep Learning algorithms are used to recognize obje

Similarly we will apply the prompt template to the validation and test splits too.

In [None]:
dataset_val = dataset_val.map(template_dataset, remove_columns=list(dataset_shuffled.features))
dataset_test = dataset_test.map(template_dataset, remove_columns=list(dataset_shuffled.features))


In [None]:
print(dataset_train)
print(dataset_val)
print(dataset_test)

Dataset({
    features: ['text'],
    num_rows: 684
})
Dataset({
    features: ['text'],
    num_rows: 147
})
Dataset({
    features: ['text'],
    num_rows: 147
})


### Tokenization and Preparing the Input

#### Tokenization

Before we can use the dataset for training, we first need to tokenize the dataset.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

dataset_train_tokenized = dataset_train.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset_train.features,
)


In [None]:
print("Dataset info: ", dataset_train_tokenized)
print("Length of input_ids: ", len(dataset_train_tokenized['input_ids'][0]))
print("Sample input: \n", dataset_train_tokenized[0])

Dataset info:  Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 684
})
Length of input_ids:  91
Sample input: 
 {'input_ids': [1, 1, 29961, 25580, 29962, 6991, 3034, 675, 278, 18066, 267, 297, 6483, 6509, 10605, 29915, 29879, 777, 3030, 29901, 21784, 6509, 756, 1754, 7282, 3061, 4564, 4110, 297, 5164, 4235, 29892, 541, 727, 526, 1603, 777, 18066, 267, 393, 817, 304, 367, 20976, 29889, 2266, 526, 777, 310, 278, 1667, 18066, 267, 297, 6483, 6509, 29901, 518, 29914, 25580, 29962, 450, 18066, 267, 297, 6483, 6509, 3160, 848, 20847, 3097, 29892, 26845, 7788, 29892, 931, 29899, 25978, 292, 6694, 29892, 6613, 3097, 5626, 29892, 322, 975, 29888, 5367, 29889, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


We can see that after tokenization we now have the input_ids (the ids corresponding to each token, or subword) and the attention mask, which tells the model which tokens to ignore (e.g. padding). We also show the length of the input_ids of the first sample, which in this case is 91 tokens.

We will apply the same tokenization to our validation and test datasets.
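
In the sample above every mask entry is 1, because a single unpadded sequence has nothing to ignore. Zeros only appear once sequences of different lengths are padded into one batch. The toy sketch below, with a hypothetical `pad_batch` helper and made-up token ids, shows the idea; it is not how the tokenizer is called in this notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[1, 450, 18066, 2], [1, 6483, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```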

In [None]:
dataset_val_tokenized = dataset_val.map(
    tokenize_function,
    batched=True,   # default batch size is 1000
    num_proc=4,
    remove_columns=dataset_val.features,
)

dataset_test_tokenized = dataset_test.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=dataset_test.features,
)


Now let's prepare the input data for the model. As you can see above, the token ids (input_ids) of a sample are typically no more than a few hundred tokens long, whereas Llama models have a context window of 2048 or 4096 tokens. To use the data more efficiently, we use a technique called packing: instead of having one text per sample in the batch and then padding to either the longest text or the maximal context of the model, we concatenate many texts, separated by EOS tokens, and cut the result into chunks of the context size, filling the batch without any padding.

<img src="https://github.com/nyp-sit/iti107/blob/main/session-7/resources/packing.png?raw=1" width="700"/>


The code below helps us find the maximum context window of the model.

In [None]:
def get_max_context_length(model):
    """Return the maximum context length recorded in the model config."""
    conf = model.config
    max_length = None

    # Different architectures store the context length under different names.
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max context length: {max_length} in {length_setting}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max context length: {max_length}")

    return max_length

max_context_length = get_max_context_length(model)
print('Maximum Context length: ', max_context_length)

Found max context length: 4096 in max_position_embeddings
Maximum Context length:  4096


The following function concatenates a batch of samples and then divides the concatenated sequence into chunks of the context size. We also need to create 'labels' in the input dataset, which tell the model which tokens are to be predicted. Shifting the inputs and labels to align them happens inside the model, so our labels are just an exact copy of the input_ids.

In the code below, we use a context_length of 512 instead of the maximum of 4096, as we have limited GPU memory and a larger context length would result in an out-of-memory error even with a batch size of 1.

In [None]:
context_length = 512
# context_length = max_context_length

def group_texts(examples):

    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= context_length:
        total_length = (total_length // context_length) * context_length
    # Split by chunks of context length.
    result = {
        k: [t[i : i + context_length] for i in range(0, total_length, context_length)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
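
To illustrate why the labels can simply be a copy of input_ids, the sketch below shows the alignment that a causal language model applies internally when computing the loss: the prediction made at position i is compared with the token at position i + 1. The helper name `next_token_pairs` is hypothetical and the short id sequence is only an example.

```python
def next_token_pairs(input_ids, labels):
    """Pair each input token with the label the model must predict after it."""
    # Drop the last input (nothing follows it) and the first label (nothing precedes it).
    return list(zip(input_ids[:-1], labels[1:]))

input_ids = [1, 29961, 25580, 29962, 2]
labels = input_ids.copy()
for seen, target in next_token_pairs(input_ids, labels):
    print(f"after token {seen} -> predict {target}")
```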


In [None]:
dataset_train_final = dataset_train_tokenized.map(group_texts, batched=True, num_proc=4)
dataset_val_final = dataset_val_tokenized.map(group_texts, batched=True, num_proc=4)
dataset_test_final = dataset_test_tokenized.map(group_texts, batched=True, num_proc=4)


Now let's examine dataset_train_final. We can see that all the samples have a length equal to the specified context length.

In [None]:
dataset_train_final

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 246
})

In [None]:
dataset_val_final

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 50
})

In [None]:
for sample in dataset_train_final['input_ids'][:5]:
    print(len(sample))

512
512
512
512
512


Since we have done all the heavy lifting of preprocessing the data in our code, we can use the simple default data collator, which just batches the dictionary-like samples into tensors for the model.

In [None]:
data_collator = default_data_collator

## Setup the PEFT/LoRA model for Fine-Tuning

You need to set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank $r$ hyper-parameter, which defines the rank/dimension of the adapter to be trained.
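
As a conceptual sketch of what the adapter does, the code below adds a scaled low-rank update to the output of a frozen weight matrix, using tiny made-up matrices in plain Python. The helper names are hypothetical and this is not PEFT's implementation. It assumes the standard LoRA formulation, in which the update is scaled by `lora_alpha / r` and the matrix B starts at zero, so that the adapter initially leaves the model unchanged.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen 3x3 weight
A = [[0.1, 0.2, 0.3]]                                     # trainable, shape (r, 3) with r = 1
B = [[0.0], [0.0], [0.0]]                                 # trainable, shape (3, r), starts at zero
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B, x, alpha=2, r=1))
```

Only A and B receive gradient updates during training; with rank r much smaller than the hidden size, they hold far fewer parameters than W.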


In [None]:
model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )

    # prepare int-8 model for training
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)


trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199


If you look at the trainable parameters, there are only about 4 million of them, compared to about 6.7 billion parameters in the entire model.
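
The trainable parameter count can be reproduced by hand from the LoRA configuration and the model config shown earlier. The following is a small arithmetic check, not part of the training code:

```python
hidden_size = 4096    # in_features and out_features of q_proj and v_proj
rank = 8              # r in LoraConfig
num_layers = 32       # num_hidden_layers in the model config
target_modules = 2    # q_proj and v_proj

# Each adapted module adds A (rank x hidden_size) and B (hidden_size x rank).
params_per_module = rank * hidden_size + hidden_size * rank
trainable = params_per_module * target_modules * num_layers
print(f"{trainable:,}")
```

Increasing `r` or adding more entries to `target_modules` scales this number proportionally.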

In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear8bitLt(
                (base_layer): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=F

## Define the Trainer and Training Arguments

We can now define the training arguments and create a Trainer instance. If you are using an Ampere GPU (e.g. NVIDIA A10), you can set bf16 to True to use bfloat16 for mixed-precision computation.

*Note: Fine-tuning the model to a decent level of performance takes a long time (approximately 1 to 2 hours). The configuration below trains for 200 steps and logs, evaluates and saves a checkpoint every 10 steps (logging_steps=10 and save_steps=10). If you are short on time, reduce max_steps.*

In [None]:
# specify where to write the checkpoint to
output_dir = "train_out_dir"

# Define training args
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    auto_find_batch_size=False,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    bf16=False,  # Use BF16 if available (e.g. on Ampere GPU)
    # logging strategy
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    # saving and evaluation strategy
    save_strategy="steps",
    save_steps=10,
    evaluation_strategy="steps",
    optim="adamw_torch_fused",
    load_best_model_at_end=True,
    max_steps=200
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train_final,
    eval_dataset=dataset_val_final,
    data_collator=data_collator,
)


In [None]:
# Start training
trainer.train()



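| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10 | 1.6813 | 1.631856 |
| 20 | 1.583 | 1.498535 |
| 30 | 1.4138 | 1.393412 |
| 40 | 1.3603 | 1.308431 |
| 50 | 1.2711 | 1.237813 |
| 60 | 1.2255 | 1.179109 |
| 70 | 1.1686 | 1.139814 |
| 80 | 1.0829 | 1.112842 |
| 90 | 1.114 | 1.087292 |
| 100 | 1.0772 | 1.062716 |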




TrainOutput(global_step=200, training_loss=1.100144896507263, metrics={'train_runtime': 3605.5794, 'train_samples_per_second': 0.222, 'train_steps_per_second': 0.055, 'total_flos': 1.6248515592192e+16, 'train_loss': 1.100144896507263, 'epoch': 3.25})
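
The numbers in the training output can be related back to the training arguments. The sketch below assumes a single GPU and takes the chunk count (246) and the runtime from the outputs above; it is only a sanity check of the arithmetic.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 200
num_train_chunks = 246        # rows in dataset_train_final
train_runtime_s = 3605.5794   # train_runtime from the TrainOutput above

# Samples consumed per optimizer step on a single GPU.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps
epochs = samples_seen / num_train_chunks
throughput = samples_seen / train_runtime_s

print(f"effective batch size: {effective_batch}")
print(f"approx. epochs: {epochs:.2f}")
print(f"samples per second: {throughput:.3f}")
```

With gradient accumulation, each optimizer step sees 4 packed chunks, so 200 steps correspond to a little over three passes through the training set.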

In [None]:
model.eval()
trainer.evaluate(eval_dataset=dataset_val_final)



{'eval_loss': 0.8671211004257202,
 'eval_runtime': 28.5195,
 'eval_samples_per_second': 1.753,
 'eval_steps_per_second': 0.877,
 'epoch': 3.25}
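
Because the evaluation loss of a causal language model is the average cross-entropy per token, it can be converted to perplexity by exponentiating it. A minimal sketch using the value reported above:

```python
import math

eval_loss = 0.8671211004257202  # eval_loss from trainer.evaluate() above
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.2f}")
```

Lower perplexity means the model assigns higher probability to the validation text; it does not directly measure summary or MCQ quality.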

### Save the Trained model

In [None]:
save_dir = gpath+'lora_model_output'
model.save_pretrained(save_dir)


### Load the PEFT Model

Uncomment the following to download fine-tuned LoRA weights.

You should **restart the session to clear the GPU memory** before continuing with the next step.



---


## Adding LoRA weights to the model

---
This section loads and applies the previously trained LoRA weights, so that the model does not have to be retrained in each new Colab session.


In [None]:
from google.colab import drive
drive.mount('/content/drive')
gpath="/content/drive/MyDrive/Colab_Notebooks/NYP_AIML/shared_it110/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
import re
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
#from transformers import AutoTokenizer, AutoModelForCausalLM
#from transformers import default_data_collator, Trainer, TrainingArguments
from datasets import load_dataset
import evaluate
from peft import LoraConfig, PeftModel, get_peft_model

In [None]:
model_id = 'meta-llama/Llama-2-7b-hf'
save_dir = gpath+'lora_model_output_base_llama2_sum_mcq'
new_model = "apailang/llama2_7b_sum_mcq_3"

# Reload model in FP16 and merge it with LoRA weights
base_model = LlamaForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
model = PeftModel.from_pretrained(base_model, save_dir)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = LlamaTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

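
Conceptually, `merge_and_unload()` folds the adapter into the base weights, so that the result is a plain model without separate LoRA layers (compare the printed architecture further below). The toy sketch below shows that arithmetic with a hypothetical `merge_lora` helper and made-up 2x2 values; it assumes the standard LoRA scaling of `lora_alpha / r` and is not PEFT's actual implementation.

```python
def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) as a new matrix (list of rows)."""
    scale = alpha / r
    rows, cols, rank = len(W), len(W[0]), len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight
A = [[0.5, -1.0]]              # shape (r, in_features) with r = 1
B = [[2.0], [0.0]]             # shape (out_features, r)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, inference needs only one matrix multiplication per layer, at the cost of no longer being able to swap the adapter out.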

In [None]:
model.push_to_hub(new_model, max_shard_size='2GB')
tokenizer.push_to_hub(new_model)


CommitInfo(commit_url='https://huggingface.co/apailang/llama2_7b_sum_mcq_3/commit/32f4dd5fd1e17b074ec0b57e99b965a077b43723', commit_message='Upload tokenizer', commit_description='', oid='32f4dd5fd1e17b074ec0b57e99b965a077b43723', pr_url=None, pr_revision=None, pr_num=None)



---



In [None]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_
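
The layer shapes printed above are enough to estimate the parameter count by hand. The sketch below is a back-of-the-envelope check. The printout is cut off at `lm_`, so the size of the final output projection is an assumption here: we take it to be another 32000 x 4096 matrix, as in the standard Llama 2 7B architecture. Rotary embeddings and the activation function carry no trainable parameters.

```python
hidden, intermediate, vocab, n_layers = 4096, 11008, 32000, 32

attention = 4 * hidden * hidden        # q_proj, k_proj, v_proj, o_proj (no bias)
mlp = 3 * hidden * intermediate        # gate_proj, up_proj, down_proj (no bias)
layer_norms = 2 * hidden               # two RMSNorm weight vectors per layer
per_layer = attention + mlp + layer_norms

embeddings = vocab * hidden            # embed_tokens
final_norm = hidden                    # model.norm
output_head = vocab * hidden           # ASSUMPTION: not visible in the truncated printout

total_params = embeddings + n_layers * per_layer + final_norm + output_head
fp16_gigabytes = total_params * 2 / 1e9   # 2 bytes per parameter in float16

print(f"{total_params:,} parameters, about {fp16_gigabytes:.2f} GB in fp16")
```

Under these assumptions the fp16 weights come to roughly 13.5 GB, which fits an upload with `max_shard_size='2GB'` being split into seven shards.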



---


# To start using the model

---



### Test the Model

Now let's test our fine-tuned model on the same prompt.

In [None]:
tokenizer = LlamaTokenizer.from_pretrained("apailang/llama2_7b_sum_mcq_3")
# Load the merged model from the Hub with 8-bit weights to reduce GPU memory use
model = LlamaForCausalLM.from_pretrained("apailang/llama2_7b_sum_mcq_3", load_in_8bit=True, device_map='cuda:0', use_cache=False)



Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

In [None]:
eval_prompt = """
Create an Multiple choice question:
Artificial neural networks are built on the principles of the structure and operation of human neurons.
It is also known as neural networks or neural nets. An artificial neural network\u2019s input layer, which is the first layer, receives input from external sources and passes it on to the hidden layer, which is the second layer.
Each neuron in the hidden layer gets information from the neurons in the previous layer, computes the weighted total, and then transfers it to the neurons in the next layer.
These connections are weighted, which means that the impacts of the inputs from the preceding layer are more or less optimized by giving each input a distinct weight.
These weights are then adjusted during the training process to enhance the performance of the model.
Artificial neurons, also known as units, are found in artificial neural networks.
The whole Artificial Neural Network is composed of these artificial neurons, which are arranged in a series of layers.
The complexities of neural networks will depend on the complexities of the underlying patterns in the dataset whether a layer has a dozen units or millions of units.
Commonly, Artificial Neural Network has an input layer, an output layer as well as hidden layers.
The input layer receives data from the outside world which the neural network needs to analyze or learn about.
In a fully connected artificial neural network, there is an input layer and one or more hidden layers connected one after the other.
Each neuron receives input from the previous layer neurons or the input layer. The output of one neuron becomes the input to other neurons in the next layer of the network, and this process continues until the final layer produces the output of the network.
Then, after passing through one or more hidden layers, this data is transformed into valuable data for the output layer. Finally, the output layer provides an output in the form of an artificial neural network\u2019s response to the data that comes in.

---
question:
options A:
options B:
options C:
options D:
correct_answer:
explanation:
"""


from transformers import TextStreamer

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# # Streaming support
# streamer = TextStreamer(tokenizer)
# model.generate(**model_input, streamer=streamer)
model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input)[0], skip_special_tokens=True))


Create an Multiple choice question:
Artificial neural networks are built on the principles of the structure and operation of human neurons.
It is also known as neural networks or neural nets. An artificial neural network’s input layer, which is the first layer, receives input from external sources and passes it on to the hidden layer, which is the second layer.
Each neuron in the hidden layer gets information from the neurons in the previous layer, computes the weighted total, and then transfers it to the neurons in the next layer.
These connections are weighted, which means that the impacts of the inputs from the preceding layer are more or less optimized by giving each input a distinct weight.
These weights are then adjusted during the training process to enhance the performance of the model.
Artificial neurons, also known as units, are found in artificial neural networks.
The whole Artificial Neural Network is composed of these artificial neurons, which are arranged in a series of 

[INST] Create an MCQ on the structure and operation of artificial neural networks. [/INST] [question]: What is the purpose of an artificial neural network's input layer? [option A]: To receive input from external sources. [option B]: To compute the weighted total of input from the previous layer. [option C]: To pass on input to the hidden layer. [option D]: To optimize the weights of connections. [correct_answer]: A, [explanation]:The input layer in an artificial neural network receives input from external sources and passes it on to the hidden layer.
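
The generated MCQ follows a bracketed field layout (`[question]:`, `[option A]:` ... `[explanation]:`), so it can be split into a dictionary for downstream use. The following is a minimal sketch, assuming the model keeps to exactly these field names; `parse_mcq` and `FIELD_PATTERN` are hypothetical names of our own, and the sample string is a shortened stand-in rather than real model output.

```python
import re

# One capturing group, so re.split keeps the field names in its result
FIELD_PATTERN = re.compile(
    r'\[(question|option [A-D]|correct_answer|explanation)\]:\s*'
)


def parse_mcq(text):
    """Split a generated MCQ string into a {field name: value} dictionary."""
    pieces = FIELD_PATTERN.split(text)
    # pieces = [preamble, name_1, value_1, name_2, value_2, ...]
    fields = {}
    for name, value in zip(pieces[1::2], pieces[2::2]):
        # Drop surrounding whitespace and the comma that follows the answer letter
        fields[name] = value.strip().rstrip(',').strip()
    return fields


sample = (
    "[INST] Create an MCQ. [/INST] [question]: What does the input layer do? "
    "[option A]: Receives input. [option B]: Computes sums. "
    "[option C]: Adjusts weights. [option D]: Produces output. "
    "[correct_answer]: A, [explanation]:It receives input from external sources."
)
mcq = parse_mcq(sample)
print(mcq["question"], "->", mcq["correct_answer"])
```

Text before the first field (such as the `[INST] ... [/INST]` preamble) is ignored, and a missing field simply does not appear in the dictionary.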


## Evaluate the model using the ROUGE metric

We first define a utility function that extracts the text following `expected_output:` from the model's generated output.

In [None]:
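# Drop the prompt and keep only the text generated after 'expected_output:'
def get_summary(text):
    parts = re.split(r'expected_output:', text)
    # Return an empty string if the marker is missing from the generated text
    summary = parts[1].strip() if len(parts) > 1 else ''
    return summary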

The test set has 147 entries.
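
The figure of 147 follows from the split arithmetic used in this section: 30% of the data is held out, then divided evenly into validation and test sets. The sketch below reproduces that arithmetic. It assumes the dataset has 978 rows (inferred from its name, `mini-dataset-978`, and not verified here) and that the held-out size is rounded up; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, holdout_fraction=0.3, test_fraction_of_holdout=0.5):
    """Return (train, valid, test) sizes for a two-stage split."""
    n_holdout = math.ceil(holdout_fraction * n_rows)
    n_train = n_rows - n_holdout
    n_test = math.ceil(test_fraction_of_holdout * n_holdout)
    n_valid = n_holdout - n_test
    return n_train, n_valid, n_test


# ASSUMPTION: 978 rows, taken from the dataset name
print(split_sizes(978))
```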

In [None]:
from datasets import Dataset, DatasetDict, load_dataset


In [None]:
dataset_name = "BitBasher/mini-dataset-978"
dataset = load_dataset(dataset_name)

ds_train_devtest = dataset['train'].train_test_split(test_size=0.3, seed=42)
ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.5, seed=42)

ds_splits = DatasetDict({
    'train': ds_train_devtest['train'],
    'valid': ds_devtest['train'],
    'test': ds_devtest['test']
})

# Define the key phrases to filter for
key_phrases = ["summarize", "mcq"]
# Filter the dataset using a list comprehension
filtered_records = [
    record
    for record in ds_splits['test']
    if any(phrase in record["instruction"].lower() for phrase in key_phrases)
]

# Split the filtered records into separate sets based on key phrases
summarize_records = [record for record in filtered_records if "summarize" in record["instruction"].lower()]
mcq_records = [record for record in filtered_records if "mcq" in record["instruction"].lower()]

# Print or utilize the filtered sets
print("Summarize records:", summarize_records)
print("MCQ records:", mcq_records)

Summarize records: [{'expected_output': 'The challenges in deep learning include data availability, computational resources, time-consuming training, interpretability, and overfitting. Deep learning models require large amounts of data for training, specialized hardware for computation, and can be difficult to interpret. Additionally, overfitting can occur when the model becomes too specialized for the training data.', 'instruction': 'Summarize the challenges in deep learning', 'input_content': 'Deep learning has made significant advancements in various fields, but there are still some challenges that need to be addressed. Here are some of the main challenges in deep learning:'}, {'expected_output': 'In XGBoost, the learning_rate determines the step size taken by the optimizer during each iteration. The n_estimators determines the number of boosting trees to be trained. The max_depth determines the maximum depth of each tree in the ensemble. The min_child_weight determines the minimum 

In [None]:
print(len(summarize_records))
print(len(mcq_records))

49
70
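
The two counts need not add up to the size of the test set, because a record can match neither key phrase (or both). A minimal sketch of the same grouping on toy records, which also keeps track of the unmatched ones; the records and the helper name `split_by_phrase` are made up for illustration.

```python
def split_by_phrase(records, key_phrases, field="instruction"):
    """Group records by the key phrases found in record[field] (case-insensitive)."""
    groups = {phrase: [] for phrase in key_phrases}
    unmatched = []
    for record in records:
        text = record[field].lower()
        hits = [phrase for phrase in key_phrases if phrase in text]
        for phrase in hits:
            groups[phrase].append(record)
        if not hits:
            unmatched.append(record)
    return groups, unmatched


# Toy records, not taken from the real dataset
toy_records = [
    {"instruction": "Summarize the challenges in deep learning"},
    {"instruction": "Create an MCQ on neural networks"},
    {"instruction": "Explain gradient descent"},
]
groups, unmatched = split_by_phrase(toy_records, ["summarize", "mcq"])
print(len(groups["summarize"]), len(groups["mcq"]), len(unmatched))
```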


In [None]:
summarize_records = Dataset.from_list(summarize_records)
mcq_records = Dataset.from_list(mcq_records)



---


### Evaluate summarization capabilities

---



In [None]:
dialogues = summarize_records['input_content'][:5]
human_baseline_summaries = summarize_records['expected_output'][:5]

print(dialogues)
print(human_baseline_summaries)

['Deep learning has made significant advancements in various fields, but there are still some challenges that need to be addressed. Here are some of the main challenges in deep learning:', 'Hyperparameters in XGBoost include learning_rate, n_estimators, max_depth, min_child_weight, and subsample.', 'Deep Learning is a type of Machine Learning that uses artificial neural networks with multiple layers to learn and make decisions. Here are some examples of Deep Learning applications:', 'Deep learning has made significant advancements in various fields, but there are still some challenges that need to be addressed. Here are some of the main challenges in deep learning:', 'Artificial Intelligence (AI) is the broader family consisting of Machine Learning (ML) and Deep Learning (DL). ML is a subset of AI, while DL is a subset of ML. AI focuses on mimicking human behavior through algorithms, ML enables systems to learn from data, and DL uses neural networks to analyze data and provide output.'

In [None]:
peft_model_summaries = []

for dialogue in dialogues:
    eval_prompt = f"""
Summarize:
{dialogue}
---
expected_output:
"""
    model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        peft_model_output = tokenizer.decode(model.generate(**model_input)[0], skip_special_tokens=True)
    summary = get_summary(peft_model_output)
    peft_model_summaries.append(summary)
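
The loop above relies on a round trip: the prompt ends with an `expected_output:` marker, the model's decoded output repeats the prompt, and everything after the marker is taken as the prediction. The sketch below isolates that round trip without a model. `build_prompt` and `extract_after_marker` are hypothetical helpers, and the "generated" text is a hand-written stand-in.

```python
MARKER = "expected_output:"


def build_prompt(instruction, content):
    """Build a prompt with the same layout as the f-string used in the loop."""
    return f"\n{instruction}\n{content}\n---\n{MARKER}\n"


def extract_after_marker(decoded_text):
    """Return the text that follows the first marker, or '' if it is absent."""
    _, found, tail = decoded_text.partition(MARKER)
    return tail.strip() if found else ""


prompt = build_prompt("Summarize:", "Deep learning still faces several challenges.")
# Stand-in for tokenizer.decode(model.generate(...)): prompt followed by a completion
decoded = prompt + "Deep learning faces open challenges."
print(extract_after_marker(decoded))
```

Keeping the template in one helper makes it harder for the evaluation prompt to drift away from the format used during fine-tuning.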

In [None]:
rouge = evaluate.load('rouge')

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print('PEFT model ROUGE scores:')
print(peft_model_results)

PEFT model ROUGE scores:
{'rouge1': 0.3283277139378782, 'rouge2': 0.1885538402735263, 'rougeL': 0.2823365839277277, 'rougeLsum': 0.28328511621432834}
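
To give a feel for what `rouge1` measures, here is a simplified unigram-overlap F1 in plain Python. It is a sketch, not the implementation behind `evaluate.load('rouge')`: it tokenizes on whitespace, does no stemming, and scores a single prediction against a single reference, so its numbers will not match the library's exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over overlapping lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 3))
```

In this toy case every predicted word appears in the reference (precision 1.0) but only half of the reference is covered (recall 0.5), giving an F1 of about 0.667.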


In [None]:
print('Human Baseline')
print('*'*10)
for summary in human_baseline_summaries[:5]:
    print(summary)
print('PEFT summaries')
print('*'*10)
for summary in peft_model_summaries[:5]:
    print(summary)

Human Baseline
**********
The challenges in deep learning include data availability, computational resources, time-consuming training, interpretability, and overfitting. Deep learning models require large amounts of data for training, specialized hardware for computation, and can be difficult to interpret. Additionally, overfitting can occur when the model becomes too specialized for the training data.
In XGBoost, the learning_rate determines the step size taken by the optimizer during each iteration. The n_estimators determines the number of boosting trees to be trained. The max_depth determines the maximum depth of each tree in the ensemble. The min_child_weight determines the minimum sum of instance weight needed in a child node. The subsample determines the percentage of rows used for each tree construction.
Deep Learning is a subset of Machine Learning that utilizes artificial neural networks with multiple layers to learn and make decisions. It has various applications, including 



---


### Evaluate MCQ capabilities

---



In [None]:
dialogues = mcq_records['input_content'][5:10]
human_baseline_summaries = mcq_records['expected_output'][5:10]

print(dialogues)
print(human_baseline_summaries)

['Applications of Deep Learning : The main applications of deep learning can be divided into computer vision, natural language processing (NLP), and reinforcement learning. Reinforcement learning: In reinforcement learning , deep learning works as training agents to take action in an environment to maximize a reward. Some of the main applications of deep learning in reinforcement learning include: ● Game playing: Deep reinforcement learning models have been able to beat human experts at games such as Go, Chess, and Atari. ● Robotics: Deep reinforcement learning models can be used to train robots to perform complex tasks such as grasping objects, navigation, and manipulation. ● Control systems: Deep reinforcement learning models can be used to control complex systems such as power grids, traffic management, and supply chain optimization.', 'Deep Learning has achieved significant success in various fields, including image recognition, natural language processing, speech recognition, and 

In [None]:
peft_model_summaries = []

for dialogue in dialogues:
    eval_prompt = f"""
create Multiple Choices Questions (MCQ):
:
{dialogue}
---
expected_output:
"""
    model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        peft_model_output = tokenizer.decode(model.generate(**model_input)[0], skip_special_tokens=True)
    summary = get_summary(peft_model_output)
    peft_model_summaries.append(summary)

In [None]:
rouge = evaluate.load('rouge')

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print('PEFT model ROUGE scores:')
print(peft_model_results)

PEFT model ROUGE scores:
{'rouge1': 0.3529915828959201, 'rouge2': 0.21045322444486483, 'rougeL': 0.2598666825032208, 'rougeLsum': 0.2534343222392257}
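
`rougeL` is lower than `rouge1` here because it rewards words that appear in the same order, not just the same words. A simplified sketch of the idea, based on the longest common subsequence (LCS) of whitespace tokens; as with any hand-rolled metric, this is an illustration under our own simplifications (no stemming, single reference), not the library's implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L: F1 computed from the LCS length."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Same five words in a different order: full unigram overlap, but the LCS is only 2
print(round(rouge_l_f1("game playing is an application",
                       "an application is game playing"), 2))
```

For MCQ outputs, where options can be reordered without changing the meaning, an order-sensitive score like this should be read with some caution.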


In [None]:
print('Human Baseline')
print('*'*10)
for summary in human_baseline_summaries[:5]:
    print(summary)
print('PEFT summaries')
print('*'*10)
for summary in peft_model_summaries[:5]:
    print(summary)

Human Baseline
**********
[question]: What is one of the main applications of deep learning in reinforcement learning? [option A]: Sentiment analysis. [option B]: Image segmentation. [option C]: Game playing. [option D]: Speech recognition. [correct_answer]: C, [explanation]:One of the main applications of deep learning in reinforcement learning is game playing. Deep reinforcement learning models have been able to beat human experts at games such as Go, Chess, and Atari.
[question]: Which of the following is an application of deep learning? [option A]: Database management. [option B]: Speech recognition. [option C]: Statistical analysis. [option D]: Data visualization. [correct_answer]: B, [explanation]:An application of deep learning is speech recognition. Deep learning has achieved significant success in various fields, including image recognition, natural language processing, speech recognition, and recommendation systems.
[question]: What is the relationship between Artificial Inte



---


### Testing the token generation speed


---



In [None]:
import transformers
import time

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)

In [None]:
tokens_per_second_list = []

for i in range(20):
    start = time.time()
    output = pipeline(eval_prompt, max_new_tokens=30, temperature=1, top_k=1, top_p=0.90)
    end = time.time()

    # time.time() returns seconds, so all timings below are in seconds
    total_time = end - start
    # Assumes all 30 requested tokens were generated
    time_per_token = total_time / 30

    # Calculate tokens per second
    tokens_per_second = 30 / total_time
    tokens_per_second_list.append(tokens_per_second)


average = sum(tokens_per_second_list) / len(tokens_per_second_list)
# Print the results: the first two values come from the last run only,
# the tokens-per-second figure is the average over all 20 runs
print("Total inference time: {:.2f} s".format(total_time))
print("Time per token: {:.2f} s/token".format(time_per_token))
print("Tokens per second: {:.2f} token/s".format(average))



Total inference time: 8.23 s
Time per token: 0.27 s/token
Tokens per second: 3.75 token/s
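
The measurement above divides by a fixed 30 tokens, which overstates throughput whenever generation stops early. A slightly more careful variant counts the tokens each call actually produced. The sketch below shows the pattern with a stand-in generator so that it runs without a GPU; `measure_throughput` and `fake_generate` are hypothetical names.

```python
import time


def measure_throughput(generate_fn, n_runs=5):
    """Average tokens/second over n_runs calls.

    generate_fn takes no arguments and returns the number of new tokens
    it produced on that call.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        n_new_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        rates.append(n_new_tokens / elapsed)
    return sum(rates) / len(rates)


def fake_generate():
    # Stand-in for a real generation call: pretend 30 tokens took 10 ms
    time.sleep(0.01)
    return 30


print(f"{measure_throughput(fake_generate):.1f} tokens/s")
```

With the notebook's model, `generate_fn` could call `model.generate(**model_input, max_new_tokens=30)` and return the output length minus `model_input["input_ids"].shape[1]`, since the returned sequence includes the prompt tokens.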
