# Used Data and Model

Datasets:

*   https://huggingface.co/datasets/heliosbrahma/mental_health_chatbot_dataset
*   https://huggingface.co/datasets/mpingale/mental-health-chat-dataset

Base model:

*   https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct


# Dataset Preprocessing

In [14]:
!pip install datasets
!pip install peft



In [1]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from datasets import load_dataset
from huggingface_hub import notebook_login


notebook_login()


In [2]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # Llama doesn't have a pad token by default

prompt = ("You are a therapy chatbot, designed to offer emotional support and companionship to users seeking a listening ear. "
        "Your purpose is to engage in conversations that provide comfort, offer insights based on therapeutic principles, and suggest resources when appropriate. "
        "You need to act in a friendly and empathetic manner, ensuring that users feel heard and supported during their interactions with you.")

message_template = [
    {"role": "system", "content": "{}"}, # The Prompt
    {"role": "user", "content": "{}"}, # The Question
    {"role": "assistant", "content": "{}"} # The Answer
]

dataset_1 = load_dataset("heliosbrahma/mental_health_chatbot_dataset", split="train")
dataset_2 = load_dataset("mpingale/mental-health-chat-dataset", split="train")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


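I am not certain this is the correct preprocessing format for Llama 3 Instruct, but it is the one used here: each example is rendered with the tokenizer's chat template as a system prompt, a user question, and an assistant answer.

The format discussed in this thread may be the intended one: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/14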
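One way to check the format is to build the Llama 3 Instruct layout by hand and compare it with what the tokenizer produces. The sketch below is a plain-string approximation inferred from the decoded example printed later in this notebook; `render_llama3_chat` is a hypothetical helper, not part of `transformers`, and the template shipped with the tokenizer may differ in details such as whitespace trimming.

```python
def render_llama3_chat(messages):
    """Render role/content dicts in the Llama 3 Instruct layout (hand-written sketch)."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        # Each turn: a role header, a blank line, the content, then an end-of-turn token
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a therapy chatbot."},  # shortened prompt
    {"role": "user", "content": "What is a panic attack?"},
    {"role": "assistant", "content": "Panic attacks come on suddenly."},  # shortened answer
]
rendered = render_llama3_chat(example)
print(rendered)
```

If decoding a preprocessed training example gives the same layout, the preprocessing agrees with the chat template.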

In [3]:
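import re
import torch


def tokenize(example):
    """Render one <HUMAN>/<ASSISTANT> row with the chat template and return its token ids."""
    # Split on the speaker tags; index 0 holds whatever precedes the first tag
    split_text = re.split(r'<HUMAN>:\s*|<ASSISTANT>:\s*', example["text"])
    question = split_text[1].strip()
    answer = split_text[2].strip()
    contents = [prompt, question, answer]
    # Fill the system, user, and assistant slots of message_template in order
    message = [{"role": template["role"], "content": template["content"].format(content)} for template, content in zip(message_template, contents)]
    # With return_tensors="pt" the result is batched, so [0] selects the single sequence
    message = tokenizer.apply_chat_template(message, return_tensors="pt")
    return {"text": message[0]}  # the "text" column now holds token ids, not a string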


tokenized_dataset_1 = dataset_1.map(tokenize)
print(tokenizer.decode(tokenized_dataset_1["text"][0]))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a therapy chatbot, designed to offer emotional support and companionship to users seeking a listening ear. Your purpose is to engage in conversations that provide comfort, offer insights based on therapeutic principles, and suggest resources when appropriate. You need to act in a friendly and empathetic manner, ensuring that users feel heard and supported during their interactions with you.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is a panic attack?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic 
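The `re.split` call in `tokenize` can be checked on a single string without loading the dataset. The sample row below is made up for illustration (the answer is shortened), and `split_pair` is a hypothetical helper that additionally rejects rows that do not contain exactly one question and one answer.

```python
import re

TAG_PATTERN = re.compile(r"<HUMAN>:\s*|<ASSISTANT>:\s*")


def split_pair(text):
    """Return (question, answer) from a '<HUMAN>: ... <ASSISTANT>: ...' string."""
    pieces = TAG_PATTERN.split(text)
    # pieces[0] is whatever precedes the first tag (empty for well-formed rows)
    if len(pieces) != 3:
        raise ValueError(
            f"expected one question and one answer, got {len(pieces) - 1} segments"
        )
    return pieces[1].strip(), pieces[2].strip()


sample = "<HUMAN>: What is a panic attack?\n<ASSISTANT>: Panic attacks come on suddenly."
question, answer = split_pair(sample)
print(question)
print(answer)
```

A guard like this matters because indexing `split_text[1]` and `split_text[2]` directly would raise an `IndexError` on a row without both tags, and would silently drop extra turns on a multi-turn row.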

## Removing Duplicates From Dataset 2

https://discuss.huggingface.co/t/how-can-i-drop-duplicates-on-datasets-module/15369

In [4]:
import pandas as pd


df = dataset_2.to_pandas()
print("Number of unique questions:", df['questionID'].nunique())

# For each questionID, keep only the answer with the most views
df_filtered = df.loc[df.groupby('questionID')['views'].idxmax()].reset_index(drop=True)
print("Question Answer pairs after filter:", df_filtered.shape[0])

Number of unique questions: 940
Question Answer pairs after filter: 940
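To see which rows the `groupby(...).idxmax()` filter keeps, here is a toy run. Only the column names mirror the real dataset; the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "questionID": [0, 0, 1, 1, 1],
        "answerText": ["a0", "a1", "b0", "b1", "b2"],
        "views": [10, 25, 7, 3, 5],
    }
)

# idxmax returns, per questionID, the row label of the most-viewed answer
best_rows = toy.groupby("questionID")["views"].idxmax()
toy_filtered = toy.loc[best_rows].reset_index(drop=True)
print(toy_filtered)
```

Each question ends up with exactly one answer, so the number of rows after filtering equals the number of distinct `questionID` values.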


In [5]:
from datasets import Dataset


dataset_2 = Dataset.from_pandas(df_filtered)


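def another_tokenize(example):
    """Render one questionText/answerText row with the chat template and return its token ids."""
    question = example["questionText"]
    answer = example["answerText"]
    contents = [prompt, question, answer]
    # Fill the system, user, and assistant slots of message_template in order
    message = [{"role": template["role"], "content": template["content"].format(content)} for template, content in zip(message_template, contents)]
    message = tokenizer.apply_chat_template(message, return_tensors="pt")
    return {"text": message[0]}  # token ids of the single rendered conversation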


tokenized_dataset_2 = dataset_2.map(another_tokenize)
print(tokenizer.decode(tokenized_dataset_2["text"][0]))


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a therapy chatbot, designed to offer emotional support and companionship to users seeking a listening ear. Your purpose is to engage in conversations that provide comfort, offer insights based on therapeutic principles, and suggest resources when appropriate. You need to act in a friendly and empathetic manner, ensuring that users feel heard and supported during their interactions with you.<|eot_id|><|start_header_id|>user<|end_header_id|>

I have so many issues to address. I have a history of sexual abuse, I’m a breast cancer survivor and I am a lifetime insomniac.    I have a long history of depression and I’m beginning to have anxiety. I have low self esteem but I’ve been happily married for almost 35 years.
   I’ve never had counseling about any of this. Do I have too many issues to address in counseling?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Absolutely not.  I strongly recommending working on on

# Setting Up LoRA

In [6]:
import torch
from peft import LoraConfig, get_peft_model


model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, device_map="auto")

# Freeze the model
for param in model.parameters():
    param.requires_grad = False

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# LoRA adapters
# target_modules is not set, so PEFT uses its default for this architecture
config = LoraConfig(
    r=16, # LoRA rank
    lora_alpha=32, # LoRA scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout=0.05,
    bias="none", # do not train any bias terms
    modules_to_save=None, # no additional base-model layers are unfrozen and trained
    task_type="CAUSAL_LM")

model = get_peft_model(model, config)
model.print_trainable_parameters()


trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
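The trainable-parameter count can be reproduced by hand. The sketch below assumes PEFT's default LoRA targets for Llama (`q_proj` and `v_proj`, since `target_modules` is not set in the config) and the published Llama 3 8B shapes: hidden size 4096, 32 decoder layers, and a 1024-wide value projection from 8 key/value heads of dimension 128. These shapes are assumptions taken from the model configuration, not from this notebook.

```python
def lora_params(in_features, out_features, r):
    """A LoRA pair adds A (r x in_features) and B (out_features x r)."""
    return r * in_features + out_features * r


R = 16            # LoRA rank from the config
HIDDEN = 4096     # assumed Llama 3 8B hidden size
KV_DIM = 8 * 128  # assumed: 8 key/value heads x head dimension 128
LAYERS = 32       # assumed number of decoder layers

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
trainable = per_layer * LAYERS
total = 8_037_076_992  # "all params" as printed by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / total:.4f}")
```

At 4 bytes per float32 weight, that is roughly 27.3 MB, in line with the `adapter_model.safetensors` size shown in the upload output.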


# Model Training on Dataset 1

In [7]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling


# Define trainer arguments
trainer_1_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    fp16=True,
    logging_steps=1,
    output_dir="outputs")


# Define trainer
trainer_1 = Trainer(
    model=model,
    args=trainer_1_args,
    train_dataset=tokenized_dataset_1["text"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)


# Train model
model.config.use_cache = False # Suppress warnings, re-enable for inference later
trainer_1.train()


# Save the fine-tuned model
trainer_1.save_model("finetuned_model_1")

Step,Training Loss
1,2.0352
2,2.4403
3,1.9871
4,2.0044
5,2.1092
6,1.6123
7,1.6103
8,2.0087
9,1.3335
10,1.5568


# Model Training on Dataset 2

In [10]:
# Define trainer arguments
trainer_2_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    fp16=True,
    logging_steps=5,
    output_dir="outputs")


# Define trainer
trainer_2 = Trainer(
    model=model,
    args=trainer_2_args,
    train_dataset=tokenized_dataset_2["text"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)


# Train model
model.config.use_cache = False # Suppress warnings, re-enable for inference later
trainer_2.train()


# Save the fine-tuned model
trainer_2.save_model("finetuned_model_2")

Step,Training Loss
5,1.9379
10,1.9289
15,1.85
20,1.9221
25,1.6738
30,1.7826
35,1.9236
40,1.8055
45,1.8843
50,1.8865
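As a sanity check on the schedule, the number of optimizer steps follows from the training arguments and the 940 examples left after deduplication. This sketch assumes a single training process (so the per-device batch size is the whole batch) and that the warmup step count is rounded up; the variable names are illustrative.

```python
import math

num_examples = 940  # question-answer pairs after the filter
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_train_epochs = 3
warmup_ratio = 0.03

# One optimizer update consumes batch_size * accumulation examples
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_updates * warmup_ratio)

print(effective_batch, updates_per_epoch, total_updates, warmup_steps)
```

Under these assumptions the run has 705 optimizer steps, so a loss table that stops at step 50 covers only the start of training.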


# Upload To HuggingFace Hub

In [12]:
model.push_to_hub("John4Blues/Llama-3-8B-Therapy", token=True, commit_message="Just A Basic Trained Model")

adapter_model.safetensors:   0%|          | 0.00/27.3M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/John4Blues/Llama-3-8B-Therapy/commit/af765571aaebac3fae1dea710f8f306651de60f9', commit_message='Just A Basic Trained Model', commit_description='', oid='af765571aaebac3fae1dea710f8f306651de60f9', pr_url=None, pr_revision=None, pr_num=None)

# Inference

https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
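A minimal inference sketch, assuming the adapter pushed above is loaded on top of the base model with `PeftModel.from_pretrained`. The helper names (`build_messages`, `load_model`, `generate_reply`) are hypothetical, and the `bfloat16` dtype and sampling settings are assumptions rather than tuned values. The system prompt is shortened here; the full training prompt should be used in practice.

```python
SYSTEM_PROMPT = "You are a therapy chatbot, designed to offer emotional support."  # shortened


def build_messages(question, system_prompt=SYSTEM_PROMPT):
    """Chat messages for one user turn; the assistant turn is left for the model."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


def load_model(base_id="meta-llama/Meta-Llama-3-8B-Instruct",
               adapter_id="John4Blues/Llama-3-8B-Therapy"):
    """Load the base model and attach the LoRA adapter."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"  # dtype is an assumption
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model


def generate_reply(tokenizer, model, question, max_new_tokens=256):
    """Generate one assistant reply and return only the newly generated text."""
    import torch

    input_ids = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Stop on either the end-of-text token or Llama 3's end-of-turn token
    terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            eos_token_id=terminators,
            do_sample=True,
            temperature=0.6,  # sampling settings are assumptions, not tuned
            top_p=0.9,
        )
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)


# Usage (downloads the gated base model and needs a GPU):
# tokenizer, model = load_model()
# print(generate_reply(tokenizer, model, "What is a panic attack?"))
```

`add_generation_prompt=True` appends the assistant header so the model continues with an answer instead of another user turn.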