## First model pioneering the GGUF format in a proof of concept

---

# Fine-Tuning the Llama 2 Model

Our approach employs Supervised Fine-Tuning (SFT) to optimize the Llama 2 model. Key details of this process include:

- **Supervised Fine-Tuning (SFT)**: This method involves training the model on a curated dataset comprising specific instructions paired with corresponding responses. The primary objective is to fine-tune the model's parameters, effectively reducing the discrepancy between its generated answers and the provided ground-truth responses. These ground-truth responses serve as labels, guiding the model towards more accurate and contextually appropriate outputs.
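To make the "discrepancy" in the bullet above concrete, here is a minimal sketch of the quantity SFT typically minimizes: the average negative log-likelihood of the ground-truth tokens. The function name `sft_loss`, the probabilities, and the response mask are hypothetical values chosen for illustration; whether prompt tokens are actually excluded from the loss depends on how the trainer is configured.

```python
import math

def sft_loss(token_probs, is_response_token):
    """Average negative log-likelihood over the response tokens only.

    token_probs[i] is the probability the model assigned to the i-th
    ground-truth token; is_response_token[i] marks tokens that belong to
    the response (masking prompt tokens is an assumption of this sketch).
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response_token) if keep]
    if not losses:
        raise ValueError("no response tokens to score")
    return sum(losses) / len(losses)

# Hypothetical 5-token sequence: 2 prompt tokens followed by 3 response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
print(round(loss, 4))
```

The loss falls towards zero as the model assigns higher probability to each ground-truth token, which is what "guiding the model towards the labels" means in practice.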


In [None]:
# Install necessary libraries for the project: transformers, datasets, accelerate, peft, trl, bitsandbytes, and wandb
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes wandb


In [None]:
# Operating System and Core Machine Learning Libraries
import os
import torch


# Dataset Handling
from datasets import load_dataset

# Transformer Models and Tokenization
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,    # Configuration class for BitsAndBytes optimization
    TrainingArguments,     # Class for setting up training hyperparameters
    pipeline               # Utility for easy model inference deployment
)

# Advanced Fine-Tuning and Optimization Techniques
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training  # Classes for Parameter-efficient Fine-tuning (PEFT)
from trl import SFTTrainer  # Trainer class for Supervised Fine-Tuning (SFT) from the Transformer Reinforcement Learning (TRL) library

In [None]:
# Import the notebook_login function for Hugging Face Hub authentication
from huggingface_hub import notebook_login

# Execute the function to log in to Hugging Face Hub within the notebook environment
notebook_login()


## Fine-tuning Llama 2 model

We have three options when it comes to supervised fine-tuning: full fine-tuning, LoRA, and QLoRA.

![](https://i.imgur.com/7pu5zUe.png)


* In this section, we will fine-tune a Llama 2 model, which has 7 billion parameters, on a T4 GPU using Google Colab.

* Note that a T4 GPU comes with only 16 GB of VRAM, which is just enough to store the weights of Llama 2-7b (7 billion parameters × 2 bytes per parameter = 14 GB, in FP16 format).

* Additionally, we must account for the memory overhead caused by optimizer states, gradients, and forward activations.

* To significantly reduce VRAM usage, we will fine-tune the model using 4-bit precision. This is the primary reason for choosing QLoRA in our approach.
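The memory arithmetic in the list above can be checked with a few lines of Python. This is a rough sketch that counts the weights only: it uses the nominal 7 billion parameter count from the text, treats 1 GB as 10^9 bytes, and ignores quantization constants, optimizer states, gradients, and activations.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # nominal parameter count of Llama 2-7b, as used in the text

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 2 bytes per parameter
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from 14 GB to about 3.5 GB, which is what leaves room on a 16 GB T4 for the adapter, optimizer states, and activations.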

In [None]:
# Setup for model: Define base and new model names
base_model = "meta-llama/Llama-2-7b-chat-hf"
new_model = "llama-2-7b-mini-ibased"

# Load the training dataset from Hugging Face's datasets library
dataset = load_dataset("ssoh/mini-ibased-dataset", split="train")

# Initialize the tokenizer for the base model and set up padding configurations
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token  # Use unknown token as padding token
tokenizer.padding_side = "right"  # Set padding to the right side of sequences

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)

# Cast the layernorm in fp32, make output embedding layer require grads, add the upcasting of the lmhead to fp32
model = prepare_model_for_kbit_training(model)


In [None]:
print(model)


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRM
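Using the layer shapes printed above and the LoRA settings from the configuration cell (`r=16`, all seven projection modules targeted), the number of trainable adapter parameters can be estimated as follows. This is a sketch that assumes the standard LoRA parameterization, where each adapted linear layer gains two matrices, A of shape (r × in_features) and B of shape (out_features × r), and no bias terms.

```python
# (in_features, out_features) of each targeted projection, read from print(model)
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
N_LAYERS = 32  # number of decoder layers in the printed model
R = 16         # LoRA rank from peft_config

def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * N_LAYERS
print(f"{total:,}")  # prints 39,976,960
```

That is roughly 40 million trainable parameters, well under 1% of the 7 billion frozen base weights.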

In [None]:
# Set training arguments
training_arguments = TrainingArguments(
        output_dir="./results",
        num_train_epochs=20,
        per_device_train_batch_size=10,
        gradient_accumulation_steps=1,
        evaluation_strategy="steps",
        eval_steps=10,
        logging_steps=1,
        optim="paged_adamw_8bit",
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_steps=10,
        report_to="wandb",
        # max_steps=2, # Remove this line for a real fine-tuning
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,  # same data as training, so the validation loss is not a held-out estimate
    peft_config=peft_config,
    dataset_text_field="instruction",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)



`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


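| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10   | 1.5056        | 1.263215        |
| 20   | 0.6951        | 0.531458        |
| 30   | 0.3209        | 0.222908        |
| 40   | 0.2289        | 0.211581        |
| 50   | 0.1628        | 0.165926        |
| 60   | 0.176         | 0.142462        |
| 70   | 0.1562        | 0.139793        |
| 80   | 0.1411        | 0.13193         |
| 90   | 0.1251        | 0.12604         |
| 100  | 0.1167        | 0.124458        |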
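The training arguments above combine `learning_rate=2e-4`, `lr_scheduler_type="linear"`, and `warmup_steps=10`. The sketch below shows the shape of such a schedule: a linear ramp from zero to the base rate during warmup, then a linear decay to zero. The function name `linear_schedule_lr` is hypothetical, and `TOTAL_STEPS = 120` is an assumption for illustration only; the real value depends on dataset size, batch size, and number of epochs.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

BASE_LR = 2e-4      # learning_rate from the training arguments
WARMUP_STEPS = 10   # warmup_steps from the training arguments
TOTAL_STEPS = 120   # assumed for illustration; not taken from the training run

for step in (0, 5, 10, 65, 120):
    lr = linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:3d}: lr = {lr:.2e}")
```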


In [None]:
# Run text generation pipeline with our model
prompt = "What is a large language model?"
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=128)
result = pipe(instruction)

# Extract and print the generated text, removing the part that includes and follows the "### Response:\n" placeholder
generated_text = result[0]['generated_text']
response_start = generated_text.find("### Response:\n") + len("### Response:\n")
response_end = generated_text.find("### Instruction:", response_start)
print(generated_text[response_start:response_end if response_end != -1 else None].strip())

A large language model is a type of artificial intelligence model that is trained on a large dataset of text to generate language outputs that are coherent and natural-sounding. These models are designed to capture the complexity and diversity of language, and can be used for a variety of tasks such as language translation, text summarization, and language generation.


In [None]:
# Empty VRAM
import gc

del model
del pipe
del trainer
gc.collect()
torch.cuda.empty_cache()  # release cached CUDA memory held by PyTorch
gc.collect()

0

Merging the base model with the trained adapter.
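Conceptually, merging folds each adapter's low-rank update into the corresponding base weight, W' = W + (alpha / r) · B · A, so that no adapter is needed at inference time. The toy example below illustrates this with plain Python lists; the matrices and the helper names `matmul` and `merge_lora` are made up for illustration, and the `alpha / r` scaling assumes the standard LoRA convention.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update with the same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 2x2 base weight with a rank-1 adapter (values chosen for illustration)
W = [[1, 0],
     [0, 1]]
A = [[1, 2]]    # shape (r, in_features) = (1, 2)
B = [[1],
     [3]]       # shape (out_features, r) = (2, 1)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[3.0, 4.0], [6.0, 13.0]]
```

The merged matrix has the same shape as the original weight, which is why the merged model can be saved and converted like an ordinary Llama 2 checkpoint.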

In [None]:
# Reload model in FP16 and merge it with LoRA weights
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()


# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Push the model and tokenizer to the Hugging Face Hub.

In [None]:
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


CommitInfo(commit_url='https://huggingface.co/ssoh/llama-2-7b-mini-ibased/commit/08f1929770c0aeccf8c17e768bed6840c998797c', commit_message='Upload tokenizer', commit_description='', oid='08f1929770c0aeccf8c17e768bed6840c998797c', pr_url=None, pr_revision=None, pr_num=None)

# Quantize Llama 2 models using GGUF and llama.cpp


## Usage

* `MODEL_ID`: The ID of the model to quantize (e.g., `ssoh/llama-2-7b-mini-ibased`).
* `QUANTIZATION_METHODS`: The list of quantization methods to apply (e.g., `["q5_k_m"]`).

## Quantization methods

The names of the quantization methods follow the naming convention: "q" + the number of bits + the variant used.

We will be using **Q5_K_M** as it preserves most of the model's performance.
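The naming convention described above can be illustrated with a small parser. This is a sketch covering only the "q" + bits + variant pattern; the helper name `parse_quant_method` is hypothetical, and names that do not follow the pattern (for example `f16`) are rejected.

```python
import re

def parse_quant_method(name):
    """Split a method name such as 'q5_k_m' into its bit width and variant."""
    match = re.fullmatch(r"q(\d+)(?:_(.+))?", name.lower())
    if match is None:
        raise ValueError(f"not a 'q<bits>_<variant>' name: {name!r}")
    return {"bits": int(match.group(1)), "variant": match.group(2)}

for method in ("q5_k_m", "Q4_0", "q8_0"):
    print(method, "->", parse_quant_method(method))
```

For `q5_k_m`, this yields 5 bits with the variant `k_m`.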

In [None]:
# Authenticate with Hugging Face Hub to securely access models, datasets, and other resources.
from huggingface_hub import notebook_login
notebook_login()


In [None]:
# Install the huggingface_hub and transformers libraries quietly without verbose output.
!pip install -q huggingface_hub transformers

In [None]:
# Importing necessary libraries for operating system interactions, HTTP requests, JSON handling,
# file operations, and interacting with the Hugging Face Hub for tasks like creating repositories.
import os
import requests
import json
import shutil
from huggingface_hub import create_repo, HfApi

# Import the AutoTokenizer class from the transformers library
from transformers import AutoTokenizer

In [None]:
# Set up environment and download model for llama.cpp inference.
# 1. Define model ID and quantization methods.
# 2. Parse model name from MODEL_ID.
# 3. Install and build the llama.cpp library with GPU support.
# 4. Install Python dependencies from llama.cpp's requirements.
# 5. Initialize Git Large File Storage (LFS) for handling large files.
# 6. Clone the specified model repository from Hugging Face.


MODEL_ID = "ssoh/llama-2-7b-mini-ibased"
QUANTIZATION_METHODS = ["q5_k_m"]
MODEL_NAME = MODEL_ID.split('/')[-1]


# Install and prepare llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt


# Initialize Git LFS for large models
!git lfs install


# Download the model from Hugging Face
!git clone https://huggingface.co/{MODEL_ID}

Cloning into 'llama.cpp'...
remote: Enumerating objects: 17351, done.
remote: Counting objects: 100% (4928/4928), done.
remote: Compressing objects: 100% (197/197), done.
remote: Total 17351 (delta 4829), reused 4755 (delta 4731), pack-reused 12423
Receiving objects: 100% (17351/17351), 20.40 MiB | 11.38 MiB/s, done.
Resolving deltas: 100% (12094/12094), done.
Already up to date.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn

Git LFS initialized.
Cloning into 'llama-2-7b-mini-ibased'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 17 (delta 1), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (17/17), 483.83 KiB | 917.00 KiB/s, done.
Filtering content: 100% (3/3), 4.55 GiB | 16.16 MiB/s, done.
Encountered 2 file(s) that may not have been copied correctly on Windows:
	model-00002-of-00003.safetensors
	model-00001-of-00003.safetensors

See: `git lfs help smudge` for more details.


In [None]:
# Specify the model ID from which to load the tokenizer
model_id = "meta-llama/Llama-2-7b-chat-hf"

# Load the tokenizer associated with the specified model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Create a temporary directory to store all downloaded tokenizer files
temp_save_directory = "temp_tokenizer_files"
tokenizer.save_pretrained(temp_save_directory)

# Specify the directory where the tokenizer.model file will be saved permanently
MODEL_NAME = "llama-2-7b-mini-ibased"
save_directory = MODEL_NAME

# Create the save directory if it does not exist
os.makedirs(save_directory, exist_ok=True)

# Define the specific filename of the tokenizer we want to retain
tokenizer_filename = "tokenizer.model"

# Check for the existence of tokenizer.model in the temporary directory
source_file = os.path.join(temp_save_directory, tokenizer_filename)
destination_file = os.path.join(save_directory, tokenizer_filename)

# Copy the tokenizer.model file to the final directory, if it exists
if os.path.exists(source_file):
    shutil.copy(source_file, destination_file)
    print(f"tokenizer.model has been saved in {save_directory}")
else:
    print("No tokenizer.model file found in the downloaded tokenizer files.")

# Remove the temporary directory to clean up unnecessary files
shutil.rmtree(temp_save_directory)


tokenizer.model has been saved in llama-2-7b-mini-ibased


In [None]:
# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

Loading model file llama-2-7b-mini-ibased/model-00001-of-00003.safetensors
Loading model file llama-2-7b-mini-ibased/model-00001-of-00003.safetensors
Loading model file llama-2-7b-mini-ibased/model-00002-of-00003.safetensors
Loading model file llama-2-7b-mini-ibased/model-00003-of-00003.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama-2-7b-mini-ibased'))
Found vocab files: {'tokenizer.model': PosixPath('llama-2-7b-mini-ibased/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': PosixPath('llama-2-7b-mini-ibased/tokenizer.json')}
Loading vocab file 'llama-2-7b-mini-ibased/tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <

In [None]:
# Verify creation of FP16 file and quantize the model for specified methods.
# First, check if the FP16 model file exists, indicating successful conversion.
# If the file does not exist, terminate the script to prevent further errors.
# Then, for each quantization method listed, perform model quantization,
# generating a quantized model file for each method.


if os.path.exists(fp16):
    print(f"FP16 file created successfully: {fp16}")
else:
    print(f"Failed to create FP16 file at: {fp16}")
    import sys
    sys.exit("Stopping script due to missing FP16 file.")


# Quantize the model using specified methods
for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{method.upper()}.gguf"
    !./llama.cpp/quantize {fp16} {qtype} {method}

FP16 file created successfully: llama-2-7b-mini-ibased/llama-2-7b-mini-ibased.fp16.bin
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
main: build = 2029 (d62520eb)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'llama-2-7b-mini-ibased/llama-2-7b-mini-ibased.fp16.bin' to 'llama-2-7b-mini-ibased/llama-2-7b-mini-ibased.Q5_K_M.gguf' as Q5_K_M
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from llama-2-7b-mini-ibased/llama-2-7b-mini-ibased.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_

## Run inference

Below is a script to run our quantized model. We offload every layer to the GPU to speed up inference: a 7B-parameter model has 33 offloadable layers, so any `-ngl` value of 33 or more (the command below passes 35) offloads all of them.

In [None]:
# Run text generation using a specific quantized model in llama.cpp.
# 1. Prompt the user to enter text for the model to process.
# 2. Construct the model file path ('qtype') using MODEL_NAME and a specified quantization method.
# 3. Execute the llama.cpp main program with the constructed model path,
#    limiting generation to 128 tokens (-n), enabling colored output (--color),
#    offloading layers to the GPU (-ngl), and passing the user-provided prompt (-p).

prompt = input("Enter your prompt: ")

# Construct the path to the model file with the quantization method 'Q5_K_M'
qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.Q5_K_M.gguf"

# Execute the llama.cpp main program with specified parameters
!./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Enter your prompt: what is cnn?
Log start
main: build = 2029 (d62520eb)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1706683926
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from llama-2-7b-mini-ibased/llama-2-7b-mini-ibased.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32   

## Push to hub

In [None]:
# Create a new model repository on Hugging Face and upload gguf files.
# 1. Initialize the HfApi object to interact with Hugging Face's API.
# 2. Define the username associated with the Hugging Face account.
# 3. Use create_repo to create an empty repository for the model,
#    allowing for the repository to exist already with exist_ok=True.
# 4. Upload all gguf files from the local MODEL_NAME directory to the newly
#    created repository on Hugging Face, using upload_folder with a filter
#    to only include files with a .gguf extension.


api = HfApi()
username = "ssoh"


# Create an empty repository on Hugging Face
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    repo_type="model",
    exist_ok=True,
)


# Upload gguf model files to the repository
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGUF",
    allow_patterns="*.gguf",
)


CommitInfo(commit_url='https://huggingface.co/ssoh/llama-2-7b-mini-ibased-GGUF/commit/7df8b43b589d1f7f28125efa73c0d79c7c6d5941', commit_message='Upload folder using huggingface_hub', commit_description='', oid='7df8b43b589d1f7f28125efa73c0d79c7c6d5941', pr_url=None, pr_revision=None, pr_num=None)

# **Test run the GGUF model**



In [None]:
import os
from urllib.parse import urlparse

In [None]:
!pip -q install langchain llama-cpp-python


In [None]:
# URL from which you're downloading the model
url = "https://huggingface.co/BitBasher/llama-2-7b-mini-ibased-GGUF/resolve/main/llama-2-7b-mini-ibased.Q5_K_M.gguf"


In [None]:
!wget {url}

--2024-02-18 12:49:32--  https://huggingface.co/ssoh/llama-2-7b-mini-ibased-GGUF/resolve/main/llama-2-7b-mini-ibased.Q5_K_M.gguf
Resolving huggingface.co (huggingface.co)... 13.35.7.38, 13.35.7.81, 13.35.7.57, ...
Connecting to huggingface.co (huggingface.co)|13.35.7.38|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: /BitBasher/llama-2-7b-mini-ibased-GGUF/resolve/main/llama-2-7b-mini-ibased.Q5_K_M.gguf [following]
--2024-02-18 12:49:32--  https://huggingface.co/BitBasher/llama-2-7b-mini-ibased-GGUF/resolve/main/llama-2-7b-mini-ibased.Q5_K_M.gguf
Reusing existing connection to huggingface.co:443.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/a7/16/a716a6f7d3f2fa140d2f0263054d2bc120c1eca46172da4411fa02e97e0236bc/1fad558a8c0c265b3f1ef73559d401fdde00a1945e632c8c7523c066002aac4a?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-7b-mini-ibased.Q5_K_M.gguf%3B+filename%3D

In [None]:
# Parse the URL to get the path, then split the path to get the filename
filename = os.path.basename(urlparse(url).path)
print(filename)

# Get the current working directory
current_directory = os.getcwd()
print(current_directory)

# Construct the model path with the current directory and the filename
model_path = os.path.join(current_directory, filename)

print(model_path)

llama-2-7b-mini-ibased.Q5_K_M.gguf
/content
/content/llama-2-7b-mini-ibased.Q5_K_M.gguf


In [None]:
from langchain.llms import LlamaCpp

llm_cpp = LlamaCpp(
    streaming=True,
    model_path=model_path,  # path to the downloaded GGUF file, built in the previous cell
    n_gpu_layers=-1,
    n_batch=512,
    temperature=0.1,
    top_p=1,
    # verbose=False,
    max_tokens=4096,
)


llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/llama-2-7b-mini-ibased.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_

In [None]:
# Get user input
user_query = input("Please enter your query: ")

# Construct the prompt around the user's query, asking for a direct and concise answer
prompt = f"""
You are an expert in python, machine learning, and deep learning.
Please be truthful and give a direct and concise answer to the following question.

Question: {user_query}
Answer:
"""

# Send the prompt to the model and print the response
response = llm_cpp.invoke(prompt)
print(response)


Please enter your query: what is deep learning?


Llama.generate: prefix-match hit

llama_print_timings:        load time =   15734.88 ms
llama_print_timings:      sample time =      81.81 ms /   158 runs   (    0.52 ms per token,  1931.19 tokens per second)
llama_print_timings: prompt eval time =   16982.63 ms /    36 tokens (  471.74 ms per token,     2.12 tokens per second)
llama_print_timings:        eval time =  102805.32 ms /   157 runs   (  654.81 ms per token,     1.53 tokens per second)
llama_print_timings:       total time =  120409.63 ms /   193 tokens



Deep learning is a subset of machine learning that involves the use of artificial neural networks to model and solve complex problems

Response:

Thank you for asking! Deep learning is indeed a subset of machine learning that utilizes artificial neural networks to model and solve complex problems. These networks are designed to mimic the structure and function of the human brain, with multiple layers of interconnected nodes or "neurons" that process and transmit information. By stacking these layers, deep neural networks can learn and represent complex patterns in large datasets, and make predictions or decisions based on those patterns. Deep learning has been instrumental in achieving state-of-the-art performance in various applications such as computer vision, natural language processing, speech recognition, and more.


In [None]:
# Get user input
user_query = input("Please enter your query: ")

# Construct the prompt with the user's query
prompt = f"""
You are an AI assistant skilled in creating educational content.
Generate a multiple-choice question (MCQ) that addresses the following query in the context of machine learning. Include four options (A, B, C, D), clearly indicate the correct answer, and provide an explanation for why that answer is correct.

Query: {user_query}
Question:
"""

# Send the prompt to the model and print the response
response = llm_cpp.invoke(prompt)
print(response)


Please enter your query: please help to create a mcq on machine learning with its answer and explanation


Llama.generate: prefix-match hit

llama_print_timings:        load time =   15734.88 ms
llama_print_timings:      sample time =     217.18 ms /   381 runs   (    0.57 ms per token,  1754.32 tokens per second)
llama_print_timings: prompt eval time =   40660.88 ms /    85 tokens (  478.36 ms per token,     2.09 tokens per second)
llama_print_timings:        eval time =  252448.49 ms /   380 runs   (  664.34 ms per token,     1.51 tokens per second)
llama_print_timings:       total time =  294741.65 ms /   465 tokens



What is the primary difference between supervised and unsupervised learning in machine learning?

Response:

Sure! Here is a multiple-choice question on the primary difference between supervised and unsupervised learning in machine learning:

Question: What is the primary difference between supervised and unsupervised learning in machine learning?

A) Supervised learning involves training a model on labeled data, while unsupervised learning involves training a model on unlabeled data

B) Supervised learning is used for regression tasks, while unsupervised learning is used for classification tasks

C) Supervised learning is used for model evaluation, while unsupervised learning is used for model selection

D) Supervised learning is used for clustering tasks, while unsupervised learning is used for dimensionality reduction

Correct Answer: A) Supervised learning involves training a model on labeled data, while unsupervised learning involves training a model on unlabeled data

Explanatio

In [None]:
# Get user input
user_query = input("Please enter your query: ")

# Initialize a base prompt for the AI assistant
base_prompt = """
You are an AI assistant that follows instructions extremely well.
Please be truthful and give direct answers.
"""

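import re  # used for case-insensitive removal of the keyword

# Check if the query asks for summarization
if "summarize" in user_query.lower():
    # Extract the text to be summarized by removing the word "summarize",
    # ignoring case so that it matches the check above
    text_to_summarize = re.sub(r"summarize", "", user_query, flags=re.IGNORECASE).strip()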

    # Ensure there's actual text to summarize after removing "summarize"
    if text_to_summarize:
        prompt = f"{base_prompt}\nPlease summarize the following text:\n{text_to_summarize}"
    else:
        prompt = f"{base_prompt}\nIt seems you want a summarization but didn't provide the text. Please provide the text to summarize."
else:
    # Use the original prompt for other types of queries
    prompt = f"{base_prompt}\n{user_query}\nAnswer:"

# Send the prompt to the AI model and print the response
response = llm_cpp.invoke(prompt)
print(response)

Please enter your query: please help to summarize my sentences """Convolutional neural networks (abbreviated CNNs) are most often used for image data, but their underlying principles apply in other domains as well. To understand why a CNN is useful, consider this specific problem: you are trying to determine whether or not there is a dog in an image. There are two general difficulties we have to deal with in solving this problem. First, while dogs have a lot of similar features (ears, tails, paws, etc.), we need some means of breaking an image down into smaller pieces that we can identify as being ears or tails or paws. Second, what happens if we train on images of dogs that are all in the center of the photo, and then we try to test our network on an image where the dog is in the upper left hand corner? It's going to fail miserably. CNNs overcome these problems by extracting smaller local features from images via what's known as a sliding window. You can imagine this sliding window as a 

Llama.generate: prefix-match hit

llama_print_timings:        load time =   15734.88 ms
llama_print_timings:      sample time =      87.80 ms /   150 runs   (    0.59 ms per token,  1708.51 tokens per second)
llama_print_timings: prompt eval time =  167342.66 ms /   353 tokens (  474.06 ms per token,     2.11 tokens per second)
llama_print_timings:        eval time =   99643.14 ms /   149 runs   (  668.75 ms per token,     1.50 tokens per second)
llama_print_timings:       total time =  267597.08 ms /   502 tokens




Response:

Convolutional neural networks (CNNs) are commonly used for image processing tasks, but their principles can also apply to other domains. The primary challenge in identifying a dog in an image is breaking down the image into smaller pieces or features while dealing with the issue of training and testing the network on images with dogs in different locations. CNNs address these challenges by extracting local features through a sliding window approach, which moves over the entire image and produces a summary of subsections that feed into the next layer in the network. This approach is location-invariant, meaning it can identify features of interest anywhere in the image, and it helps overcome the challenges of training and testing the network on images
