Model Card for Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0:

Model Details:

Model Description:

  • Fine-tuned from Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0 on teknium/openhermes.
  • We pruned the 4 layers of meta-llama/Meta-Llama-3.1-8B that had the least impact on model performance, following the paper The Unreasonable Ineffectiveness of the Deeper Layers.
  • The resulting model has 1.09B fewer parameters than the foundation model, which means lower memory use, faster training, and lower inference latency.
  • We then recovered part of the performance lost to pruning by fine-tuning (MMLU-Pro 0-shot rose from 0.2642 to 0.3120). This step is called healing the pruned model.
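
The layer-selection idea can be illustrated with a small sketch. As we understand the cited paper, a block of n consecutive layers is chosen by measuring the angular distance between the representation entering the block and the representation leaving it, and picking the block where that distance is smallest. The card does not state the exact procedure used here, so the function names, the toy 2-D "hidden states", and the choice n = 2 below are all illustrative assumptions.

```python
import math

def angular_distance(u, v):
    """Angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp against rounding error
    return math.acos(cos) / math.pi

def pick_block_to_prune(hidden_states, n):
    """hidden_states[sample][layer] is the last-token vector entering `layer`;
    the final entry is the output of the last layer.
    Returns (start, mean_distance) for the n-layer block with the smallest
    mean angular distance between its input and its output."""
    num_layers = len(hidden_states[0]) - 1
    best_start, best_dist = None, float("inf")
    for start in range(num_layers - n + 1):
        dist = sum(
            angular_distance(sample[start], sample[start + n])
            for sample in hidden_states
        ) / len(hidden_states)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

def unit(deg):
    """Hypothetical helper: a 2-D unit vector at the given angle."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]

# Toy data (assumption): 2 samples, 5 layers, representations barely move across layers 2-3.
toy = [
    [unit(a) for a in (0, 40, 80, 82, 84, 130)],
    [unit(a) for a in (10, 55, 95, 96, 99, 150)],
]
start, dist = pick_block_to_prune(toy, n=2)
print(f"prune layers {start}..{start + 1}, mean angular distance {dist:.4f}")
```

In a real run the vectors would come from the model's hidden states on a calibration dataset; the toy vectors only exercise the selection logic.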

Upcoming Work:

  • More healing through SFT/DPO/TPO to see whether we can get closer to the performance of meta-llama/Meta-Llama-3.1-8B (MMLU-Pro 0-shot of 0.3659 vs. 0.3120 for our model). (In progress)
  • Apply the exact same process to meta-llama/LLama-3.1-70B and compare the results.

Training Details:

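from unsloth import FastLanguageModel

# Attach LoRA adapters to the attention and MLP projections.
# `model` is assumed to be loaded already (e.g. with FastLanguageModel.from_pretrained).
model = FastLanguageModel.get_peft_model(
    model,
    r = 4,                                   # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 4,                          # scaling = lora_alpha / r = 1
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)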
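
To give a sense of how small this adapter is, the sketch below counts the trainable LoRA parameters implied by r = 4 on the seven target modules. The projection shapes are those of the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, key/value width 1024). The layer count of 28 is an assumption (32 original layers minus the 4 pruned ones), and the helper name is hypothetical.

```python
def lora_param_count(shapes, r, num_layers):
    """Each adapted (d_in, d_out) projection adds two matrices:
    A with r * d_in entries and B with d_out * r entries."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

HIDDEN, INTERMEDIATE, KV_DIM = 4096, 14336, 1024  # Llama 3.1 8B dimensions

SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

r, lora_alpha = 4, 4
scaling = lora_alpha / r  # standard LoRA scaling when use_rslora is False
trainable = lora_param_count(SHAPES, r, num_layers=28)  # 28 layers is an assumption

print(f"LoRA scaling: {scaling}")
print(f"Trainable adapter parameters: {trainable:,}")
```

Under these assumptions the adapters add roughly 9.2M trainable parameters, a small fraction of the 6.94B-parameter pruned model.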

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# `model`, `tokenizer`, `dataset` and `max_seq_length` are defined earlier in the training script.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "completion",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 10,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 5000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs_4",
        push_to_hub = True,
        hub_always_push = True,
    ),
)
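
The training arguments above imply an effective batch size and a learning-rate curve that can be worked out by hand. The sketch below is a standalone approximation of a linear-warmup, half-cycle cosine schedule, written to mirror what we understand "cosine" to mean in Transformers. It does not call the library, and it assumes a single device when computing the effective batch size.

```python
import math

def cosine_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=5000):
    """Linear warmup to base_lr, then half-cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Effective batch size per optimizer step (single device assumed).
effective_batch = 10 * 4                   # per_device_train_batch_size * gradient_accumulation_steps
sequences_seen = effective_batch * 5000    # over max_steps optimizer steps

for step in (0, 5, 2500, 5000):
    print(f"step {step:>4}: lr = {cosine_lr(step):.2e}")
print(f"effective batch = {effective_batch}, sequences seen = {sequences_seen:,}")
```

With more than one GPU, the effective batch size and the number of sequences seen scale with the device count.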

Training Data:

teknium/openhermes

Memory and Latency Gains (measured with Optimum-Benchmark):

Load Mode Memory Metrics

| Model | Max Global VRAM (MB) | Max Process VRAM (MB) | Max Reserved VRAM (MB) | Max Allocated VRAM (MB) |
|---|---|---|---|---|
| Llama-3.1-8B | 18521.98 | 16630.42 | 16196.30 | 16060.54 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 16319.97 | 14428.41 | 13994.30 | 13879.42 |

Inference Mode Latency Metrics

| Model | Latency Mean (s) | Throughput (tokens/s) |
|---|---|---|
| Llama-3.1-8B | 0.8104 | 38.2536 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.5530 | 56.0570 |
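
The relative gains follow directly from the two tables. The short sketch below computes them from the reported numbers; the helper name is hypothetical.

```python
def relative_change(baseline, new):
    """Signed fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline

vram = relative_change(16060.54, 13879.42)       # max allocated VRAM (MB)
latency = relative_change(0.8104, 0.5530)        # mean latency (s)
throughput = relative_change(38.2536, 56.0570)   # tokens/s

print(f"Allocated VRAM: {vram:+.1%}")
print(f"Mean latency:   {latency:+.1%}")
print(f"Throughput:     {throughput:+.1%}")
```

On these measurements the pruned model uses about 13.6% less allocated VRAM, has about 31.8% lower mean latency, and delivers about 46.5% higher throughput than Llama-3.1-8B.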

Evaluation:

  • (Foundation model) MMLU-Pro 0-shot of meta-llama/Meta-Llama-3.1-8B: 0.3659
  • (Pruned model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers: 0.2642
  • (Healed model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0: 0.3120
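
One way to read these three scores is as the share of the pruning loss that healing recovers. The sketch below computes that share from the reported numbers; the "recovered fraction" metric and the function name are our own framing, not something defined by the benchmark.

```python
def recovered_fraction(base, pruned, healed):
    """Share of the base-to-pruned score drop that healing wins back."""
    return (healed - pruned) / (base - pruned)

fraction = recovered_fraction(base=0.3659, pruned=0.2642, healed=0.3120)
print(f"Healing recovered {fraction:.1%} of the MMLU-Pro drop caused by pruning")
```

By this measure, fine-tuning recovers a little under half of the MMLU-Pro 0-shot drop (0.0478 of 0.1017 points).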

Evaluation Data and Process:

Additional Benchmark Results

BoolQ 0-shot Benchmark Results

| Model | Average Score | boolq (0 shots) | boolq contrastset (0 shots) |
|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.569 | 0.569 | 0.568 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.240 | 0.240 | 0.240 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.833 | 0.834 | 0.831 |

BigBench 0-shot Benchmark Results

| Model | Average Score | bigbench:causal_judgment (0 shots) | bigbench:date_understanding (0 shots) | bigbench:disambiguation_qa (0 shots) | bigbench:geometric_shapes (0 shots) | bigbench:logical_deduction (0 shots) | ... |
|---|---|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.351 | 0.574 | 0.499 | 0.302 | 0.164 | 0.208 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.299 | 0.537 | 0.341 | 0.314 | 0.200 | 0.212 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.364 | 0.579 | 0.610 | 0.407 | 0.264 | 0.208 | ... |

Few-Shot Benchmark Results

| Model | Average Score | arc:challenge (25 shots) | hellaswag (10 shots) | mmlu:abstract_algebra (5 shots) | mmlu:college_chemistry (5 shots) | mmlu:college_computer_science (5 shots) | mmlu:college_mathematics (5 shots) | ... |
|---|---|---|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.552 | 0.541 | 0.620 | 0.290 | 0.450 | 0.480 | 0.350 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.516 | 0.462 | 0.549 | 0.290 | 0.440 | 0.460 | 0.280 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.544 | 0.479 | 0.554 | 0.340 | 0.480 | 0.520 | 0.350 | ... |

BigBench 3-shot Benchmark Results

| Model | Average Score | bigbench:causal_judgment (3 shots) | bigbench:date_understanding (3 shots) | bigbench:disambiguation_qa (3 shots) | bigbench:geometric_shapes (3 shots) | bigbench:logical_deduction (3 shots) | ... |
|---|---|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.442 | 0.563 | 0.596 | 0.593 | 0.181 | 0.298 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.420 | 0.563 | 0.642 | 0.574 | 0.217 | 0.258 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.450 | 0.621 | 0.686 | 0.663 | 0.225 | 0.332 | ... |

Overall Average Score

| Model | Overall Average Score |
|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.472 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.364 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.513 |

Environmental Impact:

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Model Size:

6.94B params (Safetensors, tensor type BF16)