Model Card for Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0:
Model Details:
Model Description:
- Fine-tuned from: Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0, on teknium/openhermes.
- We pruned the 4 layers of meta-llama/Meta-Llama-3.1-8B that had the least impact on model performance, following the paper The Unreasonable Ineffectiveness of the Deeper Layers (see the sketch after this list).
- The pruned model therefore has 1.09B fewer parameters than the foundation model, which means a smaller memory footprint, faster training, and lower inference latency.
- We then recovered the performance lost to pruning by fine-tuning (MMLU-Pro 0-shot from 0.2642 to 0.3120); this step is called healing the pruned model.
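The layer-selection heuristic from the paper can be illustrated with a minimal sketch (an assumption-laden illustration, not the exact script used for this model): each block of 4 consecutive layers is scored by the angular distance between the block's input and output hidden states, and the lowest-scoring block, the one that changes the representation least, is the pruning candidate.

```python
import torch

def least_important_block(hidden_states, n=4):
    """Score every block of n consecutive layers by the angular distance
    between the block's input and output hidden states (last-token
    positions), following 'The Unreasonable Ineffectiveness of the
    Deeper Layers'. The block with the smallest distance is the one
    whose removal perturbs the model least.

    hidden_states: per-layer activations from a forward pass with
    output_hidden_states=True, each of shape [batch, seq_len, hidden_dim].
    """
    best_start, best_dist = 0, float("inf")
    for start in range(len(hidden_states) - n):
        x = hidden_states[start][:, -1, :]       # block input
        y = hidden_states[start + n][:, -1, :]   # block output
        cos = torch.nn.functional.cosine_similarity(x, y, dim=-1)
        dist = torch.arccos(cos.clamp(-1.0, 1.0)).mean() / torch.pi
        if dist < best_dist:
            best_start, best_dist = start, dist.item()
    return best_start  # prune layers best_start .. best_start + n - 1
```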
Upcoming Work:
- More healing through SFT/DPO/TPO to see whether we can get closer to meta-llama/Meta-Llama-3.1-8B performance (MMLU-Pro 0-shot of 0.3659, vs. 0.3120 for our model). (In progress)
- Compare the exact same process applied to meta-llama/Llama-3.1-70B.
Training Details:
LoRA adapters are attached to every attention and MLP projection matrix with the following Unsloth configuration:

```python
from unsloth import FastLanguageModel

# Attach rank-4 LoRA adapters to all attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r = 4,                                   # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 4,                          # scaling factor (alpha / r = 1)
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # memory-efficient checkpointing
    random_state = 3407,
    use_rslora = False,                      # plain LoRA, no rank stabilization
    loftq_config = None,                     # no LoftQ initialization
)
```
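Since Unsloth returns a standard PEFT-wrapped model, the adapter footprint can be sanity-checked before training (a quick check, assuming the usual peft PeftModel API is exposed):

```python
# Report trainable (LoRA) vs. total parameters; with r=4 the adapter
# is a small fraction of the pruned backbone.
model.print_trainable_parameters()
```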
```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Supervised fine-tuning ("healing") on teknium/openhermes.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "completion",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 10,
        gradient_accumulation_steps = 4,   # effective batch size of 40
        warmup_steps = 5,
        max_steps = 5000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs_4",
        push_to_hub = True,
        hub_always_push = True,
    ),
)
```
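The snippet above assumes that model, tokenizer, dataset, and max_seq_length already exist. A hedged sketch of that setup follows; the "completion" column construction is a hypothetical mapping, since the card does not show the preprocessing actually used:

```python
from datasets import load_dataset
from unsloth import FastLanguageModel

max_seq_length = 2048  # assumed value; must match the trainer argument

# Load the checkpoint being healed in this round.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0",
    max_seq_length = max_seq_length,
)

# Build the "completion" text column the trainer reads; this mapping is
# hypothetical, as the card omits the preprocessing step.
def to_text(example):
    return {"completion": example["instruction"] + "\n" + example["output"]}

dataset = load_dataset("teknium/openhermes", split = "train").map(to_text)
```

With these in place, calling trainer.train() launches the 5000-step healing run.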
Training Data:
teknium/openhermes
Load Mode Memory Metrics
| Model | Max Global VRAM (MB) | Max Process VRAM (MB) | Max Reserved VRAM (MB) | Max Allocated VRAM (MB) |
|---|---|---|---|---|
| Llama-3.1-8B | 18521.98 | 16630.42 | 16196.30 | 16060.54 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 16319.97 | 14428.41 | 13994.30 | 13879.42 |
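The card does not name the benchmarking harness used; a minimal sketch of how peak-VRAM figures like these can be collected with torch.cuda counters (the model ID and dtype here are the only assumptions):

```python
import torch
from transformers import AutoModelForCausalLM

# Minimal peak-VRAM probe (not the card's exact harness).
torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
    "Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0",
    torch_dtype=torch.bfloat16,
).cuda()
print(f"max allocated VRAM: {torch.cuda.max_memory_allocated() / 2**20:.2f} MB")
print(f"max reserved VRAM:  {torch.cuda.max_memory_reserved() / 2**20:.2f} MB")
```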
Inference Mode Latency Metrics
| Model | Latency Mean (s) | Throughput (tokens/s) |
|---|---|---|
| Llama-3.1-8B | 0.8104 | 38.2536 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.5530 | 56.0570 |
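Again, the exact measurement setup is not specified; a hedged latency/throughput probe, reusing the model loaded in the VRAM sketch above (prompt text and generation length are arbitrary assumptions):

```python
import time
import torch
from transformers import AutoTokenizer

# Minimal greedy-decoding latency probe (not the card's exact harness).
tokenizer = AutoTokenizer.from_pretrained("Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0")
inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"latency: {elapsed:.4f} s  throughput: {new_tokens / elapsed:.2f} tokens/s")
```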
Evaluation:
- (Foundation model) MMLU-Pro 0-shot of meta-llama/Meta-Llama-3.1-8B: 0.3659
- (Pruned model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers: 0.2642
- (Healed model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0: 0.3120
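These scores can be reproduced with EleutherAI's lm-evaluation-harness; a hedged sketch assuming a recent version where MMLU-Pro is registered under the task name mmlu_pro (the dtype argument is an assumption):

```python
import lm_eval

# 0-shot MMLU-Pro evaluation of the healed model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0,dtype=bfloat16",
    tasks=["mmlu_pro"],
    num_fewshot=0,
)
print(results["results"]["mmlu_pro"])
```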
Evaluation Data and Process:
Additional Benchmark Results
BoolQ 0-shots Benchmark Results
Model |
Average Score |
boolq (0 shots) |
boolq contrastset (0 shots) |
meta-llama/Meta-Llama-3.1-8B |
0.569 |
0.569 |
0.568 |
Na0s/Llama-3.1-8B-Pruned-4-Layers |
0.240 |
0.240 |
0.240 |
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 |
0.833 |
0.834 |
0.831 |
BigBench 0-shot Benchmark Results
| Model | Average Score | bigbench:causal_judgment (0-shot) | bigbench:date_understanding (0-shot) | bigbench:disambiguation_qa (0-shot) | bigbench:geometric_shapes (0-shot) | bigbench:logical_deduction (0-shot) | ... |
|---|---|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.351 | 0.574 | 0.499 | 0.302 | 0.164 | 0.208 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.299 | 0.537 | 0.341 | 0.314 | 0.200 | 0.212 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.364 | 0.579 | 0.610 | 0.407 | 0.264 | 0.208 | ... |
Few-Shot Benchmark Results
| Model | Average Score | arc:challenge (25-shot) | hellaswag (10-shot) | mmlu:abstract_algebra (5-shot) | mmlu:college_chemistry (5-shot) | mmlu:college_computer_science (5-shot) | mmlu:college_mathematics (5-shot) | ... |
|---|---|---|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.552 | 0.541 | 0.620 | 0.290 | 0.450 | 0.480 | 0.350 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.516 | 0.462 | 0.549 | 0.290 | 0.440 | 0.460 | 0.280 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.544 | 0.479 | 0.554 | 0.340 | 0.480 | 0.520 | 0.350 | ... |
BigBench 3-shot Benchmark Results
| Model | Average Score | bigbench:causal_judgment (3-shot) | bigbench:date_understanding (3-shot) | bigbench:disambiguation_qa (3-shot) | bigbench:geometric_shapes (3-shot) | bigbench:logical_deduction (3-shot) | ... |
|---|---|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.442 | 0.563 | 0.596 | 0.593 | 0.181 | 0.298 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.420 | 0.563 | 0.642 | 0.574 | 0.217 | 0.258 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.450 | 0.621 | 0.686 | 0.663 | 0.225 | 0.332 | ... |
Overall Average Score
| Model | Overall Average Score |
|---|---|
| meta-llama/Meta-Llama-3.1-8B | 0.472 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.364 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.513 |
Environmental Impact:
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
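For a measured figure rather than a calculator estimate, one option is the codecarbon package (a different tool from the MLCO2 calculator cited above; trainer here is the SFTTrainer from Training Details):

```python
from codecarbon import EmissionsTracker

# Wrap the training run in an emissions tracker; tracker.stop() returns
# the estimated kilograms of CO2-equivalent emitted during the run.
tracker = EmissionsTracker(project_name="llama-pruning-healing")
tracker.start()
trainer.train()
emissions_kg = tracker.stop()
print(f"estimated emissions: {emissions_kg:.3f} kg CO2eq")
```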