library_name: transformers
base_model: Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML
tags:
  - generated_from_trainer
model-index:
  - name: l3.1-8b-dans-instruct
    results: []
license: apache-2.0

Built with Axolotl

See axolotl config

axolotl version: 0.4.1

base_model: Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

trust_remote_code:

# wandb configuration
wandb_project: l3.1-8b-dans-instruct
wandb_watch:
wandb_run_id:
wandb_log_model: 

# where to save the finished model to
output_dir: ./l3.1-8b-dans-instruct

# dataset settings (local or huggingface repo)
datasets:
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
    type: sharegpt
    conversation: chatml
  - path: AquaV/Energetic-Materials-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/Chemical-Biological-Safety-Applications-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/US-Army-Survival-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/Resistance-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/Interrogation-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Mathmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Benchmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Codemaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Taskmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-ASCIIMaxx-Wordart
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Prosemaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Toolmaxx
    type: sharegpt
    conversation: chatml

chat_template: chatml

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

dataset_prepared_path: ./l3.1-8b-dans-instruct-data
val_set_size: 0.03

lora_model_dir: 

sequence_len: 8192

# use efficient multi-packing with block diagonal attention and per sequence position_ids. Recommended: set to 'true'
sample_packing: true
eval_sample_packing: true

# you can set these packing optimizations AFTER starting a training at least once.
# The trainer will provide recommended values for these values.

pad_to_sequence_len: true

#rope_scaling:
  #type:  # linear | dynamic
  #factor:  # float (2 for 2x)

adapter: # blank for full finetune
lora_r: 64
lora_alpha: 64
lora_dropout: 0.2
lora_target_linear: True
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_fan_in_fan_out:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0000015
cosine_min_lr_ratio: 

train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 15
eval_steps: 25
# save_steps: 100
saves_per_epoch: 3
debug: false
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:


special_tokens:
  pad_token: <|finetune_right_pad_id|>
  eos_token: <|im_end|>
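
The config above trains with `chat_template: chatml` and uses `<|im_end|>` as the EOS token, so prompts at inference time would be expected to follow the same layout. The Python sketch below builds such a prompt by hand. The helper name `build_chatml_prompt` and the example messages are hypothetical, and if the published tokenizer ships its own chat template, that template should take precedence over this hand-rolled version.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"  # matches eos_token in the config above


def build_chatml_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Render {"role", "content"} messages in ChatML layout (illustrative helper)."""
    parts = []
    for message in messages:
        # Each turn is: <|im_start|>role\ncontent<|im_end|>\n
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


# Hypothetical example conversation
example = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what sample packing does."},
])
print(example)
```

Generation should stop when the model emits `<|im_end|>`, which is why the config sets it as the EOS token.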

l3.1-8b-dans-instruct

This model is a fine-tuned version of Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML on the datasets listed in the axolotl config above. It achieves the following results on the evaluation set:

  • Loss: 0.6699

Model description

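A full-parameter fine-tune (no adapter) of Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML, trained with Axolotl 0.4.1 on ChatML-formatted conversations at a sequence length of 8192 tokens. More information needed.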

Intended uses & limitations

More information needed

Training and evaluation data

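Training used the ShareGPT-format datasets listed in the axolotl config above, rendered with the ChatML template. A fraction of 0.03 of the data was held out for evaluation (val_set_size: 0.03).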

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1.5e-06
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 32
  • optimizer: AdamW (adamw_torch) with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 15
  • num_epochs: 3
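
These values are related. The total train batch size is the per-device batch size times the gradient accumulation steps times the number of devices, and 1 × 32 = 32 is consistent with a single device. The learning rate warms up linearly for 15 steps and then follows a cosine decay. The Python sketch below reproduces both under stated assumptions: `NUM_DEVICES = 1` is inferred rather than stated in the card, `TOTAL_STEPS = 732` is a hypothetical value chosen only for illustration (the results table logs up to step 725), and the schedule is the common warmup-plus-cosine form decaying to zero, not a copy of the trainer's code.

```python
import math

MICRO_BATCH_SIZE = 1     # train_batch_size from the list above
GRAD_ACCUM_STEPS = 32    # gradient_accumulation_steps
NUM_DEVICES = 1          # assumption: inferred from 1 * 32 == 32
PEAK_LR = 1.5e-6         # learning_rate
WARMUP_STEPS = 15        # lr_scheduler_warmup_steps
TOTAL_STEPS = 732        # hypothetical: not stated in the card


def effective_batch_size(micro_batch: int, grad_accum: int, devices: int) -> int:
    """Sequences contributing to one optimizer step."""
    return micro_batch * grad_accum * devices


def lr_at_step(step: int,
               peak_lr: float = PEAK_LR,
               warmup_steps: int = WARMUP_STEPS,
               total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp past the final step
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:",
      effective_batch_size(MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_DEVICES))
for step in (0, 15, 100, 400, 725):
    print(f"step {step:>3}: lr = {lr_at_step(step):.3e}")
```

With sample packing enabled, each "sequence" in the batch is a packed block of up to 8192 tokens, so the number of original conversations per optimizer step can be larger than 32.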

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.9964        | 0.0041 | 1    | 1.0348          |
| 0.8433        | 0.1025 | 25   | 0.8220          |
| 0.7916        | 0.2049 | 50   | 0.7465          |
| 0.7381        | 0.3074 | 75   | 0.7152          |
| 0.6802        | 0.4098 | 100  | 0.7005          |
| 0.7764        | 0.5123 | 125  | 0.6917          |
| 0.6518        | 0.6148 | 150  | 0.6871          |
| 0.6864        | 0.7172 | 175  | 0.6831          |
| 0.7217        | 0.8197 | 200  | 0.6803          |
| 0.7072        | 0.9221 | 225  | 0.6781          |
| 0.6953        | 1.0287 | 250  | 0.6764          |
| 0.8013        | 1.1313 | 275  | 0.6752          |
| 0.6296        | 1.2338 | 300  | 0.6738          |
| 0.7553        | 1.3364 | 325  | 0.6729          |
| 0.6749        | 1.4390 | 350  | 0.6722          |
| 0.6619        | 1.5415 | 375  | 0.6715          |
| 0.6527        | 1.6441 | 400  | 0.6712          |
| 0.7654        | 1.7467 | 425  | 0.6707          |
| 0.7256        | 1.8492 | 450  | 0.6705          |
| 0.6921        | 1.9518 | 475  | 0.6701          |
| 0.6982        | 2.0523 | 500  | 0.6701          |
| 0.6997        | 2.1548 | 525  | 0.6701          |
| 0.6563        | 2.2574 | 550  | 0.6700          |
| 0.6564        | 2.3599 | 575  | 0.6699          |
| 0.6248        | 2.4624 | 600  | 0.6699          |
| 0.6893        | 2.5650 | 625  | 0.6699          |
| 0.6633        | 2.6675 | 650  | 0.6698          |
| 0.7045        | 2.7701 | 675  | 0.6698          |
| 0.7784        | 2.8726 | 700  | 0.6698          |
| 0.7798        | 2.9751 | 725  | 0.6699          |
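
If the validation loss is read as mean per-token cross-entropy in nats (the usual convention for causal language model training, assumed here rather than stated in the card), it converts to perplexity via exp(loss). The Python sketch below applies that conversion to a few rows copied from the results above; the function and variable names are illustrative.

```python
import math

# (step, validation loss) pairs copied from the training results
rows = [(1, 1.0348), (100, 0.7005), (250, 0.6764), (500, 0.6701), (725, 0.6699)]


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


for step, loss in rows:
    print(f"step {step:>3}: val loss {loss:.4f} -> perplexity {perplexity(loss):.3f}")
```

Under that assumption, perplexity falls from roughly 2.8 at step 1 to roughly 1.95 at step 725, and most of the improvement happens within the first epoch: the validation loss changes by less than 0.01 after step 250.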

Framework versions

  • Transformers 4.45.0.dev0
  • Pytorch 2.4.0+cu121
  • Datasets 2.21.0
  • Tokenizers 0.19.1