Holland-4B-GGUF / README.md
Delta-Vector's picture
Update README.md
0fbae17 verified
metadata
License: apache-2.0
Language:
  - En
Pipeline_tag: text-generation
Base_model: nvidia/Llama-3.1-Minitron-4 B-Width-Base
Tags:
  - Chat
license: agpl-3.0
datasets:
  - anthracite-org/kalo-opus-instruct-22k-no-refusal
  - PJMixers/lodrick-the-lafted_OpusStories-ShareGPT
  - NewEden/Gryphe-3.5-16k-Subset
  - Epiculous/Synthstruct-Gens-v1.1-Filtered-n-Cleaned
tags:
  - chat

A model made to continue off my previous work on Magnum 4B, A small model made for creative writing / General assistant tasks, finetuned ontop of IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml, this model is made to be more coherent and generally be better then the 4B at both writing and assistant tasks.

GGUF quants of Holland 4B, Original weights can be found here

Prompting

Model has been Instruct tuned with the ChatML formatting. A typical input would look like this:

"""<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""

Support

No longer needed - Just update to the latest version of LCPP or wait for support

To run inference on this model, you'll need to use Aphrodite, vLLM or EXL 2/tabbyAPI, as llama.cpp hasn't yet merged the required pull request to fix the llama 3.1 rope_freqs issue with custom head dimensions.

However, you can work around this by quantizing the model yourself to create a functional GGUF file. Note that until this PR is merged, the context will be limited to 8 k tokens.

To create a working GGUF file, make the following adjustments:

  1. Remove the "rope_scaling": {} entry from config.json
  2. Change "max_position_embeddings" to 8192 in config.json

These modifications should allow you to use the model with llama. Cpp, albeit with the mentioned context limitation.

Axolotl config

See axolotl config

Axolotl version: 0.4.1

base_model: IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: NewEden/Gryphe-3.5-16k-Subset
    type: sharegpt
    conversation: chatml
  - path: Epiculous/Synthstruct-Gens-v1.1-Filtered-n-Cleaned
    type: sharegpt
    conversation: chatml
  - path: anthracite-org/kalo-opus-instruct-22k-no-refusal
    type: sharegpt
    conversation: chatml
  - path: PJMixers/lodrick-the-lafted_OpusStories-ShareGPT
    type: sharegpt
    conversation: chatml

chat_template: chatml

val_set_size: 0.01
output_dir: ./outputs/out

adapter:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:

sequence_len: 16384
# sequence_len: 32768
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

wandb_project: 
wandb_entity:
wandb_watch:
wandb_name: 
wandb_log_model:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
#optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.00002
weight_decay: 0.05

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1

debug:
deepspeed: /workspace/axolotl/deepspeed_configs/zero2.json
#deepspeed:
fsdp:
fsdp_config:

special_tokens:
  pad_token: <|finetune_right_pad_id|>

Credits

Training

The training was done for 2 epochs. We used 2 x RTX 6000s GPUs graciously provided by Kubernetes_Bad for the full-parameter fine-tuning of the model.

Built with Axolotl

Safety

...