---
base_model: unsloth/Mistral-Nemo-Base-2407
language:
  - en
license: apache-2.0
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - mistral
  - trl
  - rp
  - writing
  - gguf
  - experimental
  - long-context
---

# Uploaded model

- **Developed by:** UsernameJustAnother
- **License:** apache-2.0
- **Finetuned from model:** unsloth/Mistral-Nemo-Base-2407

Standard disclaimer: This is me teaching myself the basics of fine-tuning, with notes extensively borrowed from https://huggingface.co/nothingiisreal/MN-12B-Celeste-V1.9

This is a Q8_0 GGUF quant of UsernameJustAnother/Nemo-12B-Marlin-v8.
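
If you just want to run the quant locally, a minimal llama-cpp-python sketch looks roughly like this. The filename, context size, and sampling settings below are my placeholders, not something shipped with this repo:

```python
# Minimal inference sketch, assuming llama-cpp-python is installed and the Q8_0
# file has been downloaded locally. The filename is a placeholder, not the exact
# name of the file in this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="Nemo-12B-Marlin-v8-Q8_0.gguf",  # placeholder path
    n_ctx=8192,            # context window; raise or lower to fit your RAM/VRAM
    chat_format="chatml",  # the model was trained on ChatML-formatted data
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a creative writing partner."},
        {"role": "user", "content": "Open a short story set on a fishing trawler."},
    ],
    max_tokens=256,
    temperature=0.8,
)
print(out["choices"][0]["message"]["content"])
```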

New for v8:

- Fine-tuned on Nemo Base instead of Instruct, because why not?
- FULL BORE MODE: ACTIVATE! 10K-ish records of mostly-human convos and stories, curated by me, trained in ChatML (see the format sketch after this list), up from 8K in v6. Specifically:
  - 4K records from Reddit Writing Prompts (an equal split of highest-rated SFW & NSFW).
  - 2K of Claude instruct, lightly curated & de-clauded.
  - 2K of curated Fallen Skies.
  - 2K of curated/lightly de-ministrated C2 chat.
- Trained on a single 80GB A100 from runpod.io with a batch size of 8 (up from 2 on an A100 40GB), so far fewer steps were involved.

I pulled v7 because I honestly don't think it's as good as v6, and don't want folks to get the wrong idea that it's better just because the version number is higher.

Props again to Unsloth.ai for letting me train this on a single A100 with variable (wildly variable) context length.

Here's what the train/eval loss looked like:

*(figure: train/eval loss curves)*

I still don't know what makes training loss drop at the end of epoch 1, or why eval loss doesn't drop down to match (it continues to decrease, but slowly).

It was trained with the following settings:


```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 256,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,  #   128 / sqrt(256) gives a scaling factor of 8
    lora_dropout = 0.1, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # setting the adapter scaling factor to lora_alpha/math.sqrt(r) instead of lora_alpha/r
    loftq_config = None, # LoftQ not used
)
```
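
For context, the `model` handle passed into `get_peft_model` above comes from Unsloth's loader. A rough sketch of that preceding step, with values that are my assumptions rather than anything documented here:

```python
# Rough sketch of the load step; max_seq_length and load_in_4bit are assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Mistral-Nemo-Base-2407",
    max_seq_length = 8192,  # assumption; the card only notes wildly variable context lengths
    dtype = None,           # auto-detect (bf16 on an A100)
    load_in_4bit = False,   # assumption; 80GB is enough to hold the 12B weights in bf16
)
```

The learning-rate scheduler and trainer arguments were: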

```python
lr_scheduler_kwargs = {
    'min_lr': 0.0000024  # Adjust this value as needed
}

        per_device_train_batch_size = 8,
        per_device_eval_batch_size = 8,
        gradient_accumulation_steps = 4,
        eval_accumulation_steps = 4,
        prediction_loss_only = True, # When performing evaluation and generating predictions, only returns the loss.
        warmup_steps = 50,
        num_train_epochs = 2, # roughly 12 hrs/epoch on this setup
        learning_rate = 5e-5, 
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        fp16_full_eval = True, # stops eval from trying to use fp32
        eval_strategy = "steps", # 'no', 'steps', 'epoch'. Don't use this without an eval dataset etc
        eval_steps = 50, # is eval_strat is set to 'steps', do every N steps.
        logging_steps = 5, # so eval and logging happen on the same schedule
        optim = "adamw_8bit", # 
        weight_decay = 0, # up from 0
        lr_scheduler_type = "cosine_with_min_lr", # linear, cosine, cosine_with_min_lr, default linear
        lr_scheduler_kwargs = lr_scheduler_kwargs, # needed for cosine_with_min_lr
        seed = 3407,
```
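
Those keyword arguments go into the Hugging Face `TrainingArguments` that the TRL `SFTTrainer` consumes. A skeleton of how the pieces fit together, with the dataset variables, text field, and `output_dir` as placeholders:

```python
# Skeleton only: dataset variables, dataset_text_field, and output_dir are placeholders.
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,  # placeholder: the curated ChatML dataset
    eval_dataset = eval_dataset,    # placeholder
    dataset_text_field = "text",    # assumption
    max_seq_length = 8192,          # assumption, matching the loader sketch above
    args = TrainingArguments(
        output_dir = "outputs",
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        # ...plus the rest of the keyword arguments listed above...
        lr_scheduler_type = "cosine_with_min_lr",
        lr_scheduler_kwargs = lr_scheduler_kwargs,
        seed = 3407,
    ),
)
trainer.train()
```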

This Mistral model was trained 2x faster with Unsloth and Hugging Face's TRL library.