Explain these Benchmark Results

by Joseph717171

@chargoddard, @Crystalcareai, @Undi95 Please explain these benchmark results. How can merging an instruct model with its ancestral base model improve the model on each benchmark? Why is there no degradation in performance or loss in benchmark scores like we typically see in model merges? 🤔

Joseph717171/Llama-3.1-SuperNova-8B-Lite_TIES_with_Base's Benchmarks

| Metric | Value |
|---|---|
| Average Score | 43.06 |
| IFEval (0-Shot) | 80.96 |
| BBH (3-Shot) | 51.10 |
| MATH Lvl 5 (4-Shot) | 15.56 |
| GPQA (0-shot) | 30.96 |
| MuSR (0-shot) | 41.01 |
| MMLU-PRO (5-shot) | 38.80 |

arcee-ai/Llama-3.1-SuperNova-Lite's Benchmarks

| Metric | Value |
|---|---|
| Average Score | 29.73 |
| IFEval (0-Shot) | 80.17 |
| BBH (3-Shot) | 31.57 |
| MATH Lvl 5 (4-Shot) | 15.48 |
| GPQA (0-shot) | 7.49 |
| MuSR (0-shot) | 11.67 |
| MMLU-PRO (5-shot) | 31.97 |

Mergekit Config

```yaml
models:
  - model: "/Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite"
    parameters:
      weight: 1
      density: 1

  - model: "/Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite"
    parameters:
      weight: 1
      density: 1

merge_method: ties
base_model: "/Users/jsarnecki/opt/Workspace/meta-llama/Llama-3.1-8B"
parameters:
  density: 1
  normalize: true
  int8_mask: true
dtype: bfloat16
```
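For anyone following along: a config like this runs with mergekit's `mergekit-yaml` CLI (`mergekit-yaml config.yaml ./output-model`). Conceptually, TIES operates on task vectors (fine-tuned weights minus base weights): it trims each vector to its highest-magnitude entries according to `density`, elects a per-parameter sign, and averages only the entries that agree with the elected sign. Here is a minimal per-tensor sketch of that arithmetic (a hypothetical helper, not mergekit's actual implementation):

```python
import torch

def ties_merge_tensor(base: torch.Tensor, tuned: list[torch.Tensor],
                      weights: list[float], density: float = 1.0) -> torch.Tensor:
    """Sketch of TIES on one weight tensor: trim, elect sign, disjoint mean."""
    taus = []
    for t, w in zip(tuned, weights):
        tau = (t - base) * w                       # weighted task vector
        if density < 1.0:                          # trim: keep top-density fraction
            k = max(1, int(tau.numel() * density))
            cutoff = tau.abs().flatten().kthvalue(tau.numel() - k + 1).values
            tau = torch.where(tau.abs() >= cutoff, tau, torch.zeros_like(tau))
        taus.append(tau)
    stacked = torch.stack(taus)
    elected = stacked.sum(dim=0).sign()            # per-parameter sign election
    agree = (stacked.sign() == elected).float()    # entries matching elected sign
    merged = (stacked * agree).sum(0) / agree.sum(0).clamp(min=1.0)
    return base + merged                           # add merged task vector to base
```

Note that with both model entries pointing at the same checkpoint, `weight: 1`, and `density: 1`, the trim step is a no-op and the sign election is trivially unanimous, so the arithmetic above should essentially reapply SuperNova-Lite's full task vector to the base (up to dtype and masking effects), which makes the score gap all the more puzzling.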

I'm toying with that too; Jsarnecki spoke to me about it as well. For the very specific tasks it was originally trained on, it seems to follow instructions a bit less, that's the feedback I got from outside. Otherwise it seems to make the model more solid (imo) and can even make the writing better (though more average/flowery). I haven't run enough tries to be sure about anything yet.

Merging with a previous checkpoint is a wonderful regularization technique. But more research is definitely needed on how/when/why to merge and why exactly it can work as well as it does.
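A simple way to picture checkpoint merging as regularization is plain linear interpolation between a fine-tuned checkpoint and its base, the idea behind model-soup / WiSE-FT style weight averaging (a related but simpler technique than TIES). A minimal sketch over hypothetical state dicts, not any particular library's API:

```python
import torch

def interpolate_checkpoints(base_sd: dict, tuned_sd: dict,
                            alpha: float = 0.5) -> dict:
    """Blend a fine-tuned checkpoint back toward its base model.

    alpha=1.0 keeps the fine-tuned weights unchanged; smaller values
    pull the model toward the base, trading some task fit for robustness.
    """
    return {
        name: torch.lerp(base_sd[name].float(), t.float(), alpha)
        if t.is_floating_point() else t            # leave int buffers untouched
        for name, t in tuned_sd.items()
    }
```

For example, `interpolate_checkpoints(base.state_dict(), tuned.state_dict(), alpha=0.7)` keeps 70% of the fine-tune's task vector.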
