---
base_model:
- cstr/llama3.1-8b-spaetzle-v85
- cstr/llama3.1-8b-spaetzle-v86
- cstr/llama3.1-8b-spaetzle-v74
tags:
- merge
- mergekit
- lazymergekit
- cstr/llama3.1-8b-spaetzle-v85
- cstr/llama3.1-8b-spaetzle-v86
- cstr/llama3.1-8b-spaetzle-v74
license: llama3
language:
- en
- de
---

# llama3.1-8b-spaetzle-v90

llama3.1-8b-spaetzle-v90 is a progressive merge of merges.

# evaluation

German EQ-Bench v2_de: 69.93 (171/171). English (v2): 77.88 (171/171).

[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__llama3.1-8b-spaetzle-v90).

|      Metric       |Value|
|-------------------|----:|
|Avg.               |27.59|
|IFEval (0-Shot)    |73.56|
|BBH (3-Shot)       |32.76|
|MATH Lvl 5 (4-Shot)|13.37|
|GPQA (0-shot)      | 4.36|
|MuSR (0-shot)      |11.15|
|MMLU-PRO (5-shot)  |30.34|

|                                      Model                                     |AGIEval|TruthfulQA|Bigbench|
|--------------------------------------------------------------------------------|------:|---------:|-------:|
|[llama3.1-8b-spaetzle-v90](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90)|  42.05|      57.2|   44.75|

### AGIEval

|             Task             |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |24.02|±  |  2.69|
|                              |       |acc_norm|23.62|±  |  2.67|
|agieval_logiqa_en             |      0|acc     |40.09|±  |  1.92|
|                              |       |acc_norm|39.78|±  |  1.92|
|agieval_lsat_ar               |      0|acc     |22.17|±  |  2.75|
|                              |       |acc_norm|21.74|±  |  2.73|
|agieval_lsat_lr               |      0|acc     |50.39|±  |  2.22|
|                              |       |acc_norm|45.29|±  |  2.21|
|agieval_lsat_rc               |      0|acc     |64.31|±  |  2.93|
|                              |       |acc_norm|58.36|±  |  3.01|
|agieval_sat_en                |      0|acc     |81.07|±  |  2.74|
|                              |       |acc_norm|73.79|±  |  3.07|
|agieval_sat_en_without_passage|      0|acc     |45.15|±  |  3.48|
|                              |       |acc_norm|38.83|±  |  3.40|
|agieval_sat_math              |      0|acc     |40.91|±  |  3.32|
|                              |       |acc_norm|35.00|±  |  3.22|

Average: 42.05%

### TruthfulQA

|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |39.66|±  |  1.71|
|             |       |mc2   |57.20|±  |  1.51|

Average: 57.2%

### Bigbench

|                      Task                      |Version|        Metric       |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|58.42|±  |  3.59|
|bigbench_date_understanding                     |      0|multiple_choice_grade|70.46|±  |  2.38|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|31.40|±  |  2.89|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|33.43|±  |  2.49|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|30.00|±  |  2.05|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|24.29|±  |  1.62|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|56.00|±  |  2.87|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|38.20|±  |  2.18|
|bigbench_navigate                               |      0|multiple_choice_grade|50.20|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|69.50|±  |  1.03|
|bigbench_ruin_names                             |      0|multiple_choice_grade|54.46|±  |  2.36|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|32.77|±  |  1.49|
|bigbench_snarks                                 |      0|multiple_choice_grade|65.19|±  |  3.55|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|50.30|±  |  1.59|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|45.70|±  |  1.58|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.08|±  |  1.17|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|17.03|±  |  0.90|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|56.00|±  |  2.87|

Average: 44.75%

# merge tree

The merge tree involves the following models:

- NousResearch/Hermes-3-Llama-3.1-8B
- Undi95/Meta-Llama-3.1-8B-Claude
- Dampfinchen/Llama-3.1-8B-Ultra-Instruct
- VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct
- akjindal53244/Llama-3.1-Storm-8B
- nbeerbower/llama3.1-gutenberg-8B
- Undi95/Meta-Llama-3.1-8B-Claude
- DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1
- nbeerbower/llama-3-wissenschaft-8B-v2
- Azure99/blossom-v5-llama3-8b
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
- princeton-nlp/Llama-3-Instruct-8B-SimPO
- Locutusque/llama-3-neural-chat-v1-8b
- Locutusque/Llama-3-Orca-1.0-8B
- DiscoResearch/Llama3_DiscoLM_German_8b_v0.1_experimental
- seedboxai/Llama-3-Kafka-8B-v0.2
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
- nbeerbower/llama-3-wissenschaft-8B-v2
- mlabonne/Daredevil-8B-abliterated-dpomix

The merge was built up in a number of steps; among them, slerp-merging only the middle layers to compensate for tokenizer / chat template differences. An illustration of this is given in the second configuration below, and rough conceptual sketches of both merge methods are appended at the end of this card.

## 🧩 Configuration

The final merge for this was:

```yaml
models:
  - model: cstr/llama3.1-8b-spaetzle-v59
    # no parameters necessary for base model
  - model: cstr/llama3.1-8b-spaetzle-v85
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v86
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v74
    parameters:
      density: 0.65
      weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```

Among the previous steps:

```yaml
models:
  - model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
  t:
    - value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
```

## 💻 Usage

Use with the llama3 chat template, as usual. GGUF quants for use with llama.cpp and wrappers such as ollama are available here: [cstr/llama3.1-8b-spaetzle-v90-GGUF](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90-GGUF).
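
Below is a minimal sketch of loading the full-precision model with 🤗 transformers and using the tokenizer's built-in Llama-3 chat template; the prompt is only an illustration.

```python
# Minimal sketch: load the merged model and apply its Llama-3 chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cstr/llama3.1-8b-spaetzle-v90"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)

messages = [{"role": "user", "content": "Wer war Johannes Kepler?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```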
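
For context on the `density` and `weight` parameters in the dare_ties configuration above, here is a rough conceptual sketch of the DARE drop-and-rescale step. It is an illustration only, not mergekit's implementation, which additionally performs TIES-style sign election across donors and operates tensor by tensor.

```python
# Conceptual sketch of DARE drop-and-rescale (illustration, not mergekit code).
# Each donor's delta vs. the base is randomly sparsified to `density`,
# rescaled to preserve expected magnitude, and added back with `weight`.
import torch

def dare_delta(base: torch.Tensor, donor: torch.Tensor,
               density: float, weight: float) -> torch.Tensor:
    delta = donor - base
    keep_mask = torch.bernoulli(torch.full_like(delta, density))
    return weight * (delta * keep_mask) / density

# With the settings from the config above: density 0.65, weight 0.3 per donor.
base = torch.randn(16, 16)
donors = [base + 0.05 * torch.randn(16, 16) for _ in range(3)]
merged = base + sum(dare_delta(base, d, density=0.65, weight=0.3) for d in donors)
```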
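
Similarly, a rough sketch of the layer-wise slerp used in the intermediate step: with the `t` schedule from the config above, the outer layers (t = 0) keep the base model's weights untouched, while the middle layers are interpolated toward the donor. The `slerp` helper here is a simplified stand-in, not mergekit's implementation.

```python
# Simplified spherical interpolation between two weight tensors (illustration).
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    a_flat, b_flat = a.flatten(), b.flatten()
    a_unit = a_flat / (a_flat.norm() + eps)
    b_unit = b_flat / (b_flat.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_unit, b_unit), -1.0, 1.0))
    sin_omega = torch.sin(omega)
    if sin_omega.abs() < eps:  # nearly parallel vectors: fall back to lerp
        return ((1 - t) * a_flat + t * b_flat).reshape(a.shape)
    out = (torch.sin((1 - t) * omega) / sin_omega) * a_flat \
        + (torch.sin(t * omega) / sin_omega) * b_flat
    return out.reshape(a.shape)

# t per layer group, as in the slerp config: 0 at the ends, 0.7 in the middle.
t_schedule = [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
base_layer = torch.randn(8, 8)
donor_layer = base_layer + 0.1 * torch.randn(8, 8)
blended_layers = [slerp(base_layer, donor_layer, t) for t in t_schedule]
```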