---
base_model:
- cstr/llama3.1-8b-spaetzle-v85
- cstr/llama3.1-8b-spaetzle-v86
- cstr/llama3.1-8b-spaetzle-v74
tags:
- merge
- mergekit
- lazymergekit
- cstr/llama3.1-8b-spaetzle-v85
- cstr/llama3.1-8b-spaetzle-v86
- cstr/llama3.1-8b-spaetzle-v74
license: llama3
language:
- en
- de
---

# llama3.1-8b-spaetzle-v90

llama3.1-8b-spaetzle-v90 is a progressive merge of merges.

# evaluation

German EQ-Bench v2_de: 69.93 (171/171). English (v2): 77.88 (171/171).

[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__llama3.1-8b-spaetzle-v90).

|      Metric       |Value|
|-------------------|----:|
|Avg.               |27.59|
|IFEval (0-Shot)    |73.56|
|BBH (3-Shot)       |32.76|
|MATH Lvl 5 (4-Shot)|13.37|
|GPQA (0-shot)      | 4.36|
|MuSR (0-shot)      |11.15|
|MMLU-PRO (5-shot)  |30.34|

|                                      Model                                     |AGIEval|TruthfulQA|Bigbench|
|--------------------------------------------------------------------------------|------:|---------:|-------:|
|[llama3.1-8b-spaetzle-v90](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90)|  42.05|      57.2|   44.75|

### AGIEval

|             Task             |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |24.02|±  |  2.69|
|                              |       |acc_norm|23.62|±  |  2.67|
|agieval_logiqa_en             |      0|acc     |40.09|±  |  1.92|
|                              |       |acc_norm|39.78|±  |  1.92|
|agieval_lsat_ar               |      0|acc     |22.17|±  |  2.75|
|                              |       |acc_norm|21.74|±  |  2.73|
|agieval_lsat_lr               |      0|acc     |50.39|±  |  2.22|
|                              |       |acc_norm|45.29|±  |  2.21|
|agieval_lsat_rc               |      0|acc     |64.31|±  |  2.93|
|                              |       |acc_norm|58.36|±  |  3.01|
|agieval_sat_en                |      0|acc     |81.07|±  |  2.74|
|                              |       |acc_norm|73.79|±  |  3.07|
|agieval_sat_en_without_passage|      0|acc     |45.15|±  |  3.48|
|                              |       |acc_norm|38.83|±  |  3.40|
|agieval_sat_math              |      0|acc     |40.91|±  |  3.32|
|                              |       |acc_norm|35.00|±  |  3.22|

Average: 42.05%

### TruthfulQA

|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |39.66|±  |  1.71|
|             |       |mc2   |57.20|±  |  1.51|

Average: 57.2%

### Bigbench

|                      Task                      |Version|        Metric       |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|58.42|±  |  3.59|
|bigbench_date_understanding                     |      0|multiple_choice_grade|70.46|±  |  2.38|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|31.40|±  |  2.89|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|33.43|±  |  2.49|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|30.00|±  |  2.05|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|24.29|±  |  1.62|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|56.00|±  |  2.87|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|38.20|±  |  2.18|
|bigbench_navigate                               |      0|multiple_choice_grade|50.20|±  |  1.58|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|69.50|±  |  1.03|
|bigbench_ruin_names                             |      0|multiple_choice_grade|54.46|±  |  2.36|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|32.77|±  |  1.49|
|bigbench_snarks                                 |      0|multiple_choice_grade|65.19|±  |  3.55|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|50.30|±  |  1.59|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|45.70|±  |  1.58|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.08|±  |  1.17|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|17.03|±  |  0.90|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|56.00|±  |  2.87|

Average: 44.75%

# merge tree

The merge tree involves the following models:

- NousResearch/Hermes-3-Llama-3.1-8B
- Undi95/Meta-Llama-3.1-8B-Claude
- Dampfinchen/Llama-3.1-8B-Ultra-Instruct
- VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct
- akjindal53244/Llama-3.1-Storm-8B
- nbeerbower/llama3.1-gutenberg-8B
- Undi95/Meta-Llama-3.1-8B-Claude
- DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1
- nbeerbower/llama-3-wissenschaft-8B-v2
- Azure99/blossom-v5-llama3-8b
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
- princeton-nlp/Llama-3-Instruct-8B-SimPO
- Locutusque/llama-3-neural-chat-v1-8b
- Locutusque/Llama-3-Orca-1.0-8B
- DiscoResearch/Llama3_DiscoLM_German_8b_v0.1_experimental
- seedboxai/Llama-3-Kafka-8B-v0.2
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
- nbeerbower/llama-3-wissenschaft-8B-v2
- mlabonne/Daredevil-8B-abliterated-dpomix

The merge was built up in a number of steps; among them, slerp-merging only the middle layers to compensate for tokenizer / chat template differences. An illustration of this is given in the second configuration below, and rough conceptual sketches of both merge methods are appended at the end of this card.

## 🧩 Configuration

The final merge for this was:

```yaml
models:
  - model: cstr/llama3.1-8b-spaetzle-v59
    # no parameters necessary for base model
  - model: cstr/llama3.1-8b-spaetzle-v85
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v86
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v74
    parameters:
      density: 0.65
      weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```

Among the previous steps:

```yaml
models:
  - model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
  t:
    - value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
```

## 💻 Usage

Use with the llama3 chat template, as usual. GGUF quants for use with llama.cpp and wrappers such as ollama are available here: [cstr/llama3.1-8b-spaetzle-v90-GGUF](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90-GGUF).
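
Below is a minimal sketch of loading the full-precision model with 🤗 transformers and using the tokenizer's built-in Llama-3 chat template; the prompt is only an illustration.

```python
# Minimal sketch: load the merged model and apply its Llama-3 chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cstr/llama3.1-8b-spaetzle-v90"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)

messages = [{"role": "user", "content": "Wer war Johannes Kepler?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```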
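
For context on the `density` and `weight` parameters in the dare_ties configuration above, here is a rough conceptual sketch of the DARE drop-and-rescale step. It is an illustration only, not mergekit's implementation, which additionally performs TIES-style sign election across donors and operates tensor by tensor.

```python
# Conceptual sketch of DARE drop-and-rescale (illustration, not mergekit code).
# Each donor's delta vs. the base is randomly sparsified to `density`,
# rescaled to preserve expected magnitude, and added back with `weight`.
import torch

def dare_delta(base: torch.Tensor, donor: torch.Tensor,
               density: float, weight: float) -> torch.Tensor:
    delta = donor - base
    keep_mask = torch.bernoulli(torch.full_like(delta, density))
    return weight * (delta * keep_mask) / density

# With the settings from the config above: density 0.65, weight 0.3 per donor.
base = torch.randn(16, 16)
donors = [base + 0.05 * torch.randn(16, 16) for _ in range(3)]
merged = base + sum(dare_delta(base, d, density=0.65, weight=0.3) for d in donors)
```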
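
Similarly, a rough sketch of the layer-wise slerp used in the intermediate step: with the `t` schedule from the config above, the outer layers (t = 0) keep the base model's weights untouched, while the middle layers are interpolated toward the donor. The `slerp` helper here is a simplified stand-in, not mergekit's implementation.

```python
# Simplified spherical interpolation between two weight tensors (illustration).
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    a_flat, b_flat = a.flatten(), b.flatten()
    a_unit = a_flat / (a_flat.norm() + eps)
    b_unit = b_flat / (b_flat.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_unit, b_unit), -1.0, 1.0))
    sin_omega = torch.sin(omega)
    if sin_omega.abs() < eps:  # nearly parallel vectors: fall back to lerp
        return ((1 - t) * a_flat + t * b_flat).reshape(a.shape)
    out = (torch.sin((1 - t) * omega) / sin_omega) * a_flat \
        + (torch.sin(t * omega) / sin_omega) * b_flat
    return out.reshape(a.shape)

# t per layer group, as in the slerp config: 0 at the ends, 0.7 in the middle.
t_schedule = [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
base_layer = torch.randn(8, 8)
donor_layer = base_layer + 0.1 * torch.randn(8, 8)
blended_layers = [slerp(base_layer, donor_layer, t) for t in t_schedule]
```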