---
language:
- en
license: llama3.1
library_name: transformers
tags:
- mergekit
- merge
base_model:
- meta-llama/Meta-Llama-3.1-70B-Instruct
- NousResearch/Hermes-3-Llama-3.1-70B
- abacusai/Dracarys-Llama-3.1-70B-Instruct
- VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
model-index:
- name: Brinebreath-Llama-3.1-70B
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 55.33
name: strict accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 55.46
name: normalized accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 29.98
name: exact match
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 12.86
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 17.49
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 46.62
name: accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=gbueno86/Brinebreath-Llama-3.1-70B
name: Open LLM Leaderboard
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/649dc85249ae3a68334adcc6/yDDOz1fsWfSviCGtCh3f3.png)
**Brinebreath-Llama-3.1-70B**
=====================================
I made this model after running into problems with Cathallama. It has behaved well over several days of testing.
**Notable Performance**
* 7 percentage point increase in overall MMLU-PRO success rate over Meta-Llama-3.1-70B-Instruct at Q4_0 (49% vs. 42%)
* Higher scores than the base model in every MMLU-PRO category tested
* Passed several manual test cases that Meta-Llama-3.1-70B-Instruct failed (see Testing below)
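The merged weights load like any other Llama 3.1 checkpoint. Below is a minimal transformers usage sketch (not from the original card); the sampling values mirror the manual-testing hyperparameters listed under Testing.

```python
# Minimal usage sketch; not part of the original card. Sampling values
# mirror the manual-testing hyperparameters listed under Testing below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gbueno86/Brinebreath-Llama-3.1-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.9,
    top_k=40,
    top_p=0.95,
    min_p=0.05,
    repetition_penalty=1.05,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```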
**Creation workflow**
=====================
**Models merged**
* meta-llama/Meta-Llama-3.1-70B-Instruct
* NousResearch/Hermes-3-Llama-3.1-70B
* abacusai/Dracarys-Llama-3.1-70B-Instruct
* VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
```mermaid
flowchart TD
    A[Hermes 3] -->|Merge with| B[Meta-Llama-3.1]
    C[Dracarys] -->|Merge with| D[Meta-Llama-3.1]
    B --> E[Merge]
    D --> E
    G[SauerkrautLM] -->|Merge with| E
    E --> F[Brinebreath]
```
![image/png](https://cdn-uploads.huggingface.co/production/uploads/649dc85249ae3a68334adcc6/3cjOUfghMD2GvxL7a3SOh.png)
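The exact mergekit configuration is not part of this card. As a rough sketch, a single-pass config covering the same models as the flowchart above could look like the following; the merge method, dtype, and flattened structure are assumptions, not the published recipe.

```yaml
# Hypothetical mergekit config approximating the flowchart above.
# merge_method, dtype, and the single-pass structure are assumptions;
# the actual recipe was not published with this card.
merge_method: model_stock
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
models:
  - model: NousResearch/Hermes-3-Llama-3.1-70B
  - model: abacusai/Dracarys-Llama-3.1-70B-Instruct
  - model: VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
dtype: bfloat16
```

A config like this runs with `mergekit-yaml config.yaml ./output-dir`; the pairwise merges in the flowchart could equally be reproduced as separate mergekit passes.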
**Testing**
=====================
**Hyperparameters**
---------------
* **Temperature**: 0.0 for automated, 0.9 for manual
* **Penalize repeat sequence**: 1.05
* **Consider N tokens for penalize**: 256
* **Penalize repetition of newlines**
* **Top-K sampling**: 40
* **Top-P sampling**: 0.95
* **Min-P sampling**: 0.05
**llama.cpp Version**
------------------
* b3600-1-g2339a0be
* -fa -ngl -1 -ctk f16 --no-mmap
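Taken together, the flags and sampling settings above correspond to a llama.cpp invocation along these lines (binary path, model path, and prompt are placeholders; the exact command was not published):

```bash
./llama-cli -m Brinebreath-Llama-3.1-70B.Q4_0.gguf \
  -fa -ngl -1 -ctk f16 --no-mmap \
  --temp 0.9 --repeat-penalty 1.05 --repeat-last-n 256 --penalize-nl \
  --top-k 40 --top-p 0.95 --min-p 0.05 \
  -p "Write a short poem about the sea."
```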
**Tested Files**
------------------
* Brinebreath-Llama-3.1-70B.Q4_0.gguf
* Meta-Llama-3.1-70B-Instruct.Q4_0.gguf
**Manual testing**
| Category | Test Case | Brinebreath-Llama-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
| --- | --- | --- | --- |
| **Common Sense** | Ball on cup | OK | OK |
| | Big duck small horse | OK | OK |
| | Killers | OK | OK |
| | Strawberry r's | <span style="color: red;">KO</span> | <span style="color: red;">KO</span> |
| | 9.11 or 9.9 bigger | <span style="color: red;">KO</span> | <span style="color: red;">KO</span> |
| | Dragon or lens | <span style="color: red;">KO</span> | <span style="color: red;">KO</span> |
| | Shirts | OK | <span style="color: red;">KO</span> |
| | Sisters | OK | <span style="color: red;">KO</span> |
| | Jane faster | OK | OK |
| **Programming** | JSON | OK | OK |
| | Python snake game | OK | <span style="color: red;">KO</span> |
| **Math** | Door window combination | OK | <span style="color: red;">KO</span> |
| **Smoke** | Poem | OK | OK |
| | Story | OK | OK |
*Note: See [sample_generations.txt](https://huggingface.co/gbueno86/Brinebreath-Llama-3.1-70B/blob/main/sample_generations.txt) in the root folder of the repo for the raw generations.*
**MMLU-PRO**
| Model | Success % |
| --- | --- |
| Brinebreath-Llama-3.1-70B.Q4_0.gguf | **49.0%** |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 42.0% |

| MMLU-PRO category | Brinebreath-Llama-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
| --- | --- | --- |
| Business | **45.0%** | 40.0% |
| Law | **40.0%** | 35.0% |
| Psychology | **85.0%** | 80.0% |
| Biology | **80.0%** | 75.0% |
| Chemistry | **50.0%** | 45.0% |
| History | **65.0%** | 60.0% |
| Other | **55.0%** | 50.0% |
| Health | **70.0%** | 65.0% |
| Economics | **80.0%** | 75.0% |
| Math | **35.0%** | 30.0% |
| Physics | **45.0%** | 40.0% |
| Computer Science | **60.0%** | 55.0% |
| Philosophy | **50.0%** | 45.0% |
| Engineering | **45.0%** | 40.0% |
Note: MMLU-PRO overall was tested with 100 questions; each category was tested with 20 questions.
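The evaluation harness is not included in the card. For reference, a sample matching the note above could be drawn from the TIGER-Lab/MMLU-Pro test split like this (seed and sampling method are illustrative):

```python
# Sketch of the sampling described above: 100 questions for the overall
# score, 20 per category for the per-category scores. Seed and method
# are illustrative; the author's actual harness was not published.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

overall_sample = ds.shuffle(seed=42).select(range(100))
category_samples = {
    cat: ds.filter(lambda row, c=cat: row["category"] == c)
           .shuffle(seed=42)
           .select(range(20))
    for cat in sorted(set(ds["category"]))
}
```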
**PubmedQA**
| Model | Success % |
| --- | --- |
| Brinebreath-Llama-3.1-70B.Q4_0.gguf | **71.00%** |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 68.00% |
Note: PubmedQA was tested with 100 questions.
**Request**
--------------
If you are hiring in the EU or can sponsor a visa, PM me :D
P.S. Thanks to mradermacher for the GGUFs!
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_gbueno86__Brinebreath-Llama-3.1-70B).
| Metric |Value|
|-------------------|----:|
|Avg. |36.29|
|IFEval (0-Shot) |55.33|
|BBH (3-Shot) |55.46|
|MATH Lvl 5 (4-Shot)|29.98|
|GPQA (0-shot) |12.86|
|MuSR (0-shot) |17.49|
|MMLU-PRO (5-shot) |46.62|