
Brinebreath-Llama-3.1-70B

I made this model after running into some problems with Cathallama. It has behaved well over a few days of testing.

Notable Performance

  • 7 percentage point increase in overall MMLU-PRO success rate over Meta-Llama-3.1-70B-Instruct at Q4_0 (49.0% vs 42.0%)
  • Strong performance in MMLU-PRO categories overall
  • Great performance during manual testing

Creation workflow

Models merged

  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • NousResearch/Hermes-3-Llama-3.1-70B
  • abacusai/Dracarys-Llama-3.1-70B-Instruct
  • VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
```mermaid
flowchart TD
    A[Hermes 3] -->|Merge with| B[Meta-Llama-3.1]
    C[Dracarys] -->|Merge with| D[Meta-Llama-3.1]
    B --> E[Merge]
    D --> E
    G[SauerkrautLM] -->|Merge with| E
    E --> F[Brinebreath]
```
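
The card does not state the merge tool, algorithm, or weights used for each arrow in the diagram. As a rough illustration only, a single pairwise step could be approximated as an equal-weight linear blend of two checkpoints with transformers and torch; the model names below come from the list above, while the method, the 0.5/0.5 weights, and the output path are assumptions.

```python
# Illustrative sketch of ONE pairwise merge step (e.g. Hermes 3 + Meta-Llama-3.1).
# Assumption: a plain equal-weight linear blend; the actual merge method and
# weights used for Brinebreath are not documented in this card.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct", torch_dtype=torch.bfloat16)
donor = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Hermes-3-Llama-3.1-70B", torch_dtype=torch.bfloat16)

donor_sd = donor.state_dict()
merged_sd = {}
with torch.no_grad():
    for name, tensor in base.state_dict().items():
        # Equal-weight blend of every parameter tensor (assumed ratio).
        merged_sd[name] = 0.5 * tensor + 0.5 * donor_sd[name]

base.load_state_dict(merged_sd)
base.save_pretrained("merged-step-output")  # hypothetical output path
```

Note that this loads both checkpoints fully into memory, so it needs a machine with room for two 70B models; dedicated merge tools process the tensors shard by shard instead.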


Testing

Hyperparameters

  • Temperature: 0.0 for automated, 0.9 for manual
  • Penalize repeat sequence: 1.05
  • Consider N tokens for penalize: 256
  • Penalize repetition of newlines
  • Top-K sampling: 40
  • Top-P sampling: 0.95
  • Min-P sampling: 0.05

llama.cpp version

  • b3600-1-g2339a0be
  • -fa -ngl -1 -ctk f16 --no-mmap
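
For reference, the same settings can be approximated with llama-cpp-python instead of the llama.cpp CLI; this is an assumption about tooling (the tests above used the b3600 CLI build), and the prompt is just a placeholder taken from the manual test list.

```python
# Sketch: reproducing the load flags and sampling hyperparameters above with
# llama-cpp-python (assumed tooling; the card's tests used the llama.cpp CLI).
from llama_cpp import Llama

llm = Llama(
    model_path="Brinebreath-Llama-3.1-70B.Q4_0.gguf",
    n_gpu_layers=-1,         # -ngl -1: offload all layers
    flash_attn=True,         # -fa
    use_mmap=False,          # --no-mmap
    last_n_tokens_size=256,  # consider the last 256 tokens for the repeat penalty
    # -ctk f16 (f16 K cache) is already the default here.
)

out = llm(
    "Which is bigger, 9.11 or 9.9?",  # placeholder prompt from the manual tests
    max_tokens=256,
    temperature=0.9,     # 0.0 was used for the automated runs
    repeat_penalty=1.05,
    top_k=40,
    top_p=0.95,
    min_p=0.05,
)
print(out["choices"][0]["text"])
```

The newline-repetition penalty from the list above is a llama.cpp CLI option and is not set explicitly in this sketch.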

Tested Files

  • Brinebreath-Llama-3.1-70B.Q4_0.gguf
  • Meta-Llama-3.1-70B-Instruct.Q4_0.gguf

Manual testing

| Category | Test Case | Brinebreath-Llama-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
|---|---|---|---|
| Common Sense | Ball on cup | OK | OK |
| | Big duck small horse | OK | OK |
| | Killers | OK | OK |
| | Strawberry r's | KO | KO |
| | 9.11 or 9.9 bigger | KO | KO |
| | Dragon or lens | KO | KO |
| | Shirts | OK | KO |
| | Sisters | OK | KO |
| | Jane faster | OK | OK |
| Programming | JSON | OK | OK |
| | Python snake game | OK | KO |
| Math | Door window combination | OK | KO |
| Smoke | Poem | OK | OK |
| | Story | OK | OK |

Note: See sample_generations.txt in the root folder of the repo for the raw generations.

MMLU-PRO

| Model | Success % |
|---|---|
| Brinebreath-3.1-70B.Q4_0.gguf | 49.0% |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 42.0% |

| MMLU-PRO category | Brinebreath-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
|---|---|---|
| Business | 45.0% | 40.0% |
| Law | 40.0% | 35.0% |
| Psychology | 85.0% | 80.0% |
| Biology | 80.0% | 75.0% |
| Chemistry | 50.0% | 45.0% |
| History | 65.0% | 60.0% |
| Other | 55.0% | 50.0% |
| Health | 70.0% | 65.0% |
| Economics | 80.0% | 75.0% |
| Math | 35.0% | 30.0% |
| Physics | 45.0% | 40.0% |
| Computer Science | 60.0% | 55.0% |
| Philosophy | 50.0% | 45.0% |
| Engineering | 45.0% | 40.0% |

Note: MMLU-PRO overall score tested with 100 questions. Each category tested with 20 questions.
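
The evaluation harness is not included in the card. A rough sketch of how the per-category numbers could be reproduced: sample 20 questions per category from the TIGER-Lab/MMLU-Pro test split, prompt the model greedily, and count exact letter matches. The dataset field names, prompt format, and answer-extraction regex are assumptions, and `generate` stands in for whatever function calls the model.

```python
# Sketch of a simple MMLU-PRO category evaluation (assumed harness, not the
# one used for the numbers above).
import re
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

LETTERS = "ABCDEFGHIJ"  # MMLU-PRO questions have up to 10 options

def make_prompt(row):
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(row["options"]))
    return (f"{row['question']}\n{options}\n"
            "Answer with the letter of the correct option only.\nAnswer:")

# Take the first 20 questions of each category, as in the note above.
per_category = defaultdict(list)
for row in ds:
    if len(per_category[row["category"]]) < 20:
        per_category[row["category"]].append(row)

def category_scores(generate):  # `generate`: prompt string -> model reply string
    scores = {}
    for category, rows in per_category.items():
        correct = 0
        for row in rows:
            reply = generate(make_prompt(row))
            match = re.search(r"\b([A-J])\b", reply)
            correct += bool(match and match.group(1) == row["answer"])
        scores[category] = 100.0 * correct / len(rows)
    return scores
```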

PubmedQA

| Model | Success % |
|---|---|
| Brinebreath-3.1-70B.Q4_0.gguf | 71.00% |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 68.00% |

Note: PubmedQA tested with 100 questions.
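
The PubmedQA harness is likewise not included. A minimal sketch under assumptions: the qiaojin/PubMedQA "pqa_labeled" subset, its question/context/final_decision fields, 100 questions taken from the front of the split, and a simple yes/no/maybe string match.

```python
# Sketch of a PubmedQA yes/no/maybe check (assumed dataset fields and scoring).
from datasets import load_dataset

pubmed = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train").select(range(100))

def pubmedqa_accuracy(generate):  # `generate`: prompt string -> model reply string
    correct = 0
    for row in pubmed:
        context = " ".join(row["context"]["contexts"])
        prompt = (f"Context: {context}\nQuestion: {row['question']}\n"
                  "Answer yes, no, or maybe.\nAnswer:")
        reply = generate(prompt).strip().lower()
        correct += reply.startswith(row["final_decision"])  # label is "yes"/"no"/"maybe"
    return 100.0 * correct / len(pubmed)
```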

Request

If you are hiring in the EU or can sponsor a visa, PM me :D

PS. Thank you mradermacher for the GGUFs!

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 36.29 |
| IFEval (0-Shot) | 55.33 |
| BBH (3-Shot) | 55.46 |
| MATH Lvl 5 (4-Shot) | 29.98 |
| GPQA (0-shot) | 12.86 |
| MuSR (0-shot) | 17.49 |
| MMLU-PRO (5-shot) | 46.62 |