sarath-shekkizhar committed on
Commit
b36ec51
1 Parent(s): d37e3d2

Update README.md

Files changed (1)
  1. README.md +14 -2
README.md CHANGED
@@ -92,8 +92,8 @@ Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 chall
  | gpt-4-0125-preview | 78.0 | 95% CI: (-1.8, 2.2) |
  | claude-3-opus-20240229 | 60.4 | 95% CI: (-2.6, 2.1) |
  | gpt-4-0314 | 50.0 | 95% CI: (0.0, 0.0) |
- | **tenyx/Llama3-TenyxChat-70B** | 49.0 | 95% CI: (-3.0, 2.4) |
- | meta-llama/Meta-Llama-3-70B-In | 47.3 | 95% CI: (-1.7, 2.6) |
+ | **tenyx/Llama3-TenyxChat-70B** | **49.0** | 95% CI: (-3.0, 2.4) |
+ | *meta-llama/Meta-Llama-3-70B-In* | 47.3 | 95% CI: (-1.7, 2.6) |
  | claude-3-sonnet-20240229 | 46.8 | 95% CI: (-2.7, 2.3) |
  | claude-3-haiku-20240307 | 41.5 | 95% CI: (-2.4, 2.5) |
  | gpt-4-0613 | 37.9 | 95% CI: (-2.1, 2.2) |
@@ -101,6 +101,18 @@ Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 chall
  | Qwen1.5-72B-Chat | 36.1 | 95% CI: (-2.1, 2.4) |
  | command-r-plus | 33.1 | 95% CI: (-2.0, 1.9) |

+ ## Open LLM Leaderboard Evaluation
+
+ We now present our results on the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), the framework used to benchmark models for the Open LLM Leaderboard on Hugging Face.
+ The evaluation covers `6` key benchmarks spanning reasoning and knowledge, each with its own *few-shot* setting. Read more details about the benchmark at [the leaderboard page](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+
+ | | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
+ | --- | --- | --- | --- | --- | --- | --- | --- |
+ | **Llama3-TenyxChat-70B** | **79.43** | 72.53 | 86.11 | 79.95 | 62.93 | 83.82 | 91.21 |
+ | *Llama3-70B-Instruct* | 77.88 | 71.42 | 85.69 | 80.06 | 61.81 | 82.87 | 85.44 |
+
+ *The results reported are from a local evaluation of our model. `tenyx/Llama3-TenyxChat-70B` has been submitted and will be reflected on the leaderboard once the evaluation completes.
+
  # Limitations

  Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observation, the model still tends to struggle with tasks that involve reasoning and math questions. In some instances, it might generate verbose or extraneous content.
 
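The leaderboard paragraph added in this commit reports numbers from a local run of EleutherAI's lm-evaluation-harness. Below is a minimal sketch of how such a run might look; it is not part of the commit. It assumes lm-evaluation-harness v0.4+ with its `simple_evaluate` API, and the task names and few-shot counts are taken from the Open LLM Leaderboard's conventional settings, so they may need adjusting to your installed version and hardware.

```python
# Sketch: evaluating tenyx/Llama3-TenyxChat-70B on the six Open LLM Leaderboard
# tasks with EleutherAI's lm-evaluation-harness (assumes v0.4+ Python API).
# Note: a 70B model needs multi-GPU or offloaded inference in practice.
import lm_eval

# Leaderboard tasks with their conventional few-shot settings
# (assumption: task names as registered in lm-evaluation-harness v0.4).
TASKS = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}

for task, n_shot in TASKS.items():
    out = lm_eval.simple_evaluate(
        model="hf",  # Hugging Face transformers backend
        model_args="pretrained=tenyx/Llama3-TenyxChat-70B",
        tasks=[task],
        num_fewshot=n_shot,
        batch_size="auto",
    )
    # Print every metric dict the harness returns for this task (and any subtasks).
    for name, metrics in out["results"].items():
        print(f"{name} ({n_shot}-shot): {metrics}")
```

The harness reports its own metric names per task (for example, normalized accuracy on ARC and HellaSwag), so leaderboard-style averages are assembled from those per-task dictionaries rather than from a single score.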