sarath-shekkizhar committed on
Commit
b36ec51
1 Parent(s): d37e3d2

Update README.md

Files changed (1)
  1. README.md +14 -2
README.md CHANGED
@@ -92,8 +92,8 @@ Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 chall
  | gpt-4-0125-preview | 78.0 | 95% CI: (-1.8, 2.2) |
  | claude-3-opus-20240229 | 60.4 | 95% CI: (-2.6, 2.1) |
  | gpt-4-0314 | 50.0 | 95% CI: (0.0, 0.0) |
- | **tenyx/Llama3-TenyxChat-70B** | 49.0 | 95% CI: (-3.0, 2.4) |
- | meta-llama/Meta-Llama-3-70B-In | 47.3 | 95% CI: (-1.7, 2.6) |
+ | **tenyx/Llama3-TenyxChat-70B** | **49.0** | 95% CI: (-3.0, 2.4) |
+ | *meta-llama/Meta-Llama-3-70B-In* | 47.3 | 95% CI: (-1.7, 2.6) |
  | claude-3-sonnet-20240229 | 46.8 | 95% CI: (-2.7, 2.3) |
  | claude-3-haiku-20240307 | 41.5 | 95% CI: (-2.4, 2.5) |
  | gpt-4-0613 | 37.9 | 95% CI: (-2.1, 2.2) |
@@ -101,6 +101,18 @@ Arena-Hard is an evaluation tool for instruction-tuned LLMs containing 500 chall
  | Qwen1.5-72B-Chat | 36.1 | 95% CI: (-2.1, 2.4) |
  | command-r-plus | 33.1 | 95% CI: (-2.0, 1.9) |

+ ## Open LLM Leaderboard Evaluation
+
+ We now present our results on the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), the framework used to benchmark models for the Open LLM Leaderboard on Hugging Face.
+ The evaluation covers `6` key benchmarks spanning reasoning and knowledge, each with its own *few-shot* setting. Read more details about the benchmark at [the leaderboard page](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+
+ | | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
+ | --- | --- | --- | --- | --- | --- | --- | --- |
+ | **Llama3-TenyxChat-70B** | **79.43** | 72.53 | 86.11 | 79.95 | 62.93 | 83.82 | 91.21 |
+ | *Llama3-70B-Instruct* | 77.88 | 71.42 | 85.69 | 80.06 | 61.81 | 82.87 | 85.44 |
+
+ *The results reported are from a local evaluation of our model. `tenyx/Llama3-TenyxChat-70B` has been submitted and will be reflected on the leaderboard once the evaluation completes.
+
  # Limitations

  Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We haven’t fine-tuned the model explicitly to align with **human** safety preferences. Therefore, it is capable of producing undesirable outputs, particularly when adversarially prompted. From our observation, the model still tends to struggle with tasks that involve reasoning and math questions. In some instances, it might generate verbose or extraneous content.
 
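The leaderboard paragraph added in this commit reports numbers from a local run of EleutherAI's lm-evaluation-harness. Below is a minimal sketch of how such a run might look; it is not part of the commit. It assumes lm-evaluation-harness v0.4+ with its `simple_evaluate` API, and the task names and few-shot counts are taken from the Open LLM Leaderboard's conventional settings, so they may need adjusting to your installed version and hardware.

```python
# Sketch: evaluating tenyx/Llama3-TenyxChat-70B on the six Open LLM Leaderboard
# tasks with EleutherAI's lm-evaluation-harness (assumes v0.4+ Python API).
# Note: a 70B model needs multi-GPU or offloaded inference in practice.
import lm_eval

# Leaderboard tasks with their conventional few-shot settings
# (assumption: task names as registered in lm-evaluation-harness v0.4).
TASKS = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}

for task, n_shot in TASKS.items():
    out = lm_eval.simple_evaluate(
        model="hf",  # Hugging Face transformers backend
        model_args="pretrained=tenyx/Llama3-TenyxChat-70B",
        tasks=[task],
        num_fewshot=n_shot,
        batch_size="auto",
    )
    # Print every metric dict the harness returns for this task (and any subtasks).
    for name, metrics in out["results"].items():
        print(f"{name} ({n_shot}-shot): {metrics}")
```

The harness reports its own metric names per task (for example, normalized accuracy on ARC and HellaSwag), so leaderboard-style averages are assembled from those per-task dictionaries rather than from a single score.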