Update REAMDE
README.md
CHANGED
@@ -195,7 +195,7 @@ More specifically, we do not change prompts, pick different few-shot examples, c
 
 The number of k–shot examples is listed per-benchmark.
 
-|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct
+|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
 |---------|-----------------------|--------|-------------|-------------------|-------------------|----------|------------------------|
 |AGI Eval<br>5-shot|49.7|50.1|54.0|56.9|48.4|49.0|59.6|
 |MMLU<br>5-shot|76.6|73.8|76.2|80.2|71.4|66.7|84.0|
@@ -220,7 +220,7 @@ The number of k–shot examples is listed per-benchmark.
 
 We take a closer look at different categories across 80 public benchmark datasets at the table below:
 
-|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct
+|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
 |--------|------------------------|--------|-------------|-------------------|-------------------|----------|------------------------|
 | Popular aggregated benchmark | 72.3 | 69.9 | 73.4 | 76.3 | 67.0 | 67.5 | 80.5 |
 | Reasoning | 83.2 | 79.3 | 81.5 | 86.7 | 78.3 | 80.4 | 89.3 |