microsoft/Phi-3-medium-128k-instruct

nguyenbh committed on May 21

Commit cfe82c6 (1 parent: a3c8c89)

Update REAMDE

Files changed (1)

README.md +2 -2

@@ -195,7 +195,7 @@ More specifically, we do not change prompts, pick different few-shot examples, c
 The number of k–shot examples is listed per-benchmark.
-|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct<br>8b|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
+|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
 |---------|-----------------------|--------|-------------|-------------------|-------------------|----------|------------------------|
 |AGI Eval<br>5-shot|49.7|50.1|54.0|56.9|48.4|49.0|59.6|
 |MMLU<br>5-shot|76.6|73.8|76.2|80.2|71.4|66.7|84.0|
@@ -220,7 +220,7 @@ The number of k–shot examples is listed per-benchmark.
 We take a closer look at different categories across 80 public benchmark datasets at the table below:
-|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct<br>8b|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
+|Benchmark|Phi-3-Medium-128k-Instruct<br>14b|Command R+<br>104B|Mixtral<br>8x22B|Llama-3-70B-Instruct|GPT3.5-Turbo<br>version 1106|Gemini<br>Pro|GPT-4-Turbo<br>version 1106 (Chat)|
 |--------|------------------------|--------|-------------|-------------------|-------------------|----------|------------------------|
 | Popular aggregated benchmark | 72.3 | 69.9 | 73.4 | 76.3 | 67.0 | 67.5 | 80.5 |
 | Reasoning                    | 83.2 | 79.3 | 81.5 | 86.7 | 78.3 | 80.4 | 89.3 |