tokyotech-llm/Swallow-7b-instruct-v0.1

@@ -53,7 +53,11 @@ This repository provides large language models developed by [TokyoTech-LLM](http
 ### MT-Bench JA
 * Note that models with the `v0.1` suffix are newer versions of their original counterparts with the `hf` suffix.
-* We will add the scores of `Swallow-70b-instruct-hf` and existing models soon.
 |Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
 |---|---|---|---|---|---|---|---|---|---|
@@ -62,19 +66,45 @@ This repository provides large language models developed by [TokyoTech-LLM](http
 | Swallow-13b-instruct-v0.1 |0.3669|0.4816|0.5562|0.2769|0.1020|0.1505|0.4179|0.4347|0.5150|
 | Swallow-13b-instruct-hf |0.2004|0.1932|0.2552|0.1507|0.1184|0.1285|0.2641|0.2434|0.2500|
 | Swallow-70b-instruct-v0.1 |0.4513|0.4822|0.5353|0.3497|0.3492|0.2668|0.5553|0.4955|0.5767|
-| Swallow-70b-instruct-hf |N/A|N/A|N/A|N/A|N/A|N/A|N/A|N/A|N/A|
 ## Evaluation Benchmarks
 ### MT-Bench JA
 We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
-We utilized the following artifacts:
 - Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
 - Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
 ## Usage

 ### MT-Bench JA
 * Note that models with the `v0.1` suffix are newer versions of their original counterparts with the `hf` suffix.
+* We report overall scores (i.e., the average of the first-turn and second-turn scores) as well as the first-turn and second-turn scores separately.
+#### Overall
 |Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
 |---|---|---|---|---|---|---|---|---|---|
 | Swallow-13b-instruct-v0.1 |0.3669|0.4816|0.5562|0.2769|0.1020|0.1505|0.4179|0.4347|0.5150|
 | Swallow-13b-instruct-hf |0.2004|0.1932|0.2552|0.1507|0.1184|0.1285|0.2641|0.2434|0.2500|
 | Swallow-70b-instruct-v0.1 |0.4513|0.4822|0.5353|0.3497|0.3492|0.2668|0.5553|0.4955|0.5767|
+| Swallow-70b-instruct-hf |0.3259|0.2925|0.4283|0.3447|0.1562|0.1856|0.5634|0.3315|0.3071|
+#### First Turn
+|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+|---|---|---|---|---|---|---|---|---|---|
+| Swallow-7b-instruct-v0.1 |0.3829|0.4960|0.4800|0.2220|0.2820|0.2164|0.3220|0.5440|0.4980|
+| Swallow-7b-instruct-hf |0.2216|0.2830|0.2150|0.1590|0.1080|0.1470|0.3542|0.2450|0.2650|
+| Swallow-13b-instruct-v0.1 |0.3948|0.5400|0.5220|0.3020|0.1040|0.1760|0.5040|0.5180|0.4920|
+| Swallow-13b-instruct-hf |0.2304|0.2460|0.2640|0.1610|0.1360|0.1330|0.3070|0.3010|0.2950|
+| Swallow-70b-instruct-v0.1 |0.4849|0.5720|0.5020|0.4780|0.3680|0.2467|0.5400|0.5720|0.5960|
+| Swallow-70b-instruct-hf |0.3631|0.3420|0.4007|0.4220|0.1580|0.2044|0.6120|0.4280|0.3360|
+#### Second Turn
+|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+|---|---|---|---|---|---|---|---|---|---|
+| Swallow-7b-instruct-v0.1 |0.3059|0.3940|0.4640|0.1441|0.1000|0.2253|0.2811|0.3724|0.4449|
+| Swallow-7b-instruct-hf |0.1432|0.1567|0.1798|0.1603|0.1010|0.1085|0.1767|0.1343|0.1295|
+| Swallow-13b-instruct-v0.1 |0.3353|0.4213|0.5911|0.2516|0.1000|0.1244|0.3194|0.3473|0.5394|
+| Swallow-13b-instruct-hf |0.1692|0.1364|0.2453|0.1401|0.1000|0.1237|0.2199|0.1850|0.2050|
+| Swallow-70b-instruct-v0.1 |0.4179|0.3913|0.5689|0.2184|0.3280|0.2884|0.5711|0.4171|0.5562|
+| Swallow-70b-instruct-hf |0.2872|0.2398|0.4564|0.2647|0.1540|0.1676|0.5118|0.2311|0.2762|
 ## Evaluation Benchmarks
 ### MT-Bench JA
 We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
+We utilized the following settings:
 - Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
 - Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+- Judge: `gpt-4-1106-preview`
+- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
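
The scoring setting above can be illustrated with a minimal sketch. It is not the authors' evaluation script. It assumes the judge rates each answer on a 1-10 scale and that normalization is a plain division by 10, which would be consistent with the 0.1000 floor values in the tables but is not stated in the README. The order of averaging (within each run, then across runs), the function names, and the sample ratings are all hypothetical.

```python
from statistics import mean

# Assumption: the judge rates on a 1-10 scale, so dividing by 10 maps to 0-1.
JUDGE_SCALE_MAX = 10


def normalize(raw_rating: float) -> float:
    """Map a raw judge rating onto the 0-1 range (assumed: rating / 10)."""
    return raw_rating / JUDGE_SCALE_MAX


def category_score(runs: list[list[float]]) -> float:
    """Average normalized ratings within each run, then across runs.

    runs[i] holds the raw ratings of one category's questions in run i.
    """
    per_run = [mean([normalize(r) for r in run]) for run in runs]
    return mean(per_run)


def overall_score(first_turn: float, second_turn: float) -> float:
    """Overall score as the average of first-turn and second-turn scores."""
    return mean([first_turn, second_turn])


# Hypothetical raw ratings for one category over five runs (made-up numbers).
example_runs = [[8, 6], [7, 7], [9, 5], [6, 8], [7, 7]]
print(f"category score: {category_score(example_runs):.4f}")
print(f"overall score:  {overall_score(0.4849, 0.4179):.4f}")
```

The reported overall values do not always equal the simple mean of the two per-turn averages to four decimals, so the actual aggregation may differ in detail from this sketch.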
 ## Usage