namespace-Pt committed
Commit dc2febd
1 Parent(s): 7ff4ffe
Upload folder using huggingface_hub
README.md CHANGED
@@ -34,12 +34,12 @@ We evaluate the model on [LongBench](https://arxiv.org/abs/2308.14508) using 32K

## InfiniteBench

We evaluate the model on [InfiniteBench](https://arxiv.org/pdf/2402.13718.pdf) using 80K context length and the official prompt template. The results of GPT4 are copied from the [paper](https://arxiv.org/pdf/2402.13718.pdf). For [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), we use 8K context length.

-|Model|LongBookQA Eng|
-|:-:|:-:|
-|GPT4|22.22|
-|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|7.00|
-|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k)|20.30|
-|[Llama-3-8B-Instruct-80K-QLoRA]()|**30.92**|
+|Model|LongBookQA Eng|LongBookSum Eng|
+|:-:|:-:|:-:|
+|GPT4|22.22|14.73|
+|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|7.00|**16.40**|
+|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k)|20.30|10.34|
+|[Llama-3-8B-Instruct-80K-QLoRA]()|**30.92**|14.73|

## Topic Retrieval

We evaluate the model on the [Topic Retrieval](https://lmsys.org/blog/2023-06-29-longchat/) task with `[5,10,15,20,25,30,40,50,60,70]` topics.
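For reference, a long-context generation pass like the one described above can be driven with a loop along these lines. This is a minimal sketch assuming a standard `transformers` setup; the prompt wording, left-side truncation, and greedy decoding are illustrative choices, not the repository's official evaluation harness.

```python
# Minimal sketch of a long-context QA pass in the style of InfiniteBench's
# LongBookQA. The prompt format, 80K truncation budget, and greedy decoding
# are illustrative assumptions, not the official harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # or the 80K QLoRA checkpoint
MAX_CONTEXT = 80 * 1024  # 80K-token budget, per the README's setting

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Truncate from the left so the question at the end of the prompt survives.
tokenizer.truncation_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer(book: str, question: str) -> str:
    prompt = (
        "Read the book below and answer the question.\n\n"
        f"{book}\n\nQuestion: {question}\nAnswer:"
    )
    inputs = tokenizer(
        prompt, return_tensors="pt", truncation=True, max_length=MAX_CONTEXT
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Left-side truncation is the key detail for book-length inputs: when the document exceeds the token budget, the beginning of the book is dropped rather than the question.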
@@ -52,6 +52,8 @@ We evaluate the model's zero-shot performance on MMLU benchmark as a reflection

|Model|STEM|Social Sciences|Humanities|Others|Avg|
|:-:|:-:|:-:|:-:|:-:|:-:|
+|[Llama-2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)|35.92|54.37|51.74|51.42|47.22|
+|[Mistral-7B-v0.2-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)|48.79|69.95|64.99|61.64|60.10|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|**53.87**|**75.66**|**69.44**|**69.75**|**65.91**|
|[gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k)|52.10|73.26|67.15|69.80|64.34|
|[Llama-3-8B-Instruct-80K-QLoRA]()|53.10|73.24|67.32|68.79|64.44|
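Similarly, here is a minimal sketch of zero-shot multiple-choice scoring in the MMLU style, where the predicted answer is the choice whose full prompt the model assigns the highest average token log-likelihood. The prompt format and scoring rule are assumptions for illustration, not the exact script behind the numbers above.

```python
# Minimal sketch of zero-shot MMLU-style scoring: rank the four answer
# choices by the model's average token log-likelihood. Prompt format and
# scoring rule are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

@torch.no_grad()
def choose(question: str, choices: list[str]) -> int:
    scores = []
    for letter, choice in zip("ABCD", choices):
        text = f"{question}\nAnswer: {letter}. {choice}"
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        # Shift logits/targets by one position for next-token prediction.
        logits = model(ids).logits[:, :-1]
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # Average log-likelihood normalizes for choice length.
        scores.append(token_logp.mean().item())
    return int(torch.tensor(scores).argmax())
```

Length normalization (averaging rather than summing token log-probabilities) is a common design choice here, since otherwise shorter answer choices are systematically favored.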