---
language:
- en
license: llama3
library_name: transformers
tags:
- tenyx-fine-tuning
- dpo
- tenyxchat
- llama3
datasets:
- HuggingFaceH4/ultrafeedback_binarized
pipeline_tag: text-generation
model-index:
- name: Llama3-TenyxChat-70B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 80.87
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 49.62
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 22.66
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 6.82
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 12.52
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 46.78
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
      name: Open LLM Leaderboard
---
# TenyxChat: Language Model Alignment using Tenyx Fine-tuning

Introducing Llama3-TenyxChat-70B, part of our TenyxChat series of models trained to function as useful assistants through preference tuning, using Tenyx's advanced fine-tuning technology ([VentureBeat article](https://venturebeat.com/ai/tenyx-aims-to-fix-llms-catastrophic-forgetting-problem/)). The model is trained using the [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) framework on the open-source AI feedback dataset [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).
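
For context, DPO optimizes a simple pairwise objective over (chosen, rejected) response pairs against a frozen reference model. Below is a minimal sketch of that loss; it illustrates the general DPO framework rather than Tenyx's proprietary fine-tuning procedure, and the `beta` value is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss: push the policy's log-probability ratio on chosen
    responses above its ratio on rejected ones, relative to a frozen
    reference model. `beta` (illustrative value) controls how far the
    policy may deviate from the reference."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log(pi/ref) on preferred
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log(pi/ref) on dispreferred
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```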

We fine-tune [Llama3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) with our proprietary approach, 
which yields an increase in [MT-Bench](https://arxiv.org/abs/2306.05685) score* without a drop in the model's performance on other benchmarks. 
Our approach aims to mitigate forgetting in LLMs in a computationally efficient manner, 
thereby enabling continual fine-tuning without altering the pre-trained output distribution. 
Llama3-TenyxChat-70B was trained on eight A100 (80GB) GPUs for fifteen hours, using the training setup from HuggingFaceH4's alignment handbook ([GitHub](https://github.com/huggingface/alignment-handbook)).

*The MT-Bench evaluation we perform follows the latest eval upgrade as PR'd [here](https://github.com/lm-sys/FastChat/pull/3158). This PR upgrades the judge from `GPT-4-0613` to `GPT-4-preview-0125` (the latest version at the time) and corrects and improves the quality of the reference answers for a subset of questions. These changes fix erroneous ratings produced by the previous evaluation setup.


**Model Developers** [Tenyx Research](https://www.tenyx.com/research)


# Model details

- Model type: Fine-tuned 70B Instruct model for chat.
- License: Meta Llama 3 Community License
- Base model: [Llama3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
- Demo: [HuggingFace Space](https://huggingface.co/spaces/tenyx/Llama3-TenyxChat-70B)

## Usage

Our model uses the same chat template as [Llama3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).

### Hugging Face Example

```python
import torch
from transformers import pipeline

# Load the model in bfloat16 and shard it across available GPUs.
pipe = pipeline("text-generation", model="tenyx/Llama3-TenyxChat-70B", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate."},
    {"role": "user", "content": "Hi. I would like to make a hotel booking."},
]

# Format the conversation with the Llama3 chat template before generation.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=512, do_sample=False)
print(outputs[0]["generated_text"])
```
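
For finer control over generation, the model can also be loaded directly. A minimal sketch under the same setup; the generation parameters below are illustrative, not prescribed settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tenyx/Llama3-TenyxChat-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi. I would like to make a hotel booking."},
]

# Tokenize with the chat template and generate only the assistant's reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```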


# Performance

At the time of release (April 2024), Llama3-TenyxChat-70B is the highest-ranked open-source model available for download on the MT-Bench evaluation.

## MT-Bench

MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These questions fall into eight categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, and Humanities. The chat models are rated using `GPT-4-preview-0125` on a scale of 1 to 10, with higher values corresponding to better responses.

| Model-name                     | GPT4-preview-0125 MT Bench | Chat Arena Elo |
|--------------------------------|----------------------------|----------------|
| GPT-4-1106                     | 8.79                       | 1251           |
| Claude 3 Opus (20240229)       | 8.57                       | 1247           |
| **Llama3-TenyxChat-70B**       |**8.15**                    | NA             |
| *Llama3-70B-Instruct*          | 7.96                       | 1207           |
| Claude 3 Sonnet (20240229)     | 7.82                       | 1190           |
| GPT-4-0314                     | 7.96                       | 1185           |
| Mixtral                        | 7.38                       | 1114           |
| gpt-3.5-turbo-0613             | 7.37                       | 1113           |
| Yi-34B                         | 6.46                       | 1099           |
| gpt-3.5-turbo-0125             | 7.52                       | 1096           |
| Llama 2 70B                    | 6.01                       | 1082           |
| NV-Llama2-70B-SteerLM-Chat     | 6.57                       | 1076           |

![MT-Bench category hexplot for Llama3-TenyxChat-70B](hexplot_llama3-tenyxchat-70b.png)


## Arena Hard

Arena-Hard is an evaluation benchmark for instruction-tuned LLMs containing 500 challenging user queries. It uses GPT-4-1106-preview as a judge to compare each model's responses against a baseline model (default: GPT-4-0314).

| Model-name                              | Score    | 95% CI      |
|-----------------------------------------|----------|-------------|
| gpt-4-0125-preview                      | 78.0     | (-1.8, 2.2) |
| claude-3-opus-20240229                  | 60.4     | (-2.6, 2.1) |
| gpt-4-0314                              | 50.0     | (0.0, 0.0)  |
| **tenyx/Llama3-TenyxChat-70B**          | **49.0** | (-3.0, 2.4) |
| *meta-llama/Meta-Llama-3-70B-Instruct*  | 47.3     | (-1.7, 2.6) |
| claude-3-sonnet-20240229                | 46.8     | (-2.7, 2.3) |
| claude-3-haiku-20240307                 | 41.5     | (-2.4, 2.5) |
| gpt-4-0613                              | 37.9     | (-2.1, 2.2) |
| mistral-large-2402                      | 37.7     | (-2.9, 2.8) |
| Qwen1.5-72B-Chat                        | 36.1     | (-2.1, 2.4) |
| command-r-plus                          | 33.1     | (-2.0, 1.9) |

## Open LLM Leaderboard Evaluation

We now present our results on the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), which is used to benchmark models for the Open LLM Leaderboard on Hugging Face. 
The evaluation covers `6` key benchmarks spanning reasoning and knowledge, each with its own *few-shot* setting. Read more about the benchmark at [the leaderboard page](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

| Model-name | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Llama3-TenyxChat-70B** | **79.43** | 72.53 | 86.11 | 79.95 | 62.93 | 83.82 | 91.21 |
| *Llama3-70B-Instruct* | 77.88 | 71.42 | 85.69 | 80.06 | 61.81 | 82.87 | 85.44 |

*The results reported are from our local evaluation of the model. `tenyx/Llama3-TenyxChat-70B` has been submitted and will appear on the leaderboard once evaluation completes.
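
For reference, a local evaluation of this kind can be run with the harness's Python API. This is a hedged sketch (lm-eval v0.4+); the task selection and batch size are illustrative, not our exact configuration:

```python
# Requires: pip install lm-eval
import lm_eval

# Evaluate on a subset of the leaderboard tasks; settings are illustrative only.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tenyx/Llama3-TenyxChat-70B,dtype=bfloat16",
    tasks=["hellaswag", "winogrande", "gsm8k"],
    batch_size=8,
)
print(results["results"])
```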

**Note**: While the Open LLM Leaderboard shows other performant Llama-3 fine-tuned models, we observe that these models typically regress in performance and struggle in multi-turn chat settings such as MT-Bench. We present the comparison below with a Llama3 fine-tune from the leaderboard.

| Model | First Turn | Second Turn | Average |
| --- | --- | --- | --- |
| **tenyx/Llama3-TenyxChat-70B** | 8.12 | 8.18 | 8.15 |
| *meta-llama/Meta-Llama-3-70B-Instruct* | 8.05 | 7.87 | 7.96 |
| MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 | 8.05 | 7.82 | 7.93 |

# Limitations

Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We have not fine-tuned the model explicitly to align with **human** safety preferences, so it is capable of producing undesirable outputs, particularly when adversarially prompted. In our observations, the model still tends to struggle with tasks involving reasoning and math, and it may at times generate verbose or extraneous content.

# License

Llama3-TenyxChat-70B is distributed under the Meta Llama 3 Community License.

# Citation

If you use Llama3-TenyxChat-70B in your research, please cite us as:

```
@misc{tenyxchat2024,
      title={TenyxChat: Language Model Alignment using Tenyx Fine-tuning}, 
      author={Tenyx},
      year={2024},
}
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_tenyx__Llama3-TenyxChat-70B).

|      Metric       |Value|
|-------------------|----:|
|Avg.               |36.54|
|IFEval (0-Shot)    |80.87|
|BBH (3-Shot)       |49.62|
|MATH Lvl 5 (4-Shot)|22.66|
|GPQA (0-shot)      | 6.82|
|MuSR (0-shot)      |12.52|
|MMLU-PRO (5-shot)  |46.78|