---
language:
- en
license: llama3
library_name: transformers
tags:
- tenyx-fine-tuning
- dpo
- tenyxchat
- llama3
datasets:
- HuggingFaceH4/ultrafeedback_binarized
pipeline_tag: text-generation
model-index:
- name: Llama3-TenyxChat-70B
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 80.87
name: strict accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 49.62
name: normalized accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 22.66
name: exact match
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 6.82
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 12.52
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 46.78
name: accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=tenyx/Llama3-TenyxChat-70B
name: Open LLM Leaderboard
---
# TenyxChat: Language Model Alignment using Tenyx Fine-tuning
Introducing Llama-3-TenyxChat-70B, part of our TenyxChat series trained to function as useful assistants through preference tuning, using Tenyx's advanced fine-tuning technology ([VentureBeat article](https://venturebeat.com/ai/tenyx-aims-to-fix-llms-catastrophic-forgetting-problem/)). Our model is trained using the [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) framework on the open-source AI feedback dataset [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).
We fine-tune [Llama3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) with our proprietary approach,
which yields an increase in [MT-Bench](https://arxiv.org/abs/2306.05685)* scores without a drop in performance on other benchmarks.
Our approach mitigates forgetting in LLMs in a computationally efficient manner,
enabling continual fine-tuning without altering the pre-trained output distribution.
Llama-3-TenyxChat-70B was trained on eight A100 (80 GB) GPUs for fifteen hours, using a training setup from HuggingFaceH4 ([GitHub](https://github.com/huggingface/alignment-handbook)).
*Our MT-Bench evaluation follows the eval upgrade PR'd [here](https://github.com/lm-sys/FastChat/pull/3158). This PR upgrades the judge from `GPT-4-0613` to `GPT-4-preview-0125` (the latest version at the time) and corrects and improves the reference answers for a subset of questions. These changes fix erroneous ratings from the previous evaluation.
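For intuition on the training objective, a minimal sketch of the per-example DPO loss is shown below. This illustrates the published DPO formulation, not Tenyx's proprietary training code; the log-probabilities and `beta` value are hypothetical inputs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023).

    Inputs are summed token log-probabilities of the chosen/rejected
    responses under the trained policy and the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)): shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

A zero margin gives a loss of log 2 ≈ 0.693; minimizing the loss widens the margin, while scaling by `beta` keeps the policy close to the reference model.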
**Model Developers** [Tenyx Research](https://www.tenyx.com/research)
# Model details
- Model type: Fine-tuned 70B Instruct model for chat.
- License: Meta Llama 3 Community License
- Base model: [Llama3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
- Demo: [HuggingFace Space](https://huggingface.co/spaces/tenyx/Llama3-TenyxChat-70B)
## Usage
Our model uses the same chat template as [Llama3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).
### Hugging Face Example
```python
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="tenyx/Llama3-TenyxChat-70B",
                torch_dtype=torch.bfloat16, device_map="auto")
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate."},
    {"role": "user", "content": "Hi. I would like to make a hotel booking."},
]
# Render the messages with the Llama3 chat template before generation.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=512, do_sample=False)
print(outputs[0]["generated_text"])
```
# Performance
At the time of release (April 2024), Llama3-TenyxChat-70B is the highest-ranked open-source model available for download on the MT-Bench evaluation.
## MT-Bench
MT-Bench is a benchmark made up of 80 high-quality multi-turn questions. These questions fall into eight categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, and Humanities. The chat models are rated using `GPT-4-preview-0125` on a scale of 1 to 10, with higher values corresponding to better responses.
| Model-name | GPT4-preview-0125 MT Bench | Chat Arena Elo |
|--------------------------------|----------------------------|----------------|
| GPT-4-1106 | 8.79 | 1251 |
| Claude 3 Opus (20240229) | 8.57 | 1247 |
| **Llama3-TenyxChat-70B** |**8.15** | NA |
| *Llama3-70B-Instruct* | 7.96 | 1207 |
| Claude 3 Sonnet (20240229) | 7.82 | 1190 |
| GPT-4-0314 | 7.96 | 1185 |
| Mixtral | 7.38 | 1114 |
| gpt-3.5-turbo-0613 | 7.37 | 1113 |
| Yi-34B | 6.46 | 1099 |
| gpt-3.5-turbo-0125 | 7.52 | 1096 |
| Llama 2 70B | 6.01 | 1082 |
| NV-Llama2-70B-SteerLM-Chat | 6.57 | 1076 |
![hexplot.png](hexplot_llama3-tenyxchat-70b.png)
## Arena Hard
Arena-Hard is an evaluation for instruction-tuned LLMs comprising 500 challenging user queries. It prompts GPT-4-1106-preview as a judge to compare each model's responses against a baseline model (default: GPT-4-0314).
| Model-name | Score | 95% CI |
|--------------------------------|----------|-------------|
| gpt-4-0125-preview | 78.0 | (-1.8, 2.2) |
| claude-3-opus-20240229 | 60.4 | (-2.6, 2.1) |
| gpt-4-0314 | 50.0 | (0.0, 0.0) |
| **tenyx/Llama3-TenyxChat-70B** | **49.0** | (-3.0, 2.4) |
| *meta-llama/Meta-Llama-3-70B-Instruct* | 47.3 | (-1.7, 2.6) |
| claude-3-sonnet-20240229 | 46.8 | (-2.7, 2.3) |
| claude-3-haiku-20240307 | 41.5 | (-2.4, 2.5) |
| gpt-4-0613 | 37.9 | (-2.1, 2.2) |
| mistral-large-2402 | 37.7 | (-2.9, 2.8) |
| Qwen1.5-72B-Chat | 36.1 | (-2.1, 2.4) |
| command-r-plus | 33.1 | (-2.0, 1.9) |
## Open LLM Leaderboard Evaluation
We now present our results on the [EleutherAI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), the framework used to benchmark models for the Open LLM Leaderboard on Hugging Face.
The evaluation covers `6` key benchmarks spanning reasoning and knowledge, each with its own *few-shot* setting. Read more about the benchmarks at [the leaderboard page](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
| Model-name | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Llama3-TenyxChat-70B** | **79.43** | 72.53 | 86.11 | 79.95 | 62.93 | 83.82 | 91.21 |
| *Llama3-70B-Instruct* | 77.88 | 71.42 | 85.69 | 80.06 | 61.81 | 82.87 | 85.44 |
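As a sanity check, the reported average is the unweighted mean of the six task scores (values copied from the table above):

```python
# Llama3-TenyxChat-70B task scores from the table above (local evaluation).
scores = {"ARC": 72.53, "HellaSwag": 86.11, "MMLU": 79.95,
          "TruthfulQA": 62.93, "Winogrande": 83.82, "GSM8K": 91.21}
average = sum(scores.values()) / len(scores)
# Matches the reported 79.43 up to rounding of the per-task scores.
```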
*The reported results are from a local evaluation of our model. `tenyx/Llama3-TenyxChat-70B` has been submitted and will appear on the leaderboard once evaluation completes.
**Note**: While the Open LLM Leaderboard lists other performant Llama-3 fine-tuned models, we observe that these models typically regress in performance and struggle in multi-turn chat settings such as MT-Bench. The table below compares against a Llama3 fine-tune from the leaderboard.
| Model | First Turn | Second Turn | Average |
| --- | --- | --- | --- |
| **tenyx/Llama3-TenyxChat-70B** | 8.12 | 8.18 | 8.15 |
| *meta-llama/Meta-Llama-3-70B-Instruct* | 8.05 | 7.87 | 7.96 |
| MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 | 8.05 | 7.82 | 7.93 |
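The MT-Bench average in the table above is simply the mean of the two per-turn scores, which can be sketched as:

```python
# First- and second-turn MT-Bench scores from the table above.
turn_scores = {
    "tenyx/Llama3-TenyxChat-70B": (8.12, 8.18),
    "meta-llama/Meta-Llama-3-70B-Instruct": (8.05, 7.87),
    "MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4": (8.05, 7.82),
}
averages = {name: (t1 + t2) / 2 for name, (t1, t2) in turn_scores.items()}
```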
# Limitations
Llama3-TenyxChat-70B, like other language models, has its own set of limitations. We have not explicitly fine-tuned the model to align with **human** safety preferences, so it is capable of producing undesirable outputs, particularly when adversarially prompted. In our observations, the model still struggles with reasoning- and math-heavy tasks, and it may occasionally generate verbose or extraneous content.
# License
Llama3-TenyxChat-70B is distributed under the Meta Llama 3 Community License.
# Citation
If you use Llama3-TenyxChat-70B in your research, please cite us as:
```
@misc{tenyxchat2024,
title={TenyxChat: Language Model Alignment using Tenyx Fine-tuning},
author={Tenyx},
year={2024},
}
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_tenyx__Llama3-TenyxChat-70B).
| Metric |Value|
|-------------------|----:|
|Avg. |36.54|
|IFEval (0-Shot) |80.87|
|BBH (3-Shot) |49.62|
|MATH Lvl 5 (4-Shot)|22.66|
|GPQA (0-shot) | 6.82|
|MuSR (0-shot) |12.52|
|MMLU-PRO (5-shot) |46.78|