Fix active number of params in model card
#13, opened by lewtun (HF staff)
README.md CHANGED
@@ -18,9 +18,9 @@ inference:
|
|
18 |
<img src="https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1/resolve/main/logo.png" alt="Zephyr 141B Logo" width="400" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
|
19 |
|
20 |
|
21 |
-
# Model Card for Zephyr 141B-
|
22 |
|
23 |
-
Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr 141B-
|
24 |
|
25 |
> [!NOTE]
|
26 |
> This model was trained collaboratively between Argilla, KAIST, and Hugging Face
|
@@ -31,7 +31,7 @@ Zephyr is a series of language models that are trained to act as helpful assista
|
|
31 |
|
32 |
<!-- Provide a longer summary of what this model is. -->
|
33 |
|
34 |
-
- **Model type:** A Mixture of Experts (MoE) model with 141B total parameters and
|
35 |
- **Language(s) (NLP):** Primarily English.
|
36 |
- **License:** Apache 2.0
|
37 |
- **Finetuned from model:** [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1)
|
@@ -45,11 +45,11 @@ Zephyr is a series of language models that are trained to act as helpful assista
|
|
45 |
|
46 |
## Performance
|
47 |
|
48 |
-
Zephyr 141B-
|
49 |
|
50 |
| Model | MT Bench | IFEval | BBH | AGIEval |
|
51 |
|-----------------------------------------------------------------------------------------------------|---------:|-------:|------:|--------:|
|
52 |
-
| [zephyr-orpo-141b-
|
53 |
| [databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct) | 8.26 | 52.13 | 48.50 | 41.16 |
|
54 |
| [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 8.30 | 55.08 | 45.31 | 47.68 |
|
55 |
|
@@ -93,7 +93,7 @@ print(outputs[0]["generated_text"][-1]["content"])
|
|
93 |
|
94 |
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
95 |
|
96 |
-
Zephyr 141B-
|
97 |
It is also unknown what the size and composition of the corpus was used to train the base model (`mistral-community/Mixtral-8x22B-v0.1`), however it is likely to have included a mix of Web data and technical sources like books and code. See the [Falcon 180B model card](https://huggingface.co/tiiuae/falcon-180B#training-data) for an example of this.
|
98 |
|
99 |
|
@@ -115,9 +115,6 @@ The following hyperparameters were used during training:
|
|
115 |
- lr_scheduler_warmup_steps: 100
|
116 |
- num_epochs: 3
|
117 |
|
118 |
-
### Training results
|
119 |
-
|
120 |
-
|
121 |
|
122 |
### Framework versions
|
123 |
|
@@ -128,7 +125,7 @@ The following hyperparameters were used during training:
|
|
128 |
|
129 |
## Citation
|
130 |
|
131 |
-
If you find Zephyr 141B-
|
132 |
|
133 |
```
|
134 |
@misc{hong2024orpo,
|
@@ -146,7 +143,7 @@ You may also wish to cite the creators of this model:
|
|
146 |
```
|
147 |
@misc{zephyr_141b,
|
148 |
author = {Alvaro Bartolome and Jiwoo Hong and Noah Lee and Kashif Rasul and Lewis Tunstall},
|
149 |
-
title = {Zephyr 141B
|
150 |
year = {2024},
|
151 |
publisher = {Hugging Face},
|
152 |
journal = {Hugging Face repository},
|
|
|
18 |
<img src="https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1/resolve/main/logo.png" alt="Zephyr 141B Logo" width="400" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
|
19 |
|
20 |
|
21 |
+
# Model Card for Zephyr 141B-A39B
|
22 |
|
23 |
+
Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr 141B-A39B is the latest model in the series, and is a fine-tuned version of [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1) that was trained using a novel alignment algorithm called [Odds Ratio Preference Optimization (ORPO)](https://huggingface.co/papers/2403.07691) with **7k instances** for **1.3 hours** on 4 nodes of 8 x H100s. ORPO does not require an SFT step to achieve high performance and is thus much more computationally efficient than methods like DPO and PPO. To train Zephyr-141B-A39B, we used the [`argilla/distilabel-capybara-dpo-7k-binarized`](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) preference dataset, which consists of synthetic, high-quality, multi-turn preferences that have been scored via LLMs.
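The ORPO objective mentioned above can be sketched for a single preference pair as follows. This is a minimal illustration based on the loss described in the linked ORPO paper, not the training code used for this model: the function names, the scalar inputs, and the default weight `lam=0.1` are assumptions chosen for the example. The inputs are assumed to be length-normalised (average per-token) log-likelihoods of the chosen and rejected responses.

```python
import math


def log_odds(logp: float) -> float:
    """Return log(p / (1 - p)) given log p, for 0 < p < 1."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """Sketch of the ORPO loss for one (chosen, rejected) pair.

    lam is a hypothetical weight on the odds-ratio term; it is not a value
    taken from this model card.
    """
    # Supervised term: negative log-likelihood of the chosen response.
    nll = -logp_chosen
    # Log odds ratio between the chosen and the rejected response.
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # Odds-ratio term: -log(sigmoid(z)) written as log(1 + exp(-z)).
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * odds_ratio_term


if __name__ == "__main__":
    # The loss is lower when the chosen response is the more likely one.
    print(orpo_loss(math.log(0.6), math.log(0.2)))
    print(orpo_loss(math.log(0.2), math.log(0.6)))
```

Because the supervised term and the preference term are optimised together in one loss, no separate SFT stage or reference model appears in the sketch, which is the efficiency argument made in the paragraph above.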
|
24 |
|
25 |
> [!NOTE]
|
26 |
> This model was trained collaboratively between Argilla, KAIST, and Hugging Face
|
|
|
31 |
|
32 |
<!-- Provide a longer summary of what this model is. -->
|
33 |
|
34 |
+
- **Model type:** A Mixture of Experts (MoE) model with 141B total parameters and 39B active parameters. (We initially made a small error in calculating the number of active parameters for the model ID. The model card states the correct number.) Fine-tuned on a mix of publicly available, synthetic datasets.
|
35 |
- **Language(s) (NLP):** Primarily English.
|
36 |
- **License:** Apache 2.0
|
37 |
- **Finetuned from model:** [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1)
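
The 141B total and 39B active figures can be roughly reproduced from the base model's architecture. The sketch below assumes the configuration values commonly published for `mistral-community/Mixtral-8x22B-v0.1` (hidden size 6144, feed-forward size 16384, 56 layers, 48 attention heads, 8 key/value heads, vocabulary of 32000, 8 experts with 2 routed per token). These numbers are assumptions for illustration and should be checked against the repository's `config.json`. Normalisation weights are ignored because they are negligible at this scale.

```python
# Assumed configuration of the base model; verify against config.json.
HIDDEN = 6144
FFN = 16384
LAYERS = 56
HEADS = 48
KV_HEADS = 8
VOCAB = 32000
EXPERTS = 8
EXPERTS_PER_TOKEN = 2


def moe_param_counts() -> tuple[int, int]:
    """Return (total, active) parameter counts, ignoring norm weights."""
    head_dim = HIDDEN // HEADS
    # Attention: query and output projections are HIDDEN x HIDDEN,
    # key and value projections use the smaller grouped-query width.
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * head_dim
    # Each expert is a gated MLP with three weight matrices.
    expert = 3 * HIDDEN * FFN
    # The router maps each hidden state to one score per expert.
    router = HIDDEN * EXPERTS
    # Input embeddings plus an untied output head.
    embeddings = 2 * VOCAB * HIDDEN

    shared = LAYERS * (attention + router) + embeddings
    total = shared + LAYERS * EXPERTS * expert
    active = shared + LAYERS * EXPERTS_PER_TOKEN * expert
    return total, active


if __name__ == "__main__":
    total, active = moe_param_counts()
    print(f"{total / 1e9:.1f}B total, {active / 1e9:.1f}B active")
```

Under these assumptions, only 2 of the 8 experts per layer contribute to each token, so the active count is the shared weights plus a quarter of the expert weights, which rounds to 39B rather than 35B.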
|
|
|
45 |
|
46 |
## Performance
|
47 |
|
48 |
+
Zephyr 141B-A39B was trained to test the effectiveness of ORPO at scale and the underlying dataset contains a mix of general chat capabilities. It achieves strong performance on chat benchmarks like [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [IFEval](https://arxiv.org/abs/2311.07911). The scores reported below were obtained using the [LightEval](https://github.com/huggingface/lighteval) evaluation suite and each prompt has been formatted with the model's corresponding chat template to simulate real-world usage. This is why some scores may differ from those reported in technical reports or on the Open LLM Leaderboard.
|
49 |
|
50 |
| Model | MT Bench | IFEval | BBH | AGIEval |
|
51 |
|-----------------------------------------------------------------------------------------------------|---------:|-------:|------:|--------:|
|
52 |
+
| [zephyr-orpo-141b-A39b-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1) | 8.17 | 65.06 | 58.96 | 44.16 |
|
53 |
| [databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct) | 8.26 | 52.13 | 48.50 | 41.16 |
|
54 |
| [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 8.30 | 55.08 | 45.31 | 47.68 |
|
55 |
|
|
|
93 |
|
94 |
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
95 |
|
96 |
+
Zephyr 141B-A39B has not been aligned to human preferences for safety within the RLHF phase or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so).
|
97 |
It is also unknown what the size and composition of the corpus was used to train the base model (`mistral-community/Mixtral-8x22B-v0.1`), however it is likely to have included a mix of Web data and technical sources like books and code. See the [Falcon 180B model card](https://huggingface.co/tiiuae/falcon-180B#training-data) for an example of this.
|
98 |
|
99 |
|
|
|
115 |
- lr_scheduler_warmup_steps: 100
|
116 |
- num_epochs: 3
|
117 |
|
|
|
|
|
|
|
118 |
|
119 |
### Framework versions
|
120 |
|
|
|
125 |
|
126 |
## Citation
|
127 |
|
128 |
+
If you find Zephyr 141B-A39B useful in your work, please cite the ORPO paper:
|
129 |
|
130 |
```
|
131 |
@misc{hong2024orpo,
|
|
|
143 |
```
|
144 |
@misc{zephyr_141b,
|
145 |
author = {Alvaro Bartolome and Jiwoo Hong and Noah Lee and Kashif Rasul and Lewis Tunstall},
|
146 |
+
title = {Zephyr 141B A39B},
|
147 |
year = {2024},
|
148 |
publisher = {Hugging Face},
|
149 |
journal = {Hugging Face repository},
|