---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.1
base_model:
- nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
---
# Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC

## Model Overview
- **Model Architecture:** Llama-3.1-Nemotron
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF), this model is intended for chat between a user and an AI assistant.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than those listed as supported.
- **Release Date:** 10/31/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** mysticbeing
- **Quantization method (quant_method):** compressed-tensors
- **Weights format:** float-quantized
- **Architecture:** LlamaForCausalLM
- **Attention heads:** 64
- **KV heads:** 8
- **Hidden activation:** [Sigmoid Linear Unit (SiLU)](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html)
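Most of these values can be read straight from the checkpoint's `config.json` via the standard `transformers` config API. A minimal sketch (the exact keys inside `quantization_config` are written by compressed-tensors and may vary between versions):

```python
from transformers import AutoConfig

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
config = AutoConfig.from_pretrained(MODEL_ID)

print(config.num_attention_heads)  # 64
print(config.num_key_value_heads)  # 8
print(config.hidden_act)           # "silu"
# compressed-tensors records its quantization metadata (quant_method,
# weight/activation formats, etc.) under quantization_config.
print(config.quantization_config)
```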

## Terms of use

By accessing this model, you agree to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), the [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and [Meta’s privacy policy](https://www.facebook.com/privacy/policy/).

## Model Details

### Description

Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF), keeping the base model's 8 KV heads.
It achieves an average score of [TBD] on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.

### Quantized models are eco-friendly and cost-effective
FP8 quantized models require significantly less storage than traditional 32-bit (FP32) or even 16-bit (FP16) models: at one byte per parameter instead of two or four, an FP8 checkpoint is roughly half the size of its FP16 counterpart, as the total file size comparison below shows. This efficiency makes powerful AI models easier to distribute, store, and access, even on devices with limited capacity.

Lower hardware requirements also mean lower costs for businesses and public institutions adopting AI solutions. Small businesses, startups, and government entities without extensive AI budgets can leverage high-performance FP8 quantized models at roughly half the infrastructure cost.
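As a rough back-of-the-envelope check (a sketch only; real checkpoints also include tokenizer files, FP8 scales, and some tensors kept at higher precision):

```python
# Rough storage estimate: bytes per parameter dominate checkpoint size.
NUM_PARAMS = 70e9  # ~70B parameters

for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)]:
    print(f"{name}: ~{NUM_PARAMS * bytes_per_param / 1e9:.0f} GB")

# FP32:      ~280 GB
# FP16/BF16: ~140 GB
# FP8:       ~70 GB
```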


<img src="https://cdn-uploads.huggingface.co/production/uploads/6590c65952dc1046ca0f13fe/WBVaZgiCklrdg_cy7qqza.png" alt="drawing" width="600"/>

[Base model description - Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF):

Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries.


The Llama-3.1-Nemotron-70B-Instruct-HF model reaches an [Arena Hard](https://github.com/lmarena/arena-hard-auto) score of 85.0, [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6, and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, metrics known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

As of Oct 24, 2024, the model has an Elo score of 1267 (±7), overall rank 9, and a style-controlled rank of 26 on the [ChatBot Arena leaderboard](https://lmarena.ai/?leaderboard).

This model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and [HelpSteer2-Preference prompts](https://huggingface.co/datasets/nvidia/HelpSteer2) on a Llama-3.1-70B-Instruct model as the initial policy.

See details at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257). As a preview, this model can correctly answer the question
```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:

```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

Note: This model is a demonstration of our techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math.


### Model Description

- **Quantized (FP8-DYNAMIC) from model:** [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
- **Model type:** Transformer
- **License:** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
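To make "FP8-DYNAMIC" concrete: with dynamic quantization, activation scales are computed from the live tensor at inference time rather than calibrated offline. A conceptual sketch in PyTorch (illustrative only; the function and variable names are ours, not the compressed-tensors implementation, and `torch.float8_e4m3fn` requires PyTorch >= 2.1):

```python
import torch

def quantize_fp8_dynamic(x: torch.Tensor):
    """Per-tensor dynamic FP8 (E4M3) quantization: scale from the live tensor."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = x.abs().max().clamp(min=1e-12) / finfo.max   # runtime scale
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 8)
x_fp8, scale = quantize_fp8_dynamic(x)
x_hat = x_fp8.to(torch.float32) * scale                  # dequantize
print((x - x_hat).abs().max())                           # small reconstruction error
```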

## Uses

### Primary Intended Uses

**General-domain instruction following**

- Designed for general-purpose instruction-following and dialogue tasks
- Optimized specifically for helpfulness in responses
- Focuses on generating coherent, factually correct, and customizable responses

**Research and development**

- Serves as a demonstration of NVIDIA's techniques for improving model helpfulness
- Can be used by researchers studying instruction-following capabilities
- Provides a benchmark for comparing alignment techniques

### Constraints

- Subject to Llama 3.1 license terms and conditions
- Must adhere to Meta's acceptable use policy and privacy policy
- Maximum input of 128k tokens and output of 4k tokens

## How to Get Started with the Model

Use the code below to get started with the model.

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
N_GPUS = 8            # tensor-parallel degree; adjust to your hardware
MAX_MODEL_LEN = 4096  # context length to allocate (the model supports up to 128k)
MAX_TOKENS = 1024     # generation budget per request

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=MAX_TOKENS)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r in strawberry?"},
]

# Render the chat into the Llama 3.1 prompt format as a plain string.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=MODEL_ID, tensor_parallel_size=N_GPUS, max_model_len=MAX_MODEL_LEN)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
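For example, once an OpenAI-compatible server is running, any OpenAI client can query it. A minimal sketch, assuming vLLM's `vllm serve` entrypoint and the server's default local address; the `api_key` value is a placeholder unless your deployment enforces one:

```python
from openai import OpenAI

# Assumes a server was started separately, e.g.:
#   vllm serve mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC \
#       --tensor-parallel-size 8 --max-model-len 4096
# vLLM's OpenAI-compatible server listens on localhost:8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many r in strawberry?"},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```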



### Out-of-Scope Use

**License and policy violations**

- Any use not complying with the Llama 3.1 license
- Applications violating Meta's acceptable use policy
- Uses conflicting with Meta's privacy policy

**Critical safety applications**

- Applications requiring high reliability or safety guarantees
- Applications where errors could lead to harm or safety issues

**Autonomous decision making**

- The model is designed to be helpful in responses, not to make independent decisions
- Applications requiring autonomous action without human oversight

**Real-time processing requirements**

- Applications needing ultra-low latency responses

## Evaluation

OpenLLM leaderboard (v1) results for the quantized model are pending (see the Description above); the unquantized base model scores 86.79.


## References:

* [FP8 Quantization: The Power of the Exponent](https://arxiv.org/abs/2208.09225)
* [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
* [NeMo Aligner](https://arxiv.org/abs/2405.01481)
* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
* [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/) 
* [Meta's Llama 3.1 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1) 
* [Meta's Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md)
 

## Model Architecture: 
**Architecture Type:** Transformer <br>
**Network Architecture:** Llama 3.1 <br>

## Input:
**Input Type(s):** Text <br>
**Input Format:** String <br>
**Input Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Input:** Max of 128k tokens<br>

## Output:
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:**  Max of 4k tokens <br>
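
A small sketch for checking requests against these limits before submission (the 128k/4k figures come from the spec above; the prompt string is a placeholder):

```python
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
MAX_INPUT_TOKENS = 128_000   # input limit from the spec above
MAX_OUTPUT_TOKENS = 4_000    # output limit from the spec above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

prompt = "How many r in strawberry?"  # placeholder prompt
n_tokens = len(tokenizer(prompt).input_ids)
assert n_tokens <= MAX_INPUT_TOKENS, (
    f"prompt is {n_tokens} tokens, over the {MAX_INPUT_TOKENS}-token input limit"
)
# Cap generation at the documented output limit, e.g. via
# SamplingParams(max_tokens=MAX_OUTPUT_TOKENS) in vLLM.
```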

## Software

**Supported Operating System(s):** Linux <br>

## Model Version: 
v1.0

# Training & Evaluation: 

## Alignment methodology
* REINFORCE implemented in NeMo Aligner 

# Inference:
**Engine:** [vLLM](https://github.com/vllm-project/vllm) <br>
**Test Hardware:** H100 (NVIDIA Hopper GPU microarchitecture) <br>


## Citation

If you find this model useful, please cite the following works.

**BibTeX:**