mysticbeing committed on
Commit 298b7f4
1 Parent(s): 089f622

Create README.md

Files changed (1): README.md (+272, -0)
---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.1
base_model:
- nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
---
# Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC

## Model Overview
- **Model Architecture:** Meta-Llama-3.1
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than those listed as supported.
- **Release Date:** 10/31/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** mysticbeing
- **Method used to quantize the weights (quant_method):** compressed-tensors (see the sketch after this list)
- **Weights format:** float-quantized
- **Architecture:** LlamaForCausalLM
- **Attention heads:** 64
- **KV heads:** 8
- **Hidden Activation:** [Sigmoid Linear Unit (SiLU)](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html)
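The metadata above (compressed-tensors, float-quantized, FP8 weights and activations) matches what the `FP8_DYNAMIC` scheme in [llm-compressor](https://github.com/vllm-project/llm-compressor) produces. The exact recipe used for this repository is not published, so the following is only a minimal sketch of how such a checkpoint is typically created; the output directory name is illustrative.

```python
# Minimal sketch: producing an FP8-dynamic (weights + activations) checkpoint
# with llm-compressor. Illustrative recipe, not necessarily the exact one
# used for this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

BASE_ID = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
SAVE_DIR = "Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"

# Load the base model; device_map="auto" shards it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# FP8_DYNAMIC: static per-channel FP8 weights plus dynamic per-token FP8
# activations; the lm_head is commonly left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Dynamic activation scales are computed at runtime, so no calibration data is needed.
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```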
## Terms of use

By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and [Meta's privacy policy](https://www.facebook.com/privacy/policy/).
## Model Details

### Description

Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) with the updated 8 KV heads.
It achieves an average score of [TBD] on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.

[Base model - Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) description:
Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.

The Llama-3.1-Nemotron-70B-Instruct-HF model reaches [Arena Hard](https://github.com/lmarena/arena-hard-auto) of 85.0, [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6 and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

As of Oct 24th, 2024, the model has an Elo score of 1267 (±7), rank 9, and a style-controlled rank of 26 on the [ChatBot Arena leaderboard](https://lmarena.ai/?leaderboard).

This model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and [HelpSteer2-Preference prompts](https://huggingface.co/datasets/nvidia/HelpSteer2) on a Llama-3.1-70B-Instruct model as the initial policy.

See details at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly answer the question
```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:
```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

Note: This model is a demonstration of NVIDIA's techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math.

### Model Description

- **Quantized (FP8-DYNAMIC) from model:** [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
- **Model type:** Transformer
- **License:** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
## Uses

Primary intended uses:

**General-domain instruction following**

- The model is designed for general-purpose instruction following and dialogue tasks
- Optimized specifically for helpfulness in responses
- Focuses on generating coherent, factually correct, and customizable responses

**Research and development**

- Serves as a demonstration of NVIDIA's techniques for improving model helpfulness
- Can be used by researchers studying instruction-following capabilities
- Provides a benchmark for comparing alignment techniques

Usage constraints:

- Subject to Llama 3.1 license terms and conditions
- Must adhere to Meta's acceptable use policy and privacy policy
- Maximum input of 128k tokens and output of 4k tokens
## How to Get Started with the Model

Use the code below to get started with the model.

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
N_GPUS = 8            # tensor-parallel degree; adjust to your hardware
MAX_MODEL_LEN = 4096  # context length to allocate; the model supports up to 128k
MAX_TOKENS = 1024     # maximum tokens to generate per request

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=MAX_TOKENS)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r in strawberry?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=MODEL_ID, tensor_parallel_size=N_GPUS, max_model_len=MAX_MODEL_LEN)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
Example output:

```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
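For serving, the same checkpoint can be exposed behind vLLM's OpenAI-compatible API and queried with the standard `openai` client. A minimal sketch, assuming a local server on vLLM's default port 8000 and the tensor-parallel settings from the offline example above:

```python
# Minimal sketch: querying a vLLM OpenAI-compatible server.
# Launch the server first, e.g.:
#   vllm serve mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC \
#     --tensor-parallel-size 8 --max-model-len 4096
from openai import OpenAI

# vLLM does not check the API key unless one is configured; "EMPTY" is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many r in strawberry?"},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```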
### Out-of-Scope Use

**License and policy violations**

- Any use not complying with the Llama 3.1 license
- Applications violating Meta's acceptable use policy
- Uses conflicting with Meta's privacy policy

**Critical safety applications**

- Applications requiring high reliability or safety guarantees
- Applications where errors could lead to harm or safety issues

**Autonomous decision making**

- The model is designed to be helpful in responses, not to make independent decisions
- Applications requiring autonomous action without human oversight

**Real-time processing requirements**

- Applications needing ultra-low latency responses
## Evaluation

### Testing Data, Factors & Metrics

[More Information Needed]

### Results

Results for the quantized model on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1) are pending; the unquantized model achieves an average score of 86.79.
## Reference(s):

* [FP8 Quantization: The Power of the Exponent](https://arxiv.org/abs/2208.09225)
* [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
* [NeMo Aligner](https://arxiv.org/abs/2405.01481)
* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
* [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/)
* [Meta's Llama 3.1 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1)
* [Meta's Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md)

## Model Architecture:
**Architecture Type:** Transformer <br>
**Network Architecture:** Llama 3.1 <br>

## Input:
**Input Type(s):** Text <br>
**Input Format:** String <br>
**Input Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Input:** Max of 128k tokens <br>

## Output:
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Max of 4k tokens <br>
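Given the 128k-token input cap above, it can be useful to check prompt length before generation. A minimal sketch using the model's tokenizer; `MAX_INPUT_TOKENS` is a stand-in for the advertised "128k", and the exact context length should be read from the model config:

```python
# Minimal sketch: validating a prompt against the advertised 128k-token input cap.
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
MAX_INPUT_TOKENS = 128_000  # stand-in for "128k"; see the model config for the exact value

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def fits_context(prompt: str, limit: int = MAX_INPUT_TOKENS) -> bool:
    """Return True if the tokenized prompt fits within the input limit."""
    return len(tokenizer(prompt).input_ids) <= limit

print(fits_context("How many r in strawberry?"))  # True
```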
## Software

**Supported Operating System(s):** Linux <br>

## Model Version:
v1.0
# Training & Evaluation:

## Alignment methodology
* REINFORCE implemented in NeMo Aligner

# Inference:
**Engine:** [vLLM](https://github.com/vllm-project/vllm) <br>
**Test Hardware:** H100 (NVIDIA Hopper GPU microarchitecture) <br>

## Citation

If you find this model useful, please cite the following works:

**BibTeX:**

[More Information Needed]

## Model Card Authors

mysticbeing

## Model Card Contact

[More Information Needed]