	---
	tags:
	- fp8
	- vllm
	language:
	- en
	- de
	- fr
	- it
	- pt
	- hi
	- es
	- th
	pipeline_tag: text-generation
	license: llama3.1
	base_model:
	- nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
	---
	# Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC

	## Model Overview
	- Model Architecture: Meta-Llama-3.1
	- Input: Text
	- Output: Text
	- Model Optimizations:
	  - Weight quantization: FP8
	  - Activation quantization: FP8
	- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF), this model is intended for assistant-like chat.
	- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
	- Release Date: 10/31/2024
	- Version: 1.0
	- License(s): [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
	- Model Developers: mysticbeing
	- Method used to quantize the weights (quant_method): compressed-tensors
	- Weights format: float-quantized
	- Architecture: LlamaForCausalLM
	- Attention heads: 64
	- KV heads: 8
	- Hidden Activation: [Sigmoid Linear Unit (SiLU)](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html)
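
With 64 attention heads sharing 8 KV heads, the model uses grouped-query attention, which shrinks the KV cache relative to one KV head per query head. The sketch below estimates the per-token cache size. The layer count (80), head dimension (128), and 16-bit cache dtype are assumptions based on the public Llama 3.1 70B configuration and are not stated in this card; the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers (keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed values (not stated in this card): 80 layers, head_dim 128, 16-bit cache.
NUM_LAYERS = 80
HEAD_DIM = 128
BYTES_PER_VALUE = 2

gqa = kv_cache_bytes_per_token(NUM_LAYERS, 8, HEAD_DIM, BYTES_PER_VALUE)   # 8 KV heads
mha = kv_cache_bytes_per_token(NUM_LAYERS, 64, HEAD_DIM, BYTES_PER_VALUE)  # one KV head per query head

print(f"GQA (8 KV heads):  {gqa / 1024:.0f} KiB per token")
print(f"MHA (64 KV heads): {mha / 1024:.0f} KiB per token")
print(f"Reduction factor:  {mha // gqa}x")
```

Under these assumptions the cache per token is 8 times smaller than it would be with 64 KV heads, which matters most at long context lengths.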

	## Terms of use

	By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and [Meta’s privacy policy](https://www.facebook.com/privacy/policy/).

	## Model Details


	## Description:

	Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) with the updated 8 KV-heads.
	It achieves an average score of [TBD] on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.

	[Base model - Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) description:
	-

	Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.


	The Llama-3.1-Nemotron-70B-Instruct-HF model scores 85.0 on [Arena Hard](https://github.com/lmarena/arena-hard-auto), 57.6 on [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/), and 8.98 on [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158), which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

	As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

	As of Oct 24th, 2024, the model has an Elo score of 1267 (±7), a rank of 9, and a style-controlled rank of 26 on the [ChatBot Arena leaderboard](https://lmarena.ai/?leaderboard).

	This model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and [HelpSteer2-Preference prompts](https://huggingface.co/datasets/nvidia/HelpSteer2) on a Llama-3.1-70B-Instruct model as the initial policy.

	See details at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257). As a preview, this model can correctly answer the question
	```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:

	```
	Let's count the "R"s in "Strawberry":

	1. S
	2. T
	3. R
	4. A
	5. W
	6. B
	7. E
	8. R
	9. R
	10. Y

	There are 3 "R"s in the word "Strawberry".
	```

	Note: This model is a demonstration of our techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math.


	### Model Description

	- Quantized (FP8-DYNAMIC) from model: [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
	- Model type: Transformer
	- License: [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
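
To make "FP8-DYNAMIC" more concrete, here is a minimal pure-Python sketch of dynamic per-token FP8 (E4M3) quantization. It assumes that "dynamic" means activation scales are computed at runtime from each token's largest absolute value, which is how this scheme is commonly described for compressed-tensors checkpoints; this card does not spell out the details. The helper names are hypothetical, and real kernels operate on GPU tensors rather than Python lists.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value representable in FP8 E4M3 (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate instead of overflowing
    # frexp gives abs(x) = m * 2**e with 0.5 <= m < 1, so floor(log2(abs(x))) = e - 1.
    exponent = max(math.frexp(abs(x))[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # spacing between neighbouring values at this exponent
    return round(x / step) * step


def quantize_token_dynamic(activations: list[float]) -> tuple[list[float], float]:
    """Quantize one token's activations with a scale derived from the data itself."""
    amax = max(abs(a) for a in activations)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(a / scale) for a in activations], scale


def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Map FP8 values back to the original range."""
    return [v * scale for v in quantized]


token = [0.02, -1.5, 0.75, 3.0]  # illustrative activations for a single token
q, scale = quantize_token_dynamic(token)
restored = dequantize(q, scale)
print("scale:", scale)
print("restored:", restored)
```

Because the scale is derived from each token, the largest activation always maps to the edge of the FP8 range, and no calibration data is needed for activations.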

	## Uses

	Primary Intended Uses:

	- General-Domain Instruction Following
	  - Designed for general-purpose instruction following and dialogue tasks
	  - Optimized specifically for helpfulness in responses
	  - Focuses on generating coherent, factually correct, and customizable responses
	- Research and Development
	  - Serves as a demonstration of NVIDIA's techniques for improving model helpfulness
	  - Can be used by researchers studying instruction-following capabilities
	  - Provides a benchmark for comparing alignment techniques

	Use is subject to the Llama 3.1 license terms and conditions and must adhere to Meta's acceptable use policy and privacy policy. The model accepts a maximum input of 128k tokens and produces a maximum output of 4k tokens.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	### Use with vLLM

	This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

	```python
	from vllm import LLM, SamplingParams
	from transformers import AutoTokenizer

	MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
	N_GPUS = 8            # tensor-parallel degree; set to the number of available GPUs
	MAX_MODEL_LEN = 4096  # context window reserved by vLLM (prompt plus generated tokens)
	MAX_TOKENS = 1024     # upper bound on generated tokens

	sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=MAX_TOKENS)

	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

	messages = [
	    {"role": "system", "content": "You are a helpful assistant."},
	    {"role": "user", "content": "How many r in strawberry?"},
	]

	# Render the chat messages into a single prompt string using the model's chat template.
	prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

	llm = LLM(model=MODEL_ID, tensor_parallel_size=N_GPUS, max_model_len=MAX_MODEL_LEN)

	outputs = llm.generate(prompt, sampling_params)

	# One prompt was submitted, so take the first completion of the first result.
	generated_text = outputs[0].outputs[0].text
	print(generated_text)
	```

	```
	Let's count the "R"s in "Strawberry":

	1. S
	2. T
	3. R
	4. A
	5. W
	6. B
	7. E
	8. R
	9. R
	10. Y

	There are 3 "R"s in the word "Strawberry".
	```

	vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
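
As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat-completions request using only the standard library. It assumes a server was started separately (for example with `vllm serve`) and is reachable at `http://localhost:8000`; the address and the helper names are assumptions, not part of this model card.

```python
import json
import urllib.request

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
BASE_URL = "http://localhost:8000/v1"  # assumed address of a locally running server


def build_chat_request(question: str, max_tokens: int = 1024) -> dict:
    """Build the JSON body for a chat-completions request."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("How many r in strawberry?")
print(json.dumps(payload, indent=2))
# With a server running, the reply could be fetched with:
# print(send_chat_request(payload))
```

The sampling settings mirror the offline example above, so both paths should behave comparably for the same prompt.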



	### Out-of-Scope Use

	- Any use not complying with the Llama 3.1 license
	- Applications violating Meta's acceptable use policy
	- Uses conflicting with Meta's privacy policy
	- Critical Safety Applications
	  - Applications requiring high reliability or safety guarantees
	  - Applications where errors could lead to harm or safety issues
	- Autonomous Decision Making
	  - The model is designed to be helpful in responses, not to make independent decisions
	  - Applications requiring autonomous action without human oversight
	- Real-time Processing Requirements
	  - Applications needing ultra-low latency responses


	## Reference(s):

	* [FP8 Quantization: The Power of the Exponent](https://arxiv.org/abs/2208.09225)
	* [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
	* [NeMo Aligner](https://arxiv.org/abs/2405.01481)
	* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
	* [HelpSteer2](https://arxiv.org/abs/2406.08673)
	* [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/)
	* [Meta's Llama 3.1 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1)
	* [Meta's Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md)


	## Model Architecture:
	Architecture Type: Transformer <br>
	Network Architecture: Llama 3.1 <br>

	## Input:
	Input Type(s): Text <br>
	Input Format: String <br>
	Input Parameters: One Dimensional (1D) <br>
	Other Properties Related to Input: Max of 128k tokens<br>

	## Output:
	Output Type(s): Text <br>
	Output Format: String <br>
	Output Parameters: One Dimensional (1D) <br>
	Other Properties Related to Output: Max of 4k tokens <br>

	## Software

	Supported Operating System(s): Linux <br>

	## Model Version:
	v1.0

	# Training & Evaluation:

	## Alignment methodology
	* REINFORCE implemented in NeMo Aligner

	# Inference:
	Engine: [vLLM](https://github.com/vllm-project/vllm) <br>
	Test Hardware: H100 (NVIDIA Hopper GPU Micro-architecture) <br>
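
For capacity planning on hardware like this, a back-of-the-envelope estimate of weight memory can help. The sketch assumes roughly 70 billion parameters, 80 GB of memory per GPU, and that every weight is stored in one byte; none of these figures come from this card. In practice some layers may remain in higher precision, and the KV cache and activations need additional room.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 70e9    # assumption: about 70B parameters
GPU_MEMORY_GB = 80   # assumption: 80 GB H100 variant

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 16-bit baseline
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)   # 8-bit weights

print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB")
for n_gpus in (1, 2, 4, 8):
    per_gpu = fp8_gb / n_gpus  # tensor parallelism shards the weights roughly evenly
    print(f"{n_gpus} GPU(s): ~{per_gpu:.1f} GB of FP8 weights each, "
          f"~{GPU_MEMORY_GB - per_gpu:.1f} GB left for KV cache and activations")
```

Under these assumptions the FP8 weights take about half the memory of a 16-bit checkpoint, which is the main reason fewer or smaller GPUs may suffice.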


	## Citation [optional]

	If you find this model useful, please cite the following works

	BibTeX:

	[More Information Needed]

