mysticbeing committed on
Commit 298b7f4
1 Parent(s): 089f622

Create README.md

Files changed (1): README.md (+272, -0)
---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.1
base_model:
- nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
---
# Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC

## Model Overview
- **Model Architecture:** Meta-Llama-3.1
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than those listed as supported.
- **Release Date:** 10/31/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** mysticbeing
- **Method used to quantize the weights (quant_method):** compressed-tensors (see the sketch after this list)
- **Weights format:** float-quantized
- **Architecture:** LlamaForCausalLM
- **Attention heads:** 64
- **KV heads:** 8
- **Hidden Activation:** [Sigmoid Linear Unit (SiLU)](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html)
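The metadata above (compressed-tensors, float-quantized, FP8 weights and activations) matches what the `FP8_DYNAMIC` scheme in [llm-compressor](https://github.com/vllm-project/llm-compressor) produces. The exact recipe used for this repository is not published, so the following is only a minimal sketch of how such a checkpoint is typically created; the output directory name is illustrative.

```python
# Minimal sketch: producing an FP8-dynamic (weights + activations) checkpoint
# with llm-compressor. Illustrative recipe, not necessarily the exact one
# used for this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

BASE_ID = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
SAVE_DIR = "Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"

# Load the base model; device_map="auto" shards it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# FP8_DYNAMIC: static per-channel FP8 weights plus dynamic per-token FP8
# activations; the lm_head is commonly left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Dynamic activation scales are computed at runtime, so no calibration data is needed.
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```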
## Terms of use

By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md) and [Meta's privacy policy](https://www.facebook.com/privacy/policy/).
## Model Details

### Description

Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) with the updated 8 KV heads.
It achieves an average score of [TBD] on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.

[Base model - Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) description:
Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.

The Llama-3.1-Nemotron-70B-Instruct-HF model reaches [Arena Hard](https://github.com/lmarena/arena-hard-auto) of 85.0, [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6 and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

As of Oct 24th, 2024, the model has an Elo score of 1267 (±7), rank 9, and a style-controlled rank of 26 on the [ChatBot Arena leaderboard](https://lmarena.ai/?leaderboard).

This model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) and [HelpSteer2-Preference prompts](https://huggingface.co/datasets/nvidia/HelpSteer2) on a Llama-3.1-70B-Instruct model as the initial policy.

See details at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly answer the question
```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:
```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

Note: This model is a demonstration of NVIDIA's techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math.

### Model Description

- **Quantized (FP8-DYNAMIC) from model:** [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
- **Model type:** Transformer
- **License:** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
## Uses

Primary intended uses:

**General-domain instruction following**

- The model is designed for general-purpose instruction following and dialogue tasks
- Optimized specifically for helpfulness in responses
- Focuses on generating coherent, factually correct, and customizable responses

**Research and development**

- Serves as a demonstration of NVIDIA's techniques for improving model helpfulness
- Can be used by researchers studying instruction-following capabilities
- Provides a benchmark for comparing alignment techniques

Usage constraints:

- Subject to Llama 3.1 license terms and conditions
- Must adhere to Meta's acceptable use policy and privacy policy
- Maximum input of 128k tokens and output of 4k tokens
## How to Get Started with the Model

Use the code below to get started with the model.

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
N_GPUS = 8            # tensor-parallel degree; adjust to your hardware
MAX_MODEL_LEN = 4096  # context length to allocate; the model supports up to 128k
MAX_TOKENS = 1024     # maximum tokens to generate per request

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=MAX_TOKENS)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r in strawberry?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=MODEL_ID, tensor_parallel_size=N_GPUS, max_model_len=MAX_MODEL_LEN)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
Example output:

```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
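For serving, the same checkpoint can be exposed behind vLLM's OpenAI-compatible API and queried with the standard `openai` client. A minimal sketch, assuming a local server on vLLM's default port 8000 and the tensor-parallel settings from the offline example above:

```python
# Minimal sketch: querying a vLLM OpenAI-compatible server.
# Launch the server first, e.g.:
#   vllm serve mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC \
#     --tensor-parallel-size 8 --max-model-len 4096
from openai import OpenAI

# vLLM does not check the API key unless one is configured; "EMPTY" is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many r in strawberry?"},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```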
### Out-of-Scope Use

**License and policy violations**

- Any use not complying with the Llama 3.1 license
- Applications violating Meta's acceptable use policy
- Uses conflicting with Meta's privacy policy

**Critical safety applications**

- Applications requiring high reliability or safety guarantees
- Applications where errors could lead to harm or safety issues

**Autonomous decision making**

- The model is designed to be helpful in responses, not to make independent decisions
- Applications requiring autonomous action without human oversight

**Real-time processing requirements**

- Applications needing ultra-low latency responses
## Evaluation

### Testing Data, Factors & Metrics

[More Information Needed]

### Results

Results for the quantized model on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1) are pending; the unquantized model achieves an average score of 86.79.
## Reference(s):

* [FP8 Quantization: The Power of the Exponent](https://arxiv.org/abs/2208.09225)
* [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
* [NeMo Aligner](https://arxiv.org/abs/2405.01481)
* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
* [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/)
* [Meta's Llama 3.1 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1)
* [Meta's Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md)

## Model Architecture:
**Architecture Type:** Transformer <br>
**Network Architecture:** Llama 3.1 <br>

## Input:
**Input Type(s):** Text <br>
**Input Format:** String <br>
**Input Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Input:** Max of 128k tokens <br>

## Output:
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Max of 4k tokens <br>
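Given the 128k-token input cap above, it can be useful to check prompt length before generation. A minimal sketch using the model's tokenizer; `MAX_INPUT_TOKENS` is a stand-in for the advertised "128k", and the exact context length should be read from the model config:

```python
# Minimal sketch: validating a prompt against the advertised 128k-token input cap.
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
MAX_INPUT_TOKENS = 128_000  # stand-in for "128k"; see the model config for the exact value

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def fits_context(prompt: str, limit: int = MAX_INPUT_TOKENS) -> bool:
    """Return True if the tokenized prompt fits within the input limit."""
    return len(tokenizer(prompt).input_ids) <= limit

print(fits_context("How many r in strawberry?"))  # True
```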
## Software

**Supported Operating System(s):** Linux <br>

## Model Version:
v1.0
# Training & Evaluation:

## Alignment methodology
* REINFORCE implemented in NeMo Aligner

# Inference:
**Engine:** [vLLM](https://github.com/vllm-project/vllm) <br>
**Test Hardware:** H100 (NVIDIA Hopper GPU microarchitecture) <br>

## Citation

If you find this model useful, please cite the following works:

**BibTeX:**

[More Information Needed]

## Model Card Authors

mysticbeing

## Model Card Contact

[More Information Needed]