mgoin committed
Commit 390d583
1 Parent(s): 21a28bf

Update README.md

Files changed (1)
  1. README.md +94 -3
README.md CHANGED
@@ -2,9 +2,100 @@
  tags:
  - fp8
  - vllm
+ language:
+ - en
+ - de
+ - fr
+ - it
+ - pt
+ - hi
+ - es
+ - th
+ pipeline_tag: text-generation
+ license: llama3.2
+ base_model: meta-llama/Llama-3.2-90B-Vision-Instruct
  ---

- Run with `vllm==0.6.2` on 4xH100:
- ```
+ # Llama-3.2-90B-Vision-Instruct-FP8-dynamic
+
+ ## Model Overview
+ - **Model Architecture:** Meta-Llama-3.2
+ - **Input:** Text/Image
+ - **Output:** Text
+ - **Model Optimizations:**
+   - **Weight quantization:** FP8
+   - **Activation quantization:** FP8
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
+ - **Release Date:** 9/25/2024
+ - **Version:** 1.0
+ - **License(s):** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct/blob/main/LICENSE)
+ - **Model Developers:** Neural Magic
+
+ Quantized version of [Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct).
+
+ ### Model Optimizations
+
+ This model was obtained by quantizing the weights and activations of [Llama-3.2-90B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
+ This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
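
As a rough back-of-envelope check of that figure, here is a minimal sketch assuming roughly 90B parameters and ignoring the parts left unquantized (embeddings, `lm_head`, and the vision tower):

```python
params = 90e9               # rough parameter count for the 90B model
bf16_gb = params * 2 / 1e9  # 16-bit weights: ~2 bytes per parameter
fp8_gb = params * 1 / 1e9   # 8-bit weights: ~1 byte per parameter
print(f"~{bf16_gb:.0f} GB -> ~{fp8_gb:.0f} GB ({fp8_gb / bf16_gb:.0%} of the original size)")
```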
+
+ Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme, in which a single linear scale per output dimension maps between the FP8 and original representations; activations are quantized with a symmetric dynamic per-token scheme.
+ [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
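
To make the scheme concrete, the snippet below is a minimal numerical sketch (an illustration only, not the LLM Compressor implementation) of symmetric per-channel FP8 weight quantization and dynamic per-token FP8 activation quantization for a single linear layer, assuming a PyTorch build with `float8_e4m3fn` support:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_weight_per_channel(weight: torch.Tensor):
    # Static, symmetric quantization: one scale per output channel (row).
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    w_fp8 = (weight / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic, symmetric quantization: one scale per token, computed at runtime.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Toy check: the dequantized matmul approximates the original linear layer.
w = torch.randn(4096, 4096)  # (out_features, in_features)
x = torch.randn(2, 4096)     # (tokens, in_features)
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)
y_approx = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).t()
print(torch.nn.functional.mse_loss(y_approx, x @ w.t()))
```

Because activation scales are computed on the fly, this scheme needs no calibration data, which is why the `oneshot` call in the Creation section below passes no dataset.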
+
+ ## Deployment
+
+ ### Use with vLLM
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+
+ ```bash
  vllm serve neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16 --tensor-parallel-size 4
- ```
+ ```
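
`vllm serve` exposes an OpenAI-compatible API (on port 8000 by default), so the server can be queried with the standard `openai` client. The snippet below is a minimal sketch of a multimodal chat request; the image URL is only a placeholder:

```python
from openai import OpenAI

# Point the client at the locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```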
+
+ ## Creation
+
+ This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/f90013702b15bd1690e4e2fe9ed434921b6a6199/examples/quantization_w8a8_fp8/llama3.2_vision_example.py), as presented in the code snippet below.
+
+ ```python
+ from transformers import AutoProcessor, MllamaForConditionalGeneration
+
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+ from llmcompressor.transformers import oneshot, wrap_hf_model_class
+
+ MODEL_ID = "meta-llama/Llama-3.2-90B-Vision-Instruct"
+
+ # Load model.
+ model_class = wrap_hf_model_class(MllamaForConditionalGeneration)
+ model = model_class.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
+
+ # Configure the quantization algorithm and scheme.
+ # In this case, we:
+ #   * quantize the weights to FP8 with per-channel scales via PTQ
+ #   * quantize the activations to FP8 with dynamic per-token scales
+ recipe = QuantizationModifier(
+     targets="Linear",
+     scheme="FP8_DYNAMIC",
+     ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
+ )
+
+ # Apply quantization and save to disk in compressed-tensors format.
+ SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+ oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
+ processor.save_pretrained(SAVE_DIR)
+
+ # Confirm generations of the quantized model look sane.
+ print("========== SAMPLE GENERATION ==============")
+ input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
+ output = model.generate(input_ids, max_new_tokens=20)
+ print(processor.decode(output[0]))
+ print("==========================================")
+ ```
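
As a quick sanity check on the output directory, the following sketch assumes LLM Compressor records the scheme under a `quantization_config` key in the checkpoint's `config.json` (the compressed-tensors convention):

```python
import json
from pathlib import Path

# Directory written by the quantization script above.
SAVE_DIR = "Llama-3.2-90B-Vision-Instruct-FP8-Dynamic"

config = json.loads((Path(SAVE_DIR) / "config.json").read_text())
# Expect an FP8 weight/activation scheme targeting Linear modules,
# with lm_head, the multimodal projector, and the vision tower ignored.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```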
+
+ ## Evaluation
+
+ TBD
+
+ ### Reproduction
+
+ TBD