---
license: llama3.1
---

Amazingly fast INT8 inference on Ampere GPUs like the 3090 Ti. In vLLM, I ran it on a task for 10 minutes with prompt caching: the fixed part of the input averaged around 2000 tokens, the variable part around 200, and the output around 200.

Per second, that averages 22.5k t/s prompt processing and 1.5k t/s generation.

Extrapolated to an hour, that's 81M input tokens and 5.5M output tokens. Peak generation speed I've seen is around 2.6k to 2.8k t/s.
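
The hourly figures are just the per-second rates multiplied by 3600. A minimal sketch of that arithmetic (the helper name `tokens_per_hour` is made up for illustration, and the inputs are the rounded rates quoted above):

```python
def tokens_per_hour(tokens_per_second):
    """Scale a per-second token rate to a per-hour token count."""
    return tokens_per_second * 3600


prompt_tps = 22_500  # rounded prompt-processing rate quoted above
gen_tps = 1_500      # rounded generation rate quoted above

print(f"input tokens/hour:  {tokens_per_hour(prompt_tps):,}")
print(f"output tokens/hour: {tokens_per_hour(gen_tps):,}")
```

With the rounded 1.5k t/s figure the output side comes to 5.4M per hour, so the 5.5M quoted above presumably reflects a slightly higher unrounded rate.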

Quantized on an H100; on the 3090 Ti I kept running out of memory (OOM).

Creation script:

```python
from transformers import AutoTokenizer
from datasets import Dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
import random

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"

# Calibration set size: 256 samples of 8192 tokens each.
num_samples = 256
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibrate on uniformly random token IDs rather than real text.
max_token_id = len(tokenizer.get_vocab()) - 1
input_ids = [[random.randint(0, max_token_id) for _ in range(max_seq_len)] for _ in range(num_samples)]
attention_mask = num_samples * [max_seq_len * [1]]
ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})

# GPTQ with the W8A8 scheme on every Linear layer except lm_head.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

# Run one-shot calibration and quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("NousResearch_Hermes-3-Llama-3.1-8B.w8a8")
```
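
For reference, here is a rough sketch of how the saved checkpoint might be served in vLLM with a shared prompt prefix. It assumes that the "prompt caching" mentioned above corresponds to vLLM's automatic prefix caching (`enable_prefix_caching=True`), and that the tokenizer is loaded from the original model ID, since the script above saves only the model. The helper names `build_prompts` and `generate` are hypothetical, and this is not necessarily the setup used for the numbers above.

```python
def build_prompts(fixed_prefix, variable_inputs):
    """Prepend the same fixed prefix to every variable input.

    Sharing the prefix verbatim is what lets a prefix cache reuse it.
    """
    return [fixed_prefix + text for text in variable_inputs]


def generate(model_path, prompts, max_tokens=200):
    """Generate completions with vLLM (assumed setup, see the note above)."""
    # Imported lazily so the prompt helper works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model=model_path,
        # Assumption: the quantized directory has no tokenizer files,
        # so the tokenizer comes from the original model.
        tokenizer="NousResearch/Hermes-3-Llama-3.1-8B",
        enable_prefix_caching=True,
    )
    params = SamplingParams(max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Usage would look something like `generate("NousResearch_Hermes-3-Llama-3.1-8B.w8a8", build_prompts(system_text, user_texts))`, where `system_text` and `user_texts` are placeholders for the fixed and variable parts of the input.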