Amazingly quick to run inference on Ampere GPUs like the 3090 Ti in INT8, serving with vLLM.

Averaged over a second, that's 22.5k t/s prompt processing and 1.5k t/s generation.
Averaged over an hour, that's 81M input tokens and 5.5M output tokens. Peak generation speed I see is around 2.6k-2.8k t/s.
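As a reference point, here is a minimal sketch of loading the resulting checkpoint with vLLM's offline Python API. The prompt and sampling settings are illustrative assumptions, not the configuration behind the numbers above, and the sketch assumes vLLM picks up the compressed-tensors (W8A8) format from the checkpoint's config on its own:

```python
from vllm import LLM, SamplingParams

# Minimal sketch (assumptions noted above): load the W8A8 checkpoint
# produced by the creation script below. vLLM reads the quantization
# format from the model's config, so no explicit flag is set here.
llm = LLM(model="NousResearch_Hermes-3-Llama-3.1-8B.w8a8")

# Illustrative sampling parameters, not the benchmark settings.
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Write a short note on INT8 quantization."], params)
print(outputs[0].outputs[0].text)
```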
 
Creation script:

```python
import random

from datasets import Dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from transformers import AutoTokenizer

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"

num_samples = 256
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data: random token IDs drawn uniformly from the vocabulary,
# rather than real text, with a full attention mask on every sequence.
max_token_id = len(tokenizer.get_vocab()) - 1
input_ids = [
    [random.randint(0, max_token_id) for _ in range(max_seq_len)]
    for _ in range(num_samples)
]
attention_mask = num_samples * [max_seq_len * [1]]
ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})

# GPTQ W8A8 recipe: INT8 weights and activations for all Linear layers,
# leaving the lm_head in full precision.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

# One-shot (post-training) quantization over the calibration set.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("NousResearch_Hermes-3-Llama-3.1-8B.w8a8")
```
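The script saves only the model weights. If the output folder should be directly usable without fetching the tokenizer separately, one optional extra step (not part of the script above) is to save the tokenizer alongside:

```python
# Optional follow-up (not in the original script): store the tokenizer
# with the quantized weights so the output folder is self-contained.
tokenizer.save_pretrained("NousResearch_Hermes-3-Llama-3.1-8B.w8a8")
```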