Amazingly quick to run inference on Ampere GPUs like the 3090 Ti in INT8, serving with vLLM.

Averaged over a second, that's 22.5k t/s prompt processing and 1.5k t/s generation.
Averaged over an hour, that's 81M input tokens and 5.5M output tokens. Peak generation speed I see is around 2.6k-2.8k t/s.
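As a reference point, here is a minimal sketch of loading the resulting checkpoint with vLLM's offline Python API. The prompt and sampling settings are illustrative assumptions, not the configuration behind the numbers above, and the sketch assumes vLLM picks up the compressed-tensors (W8A8) format from the checkpoint's config on its own:

```python
from vllm import LLM, SamplingParams

# Minimal sketch (assumptions noted above): load the W8A8 checkpoint
# produced by the creation script below. vLLM reads the quantization
# format from the model's config, so no explicit flag is set here.
llm = LLM(model="NousResearch_Hermes-3-Llama-3.1-8B.w8a8")

# Illustrative sampling parameters, not the benchmark settings.
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Write a short note on INT8 quantization."], params)
print(outputs[0].outputs[0].text)
```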
 
Creation script:

```python
import random

from datasets import Dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from transformers import AutoTokenizer

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"

num_samples = 256
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data: random token IDs drawn uniformly from the vocabulary,
# rather than real text, with a full attention mask on every sequence.
max_token_id = len(tokenizer.get_vocab()) - 1
input_ids = [
    [random.randint(0, max_token_id) for _ in range(max_seq_len)]
    for _ in range(num_samples)
]
attention_mask = num_samples * [max_seq_len * [1]]
ds = Dataset.from_dict({"input_ids": input_ids, "attention_mask": attention_mask})

# GPTQ W8A8 recipe: INT8 weights and activations for all Linear layers,
# leaving the lm_head in full precision.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

# One-shot (post-training) quantization over the calibration set.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("NousResearch_Hermes-3-Llama-3.1-8B.w8a8")
```
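The script saves only the model weights. If the output folder should be directly usable without fetching the tokenizer separately, one optional extra step (not part of the script above) is to save the tokenizer alongside:

```python
# Optional follow-up (not in the original script): store the tokenizer
# with the quantized weights so the output folder is self-contained.
tokenizer.save_pretrained("NousResearch_Hermes-3-Llama-3.1-8B.w8a8")
```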