abhinavnmagic committed
Commit 8824fca1 · Parent(s): 8f06008
Update README.md
README.md CHANGED
@@ -29,7 +29,7 @@ This model was obtained by quantizing the weights of [Phi-3-medium-4k-instruct](
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 
 Only the weights of the linear operators within transformers blocks are quantized. Symmetric group-wise quantization is applied, in which a linear scaling per group maps the INT4 and floating point representations of the quantized weights.
-[
+The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. Quantization is performed with a 1% damping factor, a group size of 128, and 512 sequences sampled from [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).
 
 
 ## Deployment
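For intuition, the symmetric group-wise scheme described in the hunk above can be sketched with plain round-to-nearest quantization. This is a toy illustration only, not the GPTQ/llm-compressor implementation (GPTQ additionally minimizes layer-wise reconstruction error on calibration data); the tensor shape and group size below are arbitrary.

```python
import torch

def quantize_symmetric_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Toy symmetric group-wise INT4 quantization: one scale per group maps
    between the floating-point and signed 4-bit representations."""
    rows, cols = weight.shape
    grouped = weight.reshape(rows, cols // group_size, group_size)
    # Symmetric scheme: the scale maps each group's absolute maximum to the
    # largest positive INT4 value (7); the representable range is [-8, 7].
    scales = grouped.abs().amax(dim=-1, keepdim=True) / 7.0
    int4 = torch.clamp(torch.round(grouped / scales), min=-8, max=7)
    dequantized = (int4 * scales).reshape(rows, cols)
    return int4.to(torch.int8), scales, dequantized

weight = torch.randn(16, 512)
int4, scales, approx = quantize_symmetric_groupwise(weight)
print("max quantization error:", (weight - approx).abs().max().item())
```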
@@ -63,12 +63,11 @@ generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
-vLLM
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 ### Use with transformers
 
-
-The following example contemplates how the model can be used using the `generate()` function.
+The following example shows how the model can be deployed in Transformers using the `generate()` function.
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
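The added line above points to vLLM's OpenAI-compatible server. A minimal client-side sketch, not taken from the card, is shown below; it assumes a server has already been started for this checkpoint, and the repository id, base URL, port, and prompt are placeholders.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally, e.g. via:
#   python -m vllm.entrypoints.openai.api_server --model neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```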
@@ -112,12 +111,12 @@ print(tokenizer.decode(response, skip_special_tokens=True))
 
 ## Creation
 
-This model was created by
-Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor) which supports several quantization schemes and models not supported by AutoGPTQ.
+This model was created using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as presented in the code snippet below.
 
 ```python
 from transformers import AutoTokenizer
-from
+from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
 from datasets import load_dataset
 import random
 
@@ -141,21 +140,27 @@ examples = [
 ) for example in ds
 ]
 
-
-
-
-
-
-    damp_percent=0.01,
+recipe = GPTQModifier(
+    targets="Linear",
+    scheme="W4A16",
+    ignore=["lm_head"],
+    dampening_frac=0.1,
 )
 
-model =
+model = SparseAutoModelForCausalLM.from_pretrained(
     model_id,
-    quantize_config,
     device_map="auto",
+    trust_remote_code=True,
+)
+
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=max_seq_len,
+    num_calibration_samples=num_samples,
 )
 
-model.quantize(examples)
 model.save_pretrained("Phi-3-medium-128k-instruct-quantized.w4a16")
 ```
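The hunk above references `ds`, `examples`, `max_seq_len`, and `num_samples`, which are defined outside the lines shown. A hypothetical reconstruction of that calibration-set preparation is sketched below, based only on the card's statement that 512 Open-Platypus sequences were used; the base checkpoint, split, maximum sequence length, seed, and chat-template formatting are assumptions.

```python
import random

from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-medium-128k-instruct"  # assumed base checkpoint
num_samples = 512   # stated in the card
max_seq_len = 4096  # assumption; not stated in the lines shown

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Sample 512 calibration sequences from Open-Platypus and format each
# instruction with the model's chat template before calibration.
raw = load_dataset("garage-bAInd/Open-Platypus", split="train")
indices = random.sample(range(len(raw)), num_samples)
ds = raw.select(indices).map(
    lambda example: {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["instruction"]}],
            tokenize=False,
            add_generation_prompt=True,
        )
    }
)
```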
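As a quick follow-up, not part of the card, the freshly saved checkpoint can be sanity-checked by loading it with vLLM, mirroring the card's Deployment section; the local path, prompt, and sampling parameters are placeholders.

```python
from vllm import LLM, SamplingParams

# Load the locally saved quantized checkpoint and generate a short completion.
llm = LLM(model="Phi-3-medium-128k-instruct-quantized.w4a16")
sampling_params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["What does W4A16 weight quantization mean?"], sampling_params)
print(outputs[0].outputs[0].text)
```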