abhinavnmagic committed
Commit 8824fca
Parent: 8f06008

Update README.md

Files changed (1): README.md (+21 -16)
README.md CHANGED
@@ -29,7 +29,7 @@ This model was obtained by quantizing the weights of [Phi-3-medium-4k-instruct](
 This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
 
 Only the weights of the linear operators within transformers blocks are quantized. Symmetric group-wise quantization is applied, in which a linear scaling per group maps the INT4 and floating point representations of the quantized weights.
-[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with 1% damping factor, group-size as 128 and 512 sequences sampled from [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).
+The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library. Quantization is performed with a 1% damping factor, a group size of 128, and 512 sequences sampled from [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus).
 
 
 ## Deployment
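
As context for the scheme described in this hunk, here is a minimal, illustrative sketch of symmetric group-wise INT4 (W4A16) quantization. It is not the llm-compressor implementation; the helper name and the use of PyTorch are assumptions. It only shows how one linear scale per group of 128 weights maps between the floating-point and INT4 representations.

```python
import torch

def quantize_w4a16_sketch(weight: torch.Tensor, group_size: int = 128):
    """Illustrative symmetric group-wise INT4 quantization (not the llm-compressor code).

    Assumes the input dimension is a multiple of group_size.
    """
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group maps floating-point values onto the signed 4-bit grid [-8, 7].
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), min=-8, max=7)
    # Dequantizing with the same scales recovers the approximation used at inference.
    w_hat = (q * scales).reshape(out_features, in_features)
    return q.to(torch.int8), scales.squeeze(-1), w_hat
```
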
@@ -63,12 +63,11 @@ generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
-vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 ### Use with transformers
 
-This model is supported by Transformers leveraging the integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
-The following example contemplates how the model can be used using the `generate()` function.
+The following example shows how the model can be deployed in Transformers using the `generate()` function.
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
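
As a usage illustration of the OpenAI-compatible serving mentioned in this hunk: the sketch below assumes a vLLM release that provides the `vllm serve` entrypoint and assumes the published repository id matches the save name used later in the card; adjust both as needed.

```python
# Start an OpenAI-compatible server first (repository id assumed), e.g.:
#   vllm serve neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Explain 4-bit weight quantization in one paragraph."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```
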
@@ -112,12 +111,12 @@ print(tokenizer.decode(response, skip_special_tokens=True))
 
 ## Creation
 
-This model was created by applying the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library as presented in the code snipet below.
-Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor) which supports several quantization schemes and models not supported by AutoGPTQ.
+This model was created by using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as presented in the code snippet below.
 
 ```python
 from transformers import AutoTokenizer
-from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
 from datasets import load_dataset
 import random
 
@@ -141,21 +140,27 @@ examples = [
 ) for example in ds
 ]
 
-quantize_config = BaseQuantizeConfig(
-    bits=4,
-    group_size=128,
-    desc_act=True,
-    model_file_base_name="model",
-    damp_percent=0.01,
+recipe = GPTQModifier(
+    targets="Linear",
+    scheme="W4A16",
+    ignore=["lm_head"],
+    dampening_frac=0.1,
 )
 
-model = AutoGPTQForCausalLM.from_pretrained(
+model = SparseAutoModelForCausalLM.from_pretrained(
     model_id,
-    quantize_config,
     device_map="auto",
+    trust_remote_code=True,
+)
+
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=max_seq_len,
+    num_calibration_samples=num_samples,
 )
 
-model.quantize(examples)
 model.save_pretrained("Phi-3-medium-128k-instruct-quantized.w4a16")
 ```
 
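
The calibration-data preparation referenced above (`ds`, `num_samples`, `max_seq_len`) falls outside the diff context. A minimal sketch of what it might look like is given below; the base-model id, the `instruction` field name, and the sequence length are assumptions, while the 512-sample count and the Open-Platypus dataset come from the card itself.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-medium-128k-instruct"  # assumed base-model id
num_samples = 512    # 512 calibration sequences, as stated in the card
max_seq_len = 4096   # assumed calibration sequence length

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Sample calibration prompts from Open-Platypus and render them with the chat template.
ds = load_dataset("garage-bAInd/Open-Platypus", split="train")
ds = ds.shuffle(seed=42).select(range(num_samples))
ds = ds.map(
    lambda example: {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["instruction"]}],
            tokenize=False,
        )
    }
)
```
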