Lin-K76 committed
Commit f722512
1 Parent(s): 97b8185

Update README.md

Files changed (1)
  1. README.md +19 -37
README.md CHANGED
@@ -13,54 +13,36 @@ tags:
* <h3 style="display: inline;">Model Developers:</h3> Neural Magic

Phi-3-mini-128k-instruct quantized to FP8 weights and activations using per-tensor quantization through the [AutoFP8 repository](https://github.com/neuralmagic/AutoFP8), ready for inference with vLLM >= 0.5.0.
- Calibrated with one repeat of each token in the tokenizer, in random order, to achieve ~100% performance recovery on the Open LLM Benchmark evaluations.
+ Calibrated with 512 UltraChat samples to achieve ~100% performance recovery on the Open LLM Benchmark evaluations.
Reduces space on disk by ~50%.
Part of the [FP8 LLMs for vLLM collection](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).


## Usage and Creation
- Produced using AutoFP8 with random tokens as calibration, based on [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py).
+ Produced using [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py).

```python
from datasets import load_dataset
from transformers import AutoTokenizer
- import numpy as np
- import torch

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

- MODEL_DIR = "microsoft/Phi-3-mini-128k-instruct"
- final_model_dir = MODEL_DIR.split("/")[-1]
-
- CONTEXT_LENGTH = 4096
- NUM_SAMPLES = 512
- NUM_REPEATS = 1
+ pretrained_model_dir = "microsoft/Phi-3-mini-128k-instruct"
+ quantized_model_dir = "Phi-3-mini-128k-instruct-FP8"

- pretrained_model_dir = MODEL_DIR
- tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=CONTEXT_LENGTH)
+ tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
tokenizer.pad_token = tokenizer.eos_token

- tokenizer_num_tokens = len(list(tokenizer.get_vocab().values()))
- total_token_samples = NUM_REPEATS * tokenizer_num_tokens
- num_random_samp = -(-total_token_samples // CONTEXT_LENGTH)
+ ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
+ examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
+ examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

- input_ids = np.tile(np.arange(tokenizer_num_tokens), NUM_REPEATS + 1)[:num_random_samp * CONTEXT_LENGTH]
- np.random.shuffle(input_ids)
- input_ids = input_ids.reshape(num_random_samp, CONTEXT_LENGTH)
- input_ids = torch.tensor(input_ids, dtype=torch.int64).to("cuda")
+ quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

- quantize_config = BaseQuantizeConfig(
-     quant_method="fp8",
-     activation_scheme="static",
+ model = AutoFP8ForCausalLM.from_pretrained(
+     pretrained_model_dir, quantize_config=quantize_config
)
-
- examples = input_ids
-
- model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
-
model.quantize(examples)
-
- quantized_model_dir = f"{final_model_dir}-FP8"
model.save_quantized(quantized_model_dir)
```
 
@@ -948,11 +930,11 @@ Evaluated on the Open LLM Leaderboard evaluations through vLLM.
### Open LLM Leaderboard evaluation scores
| | Phi-3-mini-128k-instruct | neuralmagic/Phi-3-mini-128k-instruct-FP8<br>(this model) |
| :------------------: | :----------------------: | :------------------------------------------------: |
- | arc-c<br>25-shot | 63.65 | 64.33 |
- | hellaswag<br>10-shot | 79.76 | 79.61 |
- | mmlu<br>5-shot | 68.10 | 67.78 |
- | truthfulqa<br>0-shot | 53.97 | 52.95 |
- | winogrande<br>5-shot | 73.72 | 73.40 |
- | gsm8k<br>5-shot | 75.59 | 74.22 |
- | **Average<br>Accuracy** | **69.13** | **68.72** |
- | **Recovery** | **100%** | **99.40%** |
+ | arc-c<br>25-shot | 63.65 | 64.24 |
+ | hellaswag<br>10-shot | 79.76 | 79.79 |
+ | mmlu<br>5-shot | 68.10 | 67.93 |
+ | truthfulqa<br>0-shot | 53.97 | 53.50 |
+ | winogrande<br>5-shot | 73.72 | 74.11 |
+ | gsm8k<br>5-shot | 75.59 | 74.37 |
+ | **Average<br>Accuracy** | **69.13** | **68.99** |
+ | **Recovery** | **100%** | **99.80%** |
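
For intuition about the `quant_method="fp8"` / `activation_scheme="static"` settings used above: static per-tensor FP8 quantization stores a single scale per tensor, chosen so the tensor's absolute maximum maps onto the largest finite E4M3 value (448). A minimal PyTorch sketch of the idea, not AutoFP8's actual implementation:

```python
import torch

def per_tensor_fp8_quantize(x: torch.Tensor):
    """Illustrative static per-tensor FP8 (E4M3) quantization:
    one scale for the whole tensor, derived from its absolute maximum."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = per_tensor_fp8_quantize(w)
w_dequant = w_fp8.to(torch.float32) * w_scale  # approximates w up to FP8 rounding
```

For activations, "static" means the scales are frozen from the calibration pass rather than recomputed per batch at inference time, which is why the choice of calibration samples above matters.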
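The resulting checkpoint loads directly in vLLM's offline API. A minimal inference sketch (prompt and sampling settings are illustrative; vLLM >= 0.5.0 detects the FP8 scheme from the checkpoint's quantization config):

```python
from vllm import LLM, SamplingParams

# vLLM picks up the FP8 quantization config from the checkpoint,
# so no explicit quantization argument is required.
llm = LLM(model="neuralmagic/Phi-3-mini-128k-instruct-FP8")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain FP8 quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```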
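The leaderboard numbers above can be approximated with lm-evaluation-harness on the vLLM backend; a sketch for a single row (gsm8k, 5-shot), with the caveat that the exact Open LLM Leaderboard task configurations may differ from this plain invocation:

```python
import lm_eval

# Approximate one row of the table: gsm8k at 5-shot through the vLLM backend.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Phi-3-mini-128k-instruct-FP8",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```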