Lin-K76 committed
Commit f722512
1 Parent(s): 97b8185

Update README.md

Files changed (1)
  1. README.md +19 -37
README.md CHANGED
@@ -13,54 +13,36 @@ tags:
* <h3 style="display: inline;">Model Developers:</h3> Neural Magic

Phi-3-mini-128k-instruct quantized to FP8 weights and activations using per-tensor quantization through the [AutoFP8 repository](https://github.com/neuralmagic/AutoFP8), ready for inference with vLLM >= 0.5.0.
- Calibrated with one repeat of each token in the tokenizer, in random order, to achieve ~100% performance recovery on the Open LLM Benchmark evaluations.
+ Calibrated with 512 UltraChat samples to achieve ~100% performance recovery on the Open LLM Benchmark evaluations.
Reduces space on disk by ~50%.
Part of the [FP8 LLMs for vLLM collection](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).


## Usage and Creation
- Produced using AutoFP8 with random tokens as calibration, based on [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py).
+ Produced using [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py).

```python
from datasets import load_dataset
from transformers import AutoTokenizer
- import numpy as np
- import torch

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

- MODEL_DIR = "microsoft/Phi-3-mini-128k-instruct"
- final_model_dir = MODEL_DIR.split("/")[-1]
-
- CONTEXT_LENGTH = 4096
- NUM_SAMPLES = 512
- NUM_REPEATS = 1
+ pretrained_model_dir = "microsoft/Phi-3-mini-128k-instruct"
+ quantized_model_dir = "Phi-3-mini-128k-instruct-FP8"

- pretrained_model_dir = MODEL_DIR
- tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=CONTEXT_LENGTH)
+ tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
tokenizer.pad_token = tokenizer.eos_token

- tokenizer_num_tokens = len(list(tokenizer.get_vocab().values()))
- total_token_samples = NUM_REPEATS * tokenizer_num_tokens
- num_random_samp = -(-total_token_samples // CONTEXT_LENGTH)
+ ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
+ examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
+ examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

- input_ids = np.tile(np.arange(tokenizer_num_tokens), NUM_REPEATS + 1)[:num_random_samp * CONTEXT_LENGTH]
- np.random.shuffle(input_ids)
- input_ids = input_ids.reshape(num_random_samp, CONTEXT_LENGTH)
- input_ids = torch.tensor(input_ids, dtype=torch.int64).to("cuda")
+ quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

- quantize_config = BaseQuantizeConfig(
-     quant_method="fp8",
-     activation_scheme="static",
+ model = AutoFP8ForCausalLM.from_pretrained(
+     pretrained_model_dir, quantize_config=quantize_config
)
-
- examples = input_ids
-
- model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
-
model.quantize(examples)
-
- quantized_model_dir = f"{final_model_dir}-FP8"
model.save_quantized(quantized_model_dir)
```
 
@@ -948,11 +930,11 @@ Evaluated on the Open LLM Leaderboard evaluations through vLLM.
### Open LLM Leaderboard evaluation scores
| | Phi-3-mini-128k-instruct | neuralmagic/Phi-3-mini-128k-instruct-FP8<br>(this model) |
| :------------------: | :----------------------: | :------------------------------------------------: |
- | arc-c<br>25-shot | 63.65 | 64.33 |
- | hellaswag<br>10-shot | 79.76 | 79.61 |
- | mmlu<br>5-shot | 68.10 | 67.78 |
- | truthfulqa<br>0-shot | 53.97 | 52.95 |
- | winogrande<br>5-shot | 73.72 | 73.40 |
- | gsm8k<br>5-shot | 75.59 | 74.22 |
- | **Average<br>Accuracy** | **69.13** | **68.72** |
- | **Recovery** | **100%** | **99.40%** |
+ | arc-c<br>25-shot | 63.65 | 64.24 |
+ | hellaswag<br>10-shot | 79.76 | 79.79 |
+ | mmlu<br>5-shot | 68.10 | 67.93 |
+ | truthfulqa<br>0-shot | 53.97 | 53.50 |
+ | winogrande<br>5-shot | 73.72 | 74.11 |
+ | gsm8k<br>5-shot | 75.59 | 74.37 |
+ | **Average<br>Accuracy** | **69.13** | **68.99** |
+ | **Recovery** | **100%** | **99.80%** |
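
For intuition about the `quant_method="fp8"` / `activation_scheme="static"` settings used above: static per-tensor FP8 quantization stores a single scale per tensor, chosen so the tensor's absolute maximum maps onto the largest finite E4M3 value (448). A minimal PyTorch sketch of the idea, not AutoFP8's actual implementation:

```python
import torch

def per_tensor_fp8_quantize(x: torch.Tensor):
    """Illustrative static per-tensor FP8 (E4M3) quantization:
    one scale for the whole tensor, derived from its absolute maximum."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = per_tensor_fp8_quantize(w)
w_dequant = w_fp8.to(torch.float32) * w_scale  # approximates w up to FP8 rounding
```

For activations, "static" means the scales are frozen from the calibration pass rather than recomputed per batch at inference time, which is why the choice of calibration samples above matters.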
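The resulting checkpoint loads directly in vLLM's offline API. A minimal inference sketch (prompt and sampling settings are illustrative; vLLM >= 0.5.0 detects the FP8 scheme from the checkpoint's quantization config):

```python
from vllm import LLM, SamplingParams

# vLLM picks up the FP8 quantization config from the checkpoint,
# so no explicit quantization argument is required.
llm = LLM(model="neuralmagic/Phi-3-mini-128k-instruct-FP8")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain FP8 quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```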
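The leaderboard numbers above can be approximated with lm-evaluation-harness on the vLLM backend; a sketch for a single row (gsm8k, 5-shot), with the caveat that the exact Open LLM Leaderboard task configurations may differ from this plain invocation:

```python
import lm_eval

# Approximate one row of the table: gsm8k at 5-shot through the vLLM backend.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Phi-3-mini-128k-instruct-FP8",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```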