Phi-3-mini-128k-instruct-FP8 / README.md

Update README.md

3f04528 verified 4 months ago

4.34 kB

	---
	tags:
	- fp8
	- vllm
	---

	# Phi-3-mini-128k-instruct-FP8

	## Model Overview
	* <h3 style="display: inline;">Model Architecture:</h3> Based on and identical to the Phi-3-mini-128k-instruct architecture
	* <h3 style="display: inline;">Model Optimizations:</h3> Weights and activations quantized to FP8
	* <h3 style="display: inline;">Release Date:</h3> June 29, 2024
	* <h3 style="display: inline;">Model Developers:</h3> Neural Magic

	Phi-3-mini-128k-instruct quantized to FP8 weights and activations using per-tensor quantization through the [AutoFP8 repository](https://github.com/neuralmagic/AutoFP8), ready for inference with vLLM >= 0.5.0.
	Calibrated with 512 UltraChat samples to achieve 100% performance recovery on the Open LLM Benchmark evaluations.
	Reduces space on disk by ~50%.
	Part of the [FP8 LLMs for vLLM collection](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).


	## Usage and Creation
	Produced using [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py).

	```python
	from datasets import load_dataset
	from transformers import AutoTokenizer

	from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

	pretrained_model_dir = "microsoft/Phi-3-mini-128k-instruct"
	quantized_model_dir = "Phi-3-mini-128k-instruct-FP8"

	tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
	tokenizer.pad_token = tokenizer.eos_token

	ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
	examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
	examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

	quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

	model = AutoFP8ForCausalLM.from_pretrained(
	pretrained_model_dir, quantize_config=quantize_config
	)
	model.quantize(examples)
	model.save_quantized(quantized_model_dir)
	```

	Evaluated through vLLM>=0.5.1 with the following script:

	```bash
	#!/bin/bash

	# Example usage:
	# CUDA_VISIBLE_DEVICES=0 ./eval_openllm.sh "neuralmagic/Phi-3-mini-128k-instruct-FP8" "tensor_parallel_size=1,max_model_len=4096,add_bos_token=True,gpu_memory_utilization=0.7"

	export MODEL_DIR=${1}
	export MODEL_ARGS=${2}

	declare -A tasks_fewshot=(
	["arc_challenge"]=25
	["winogrande"]=5
	["truthfulqa_mc2"]=0
	["hellaswag"]=10
	["mmlu"]=5
	["gsm8k"]=5
	)

	declare -A batch_sizes=(
	["arc_challenge"]="auto"
	["winogrande"]="auto"
	["truthfulqa_mc2"]="auto"
	["hellaswag"]="auto"
	["mmlu"]=1
	["gsm8k"]="auto"
	)

	for TASK in "${!tasks_fewshot[@]}"; do
	NUM_FEWSHOT=${tasks_fewshot[$TASK]}
	BATCH_SIZE=${batch_sizes[$TASK]}
	lm_eval --model vllm \
	--model_args pretrained=$MODEL_DIR,$MODEL_ARGS \
	--tasks ${TASK} \
	--num_fewshot ${NUM_FEWSHOT} \
	--write_out \
	--show_config \
	--device cuda \
	--batch_size ${BATCH_SIZE} \
	--output_path="results/${TASK}"
	done
	```


	## Evaluation

	Evaluated on the Open LLM Leaderboard evaluations through vLLM.

	### Open LLM Leaderboard evaluation scores
	\| \| Phi-3-mini-128k-instruct-FP8 \| neuralmagic/Phi-3-mini-128k-instruct-FP8<br>(this model) \|
	\| :------------------: \| :----------------------: \| :------------------------------------------------: \|
	\| arc-c<br>25-shot \| 63.65 \| 64.24 \|
	\| hellaswag<br>10-shot \| 79.76 \| 79.79 \|
	\| mmlu<br>5-shot \| 68.10 \| 67.93 \|
	\| truthfulqa<br>0-shot \| 53.97 \| 53.50 \|
	\| winogrande<br>5-shot \| 73.72 \| 74.11 \|
	\| gsm8k<br>5-shot \| 75.59 \| 74.37 \|
	\| Average<br>Accuracy \| 69.13 \| 68.99 \|
	\| Recovery \| 100% \| 99.80% \|