What are the differences between yours and Meta's official one?
#2 opened by c6sneaky
Here is the link to the official fp8 quant: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
Meta skipped quantization of the QKV and output-projection matrices in every layer, and left the first and last layers entirely unquantized. This breaks down to:
- Meta FP8: 325B out of 410B params quantized (80%)
- NM (Neural Magic) FP8: 406B out of 410B params quantized (99%)
For NM this comes with 99.9% accuracy recovery and roughly 80 GB less memory than Meta's checkpoint: the extra ~81B parameters quantized each drop from 2 bytes (BF16) to 1 byte (FP8).
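For concreteness, here is a minimal sketch of how an FP8 checkpoint with this coverage can be produced with LLM Compressor, following the recipe style in its README. The model id and output path are illustrative, the import path may vary by version, and the 405B model additionally needs multi-GPU sharding or CPU offloading that is omitted here:

```python
# Sketch only, not the exact script NM used: FP8 quantization with llm-compressor.
# Import paths and arguments follow the project README and may vary by version;
# the 405B model also needs multi-GPU sharding / CPU offloading, omitted here.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Quantize the weights of every Linear layer to FP8 (E4M3) and compute
# activation scales dynamically at runtime; skip only the lm_head.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # assumed model id
    recipe=recipe,
    output_dir="Meta-Llama-3.1-405B-Instruct-FP8-dynamic",  # assumed path
)
```

The `ignore=["lm_head"]` line is what leaves only the last few billion parameters unquantized, which is the main structural difference from Meta's much larger skip list.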
Hi guys! Thank you for your work.
Meta used FBGEMM (https://github.com/pytorch/FBGEMM) and you used LLM Compressor (https://github.com/vllm-project/llm-compressor). I haven't done extensive research, but could you clarify the main differences between their quantization procedures?
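Not speaking for either team, but the core step both toolkits perform looks roughly like the symmetric per-tensor FP8 (E4M3) weight quantization below. This is an illustrative sketch, not FBGEMM's or LLM Compressor's actual code; one place the procedures can differ is scale granularity (per-tensor vs per-row/per-channel scales):

```python
# Rough sketch of symmetric per-tensor FP8 (E4M3) weight quantization.
# Illustrative only; not either library's actual implementation.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

def quantize_weight_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (w_fp8, scale) such that w ~= w_fp8.to(w.dtype) * scale."""
    # Symmetric per-tensor scale mapping the max weight magnitude to 448
    scale = w.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    w_fp8 = (w / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_weight_fp8(w)
# Round-trip to estimate the quantization error
err = (w_fp8.to(torch.float32) * scale - w).abs().max()
print(f"scale={scale.item():.6f}, max abs error={err.item():.4f}")
```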