FLAN-UL2 performance: INT8 worse than BF16

by nelsonspbr - opened

I am running inference following https://huggingface.co/google/flan-ul2#running-the-model. I tested both loading methods: INT8 (load_in_8bit) and BF16 (torch_dtype=torch.bfloat16). In my experiments, INT8 is ~3x slower than BF16. For reference, these are the most executed kernels for INT8:


Is this "INT8" path actually mixed precision? Would that start to explain why it is slower?
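For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two loading paths could be timed. It is not the exact script used for the numbers above. The `from_pretrained` arguments follow the model card; the helper names (`time_generate`, `load_flan_ul2`, `benchmark`), the prompt, the warmup/run counts, and `max_new_tokens=64` are assumptions made for illustration. It also assumes a CUDA GPU with `bitsandbytes` and `accelerate` installed.

```python
import time


def time_generate(generate_fn, n_warmup=1, n_runs=3):
    """Return the mean wall-clock seconds per call of generate_fn()."""
    for _ in range(n_warmup):
        generate_fn()  # warmup calls are not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    return (time.perf_counter() - start) / n_runs


def load_flan_ul2(mode):
    """Load google/flan-ul2 in "int8" or "bf16" mode (hypothetical helper)."""
    if mode not in ("int8", "bf16"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Heavy imports are kept local so time_generate can be used on its own.
    import torch
    from transformers import T5ForConditionalGeneration

    if mode == "int8":
        # As on the model card; newer transformers releases may prefer
        # passing a quantization config instead of this flag.
        return T5ForConditionalGeneration.from_pretrained(
            "google/flan-ul2", device_map="auto", load_in_8bit=True
        )
    return T5ForConditionalGeneration.from_pretrained(
        "google/flan-ul2", device_map="auto", torch_dtype=torch.bfloat16
    )


def benchmark(mode, prompt, max_new_tokens=64):
    """Time generation for one loading mode; returns mean seconds per call."""
    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
    model = load_flan_ul2(mode)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    def run():
        with torch.no_grad():
            model.generate(input_ids, max_new_tokens=max_new_tokens)
        # Wait for queued GPU kernels so the timing covers the real work.
        torch.cuda.synchronize()

    return time_generate(run)


# Example usage (downloads the full model, so it is left commented out):
# for mode in ("bf16", "int8"):
#     print(mode, benchmark(mode, "Translate to German: How old are you?"))
```

Running each mode in a separate process would keep the two models from competing for GPU memory.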
