Quantization
Quantization techniques focus on representing data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating point and they’re quantized to 16-bit floating point, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.
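As a minimal sketch (using PyTorch with an arbitrary weight tensor purely for illustration), halving the precision of a tensor halves its storage footprint:

```python
import torch

# An illustrative 1,000 x 1,000 weight matrix stored in 32-bit floating point
weights_fp32 = torch.randn(1000, 1000, dtype=torch.float32)

# Convert it to 16-bit floating point
weights_fp16 = weights_fp32.to(torch.float16)

print(weights_fp32.element_size() * weights_fp32.nelement())  # 4000000 bytes
print(weights_fp16.element_size() * weights_fp16.nelement())  # 2000000 bytes
```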
Interested in adding a new quantization method to Transformers? Refer to the Contribute new quantization method guide to learn how.
If you are new to the quantization field, we recommend checking out these beginner-friendly courses about quantization, created in collaboration with DeepLearning.AI:
When to use what?
This section will be expanded once Diffusers has multiple quantization backends. Currently, we only support bitsandbytes. This resource provides a good overview of the pros and cons of different quantization techniques.
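For reference, loading a quantized model with bitsandbytes in Diffusers typically looks something like the following sketch (the model ID, subfolder, and 4-bit settings are illustrative; see the bitsandbytes guide for the full set of options):

```python
import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel

# Illustrative 4-bit configuration; load_in_8bit=True is the 8-bit alternative
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Example checkpoint; swap in the model you actually want to quantize
model = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```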