Quantization Using Brevitas
BrevitasQuantizer
Handles the Brevitas quantization process for models shared on huggingface.co/models.
from_pretrained
( model_name_or_path: str, subfolder: str = '', revision: Optional[str] = None, cache_dir: Optional[str] = None, trust_remote_code: bool = False, force_download: bool = False, local_files_only: bool = False, use_auth_token: Union = None, device_map: Union = None, **model_kwargs )
Parameters
- **model_name_or_path** (`Union[str, Path]`) — Can be either the model id of a model repo on the Hugging Face Hub, or a path to a local directory containing a model.
- **subfolder** (`str`, defaults to `""`) — In case the model files are located inside a subfolder of the model directory / repo on the Hugging Face Hub, you can specify the subfolder name here.
- **revision** (`Optional[str]`, optional, defaults to `None`) — The specific model version to use. It can be a branch name, a tag name, or a commit id.
- **cache_dir** (`Optional[str]`, optional) — Path to the directory in which the downloaded pretrained model weights have been cached, if the standard cache should not be used.
- **trust_remote_code** (`bool`, defaults to `False`) — Allows using custom modeling code hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it will execute arbitrary code present in the model repository on your local machine.
- **force_download** (`bool`, defaults to `False`) — Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
- **local_files_only** (`Optional[bool]`, defaults to `False`) — Whether or not to only look at local files (i.e., do not try to download the model).
- **use_auth_token** (`Optional[str]`, defaults to `None`) — The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `transformers-cli login` (stored in `~/.huggingface`).
Loads the BrevitasQuantizer and model.
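A minimal usage sketch, assuming `optimum-amd` is installed; the model id `"facebook/opt-125m"` is only a placeholder. The broad exception guard is there solely so the sketch degrades gracefully when the optional dependency or network access is unavailable:

```python
# Hedged sketch: "facebook/opt-125m" is a placeholder model id, and the
# try/except only keeps the sketch runnable when optimum-amd (or network
# access to the Hub) is missing.
try:
    from optimum.amd import BrevitasQuantizer

    # Downloads the model (or loads it from the local cache) and wraps it
    # in a quantizer ready for a later quantize() call.
    quantizer = BrevitasQuantizer.from_pretrained("facebook/opt-125m")
except Exception:
    quantizer = None  # optimum-amd not installed, or model not reachable
```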
quantize
( quantization_config: BrevitasQuantizationConfig, calibration_dataset: Optional[List[Dict]] = None )
Parameters
- **quantization_config** (`BrevitasQuantizationConfig`) — Quantization configuration to use to quantize the model.
- **calibration_dataset** (`Optional[List[Dict]]`, defaults to `None`) — In case the quantization involves a calibration phase, this argument needs to be specified as a list of inputs to the model. For example, `calibration_dataset = [{"input_ids": torch.tensor([[1, 2, 3, 4]])}, {"input_ids": torch.tensor([[6, 7, 3, 4]])}]` is a dataset with two samples for a model taking `input_ids` as an argument.
Quantizes the model using Brevitas according to the `quantization_config`.
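The expected shape of a calibration dataset can be sketched in plain Python (in real usage each value is a torch tensor, e.g. `torch.tensor([[1, 2, 3, 4]])`; plain nested lists stand in for them here):

```python
# Structural sketch of a calibration dataset for quantize(): a list of
# dicts, each mapping a model input name to one sample. Real usage wraps
# the values in torch tensors; nested lists stand in for them here.
calibration_dataset = [
    {"input_ids": [[1, 2, 3, 4]]},  # first sample
    {"input_ids": [[6, 7, 3, 4]]},  # second sample
]

# Every sample should provide the same input names that the model's
# forward pass expects.
input_names = {name for sample in calibration_dataset for name in sample}
```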
BrevitasQuantizationConfig
class optimum.amd.BrevitasQuantizationConfig
( weights_bitwidth: int = 8, activations_bitwidth: Optional[int] = 8, weights_only: bool = False, weights_param_method: Literal['stats', 'mse'] = 'stats', weights_symmetric: bool = True, scale_precision: Literal['float_scale', 'power_of_two_scale'] = 'float_scale', weights_quant_granularity: Literal['per_tensor', 'per_channel', 'per_group'] = 'per_tensor', weights_group_size: Optional[int] = None, quantize_zero_point: bool = True, activations_param_method: Optional[Literal['stats', 'mse']] = 'stats', is_static: bool = False, activations_symmetric: Optional[bool] = False, activations_quant_granularity: Optional[Literal['per_tensor', 'per_row', 'per_group']] = 'per_tensor', activations_group_size: Optional[int] = None, activations_equalization: Optional[Literal['layerwise', 'cross_layer']] = 'cross_layer', apply_weight_equalization: bool = False, apply_bias_correction: bool = False, apply_gptq: bool = False, gptq_act_order: Optional[bool] = None, device: str = 'auto', layers_to_exclude: Optional[List] = None, gpu_device_map: Optional = None, cpu_device_map: Optional = None )
Parameters
- **weights_bitwidth** (`int`, defaults to `8`) — Bitwidth of the weights quantization. For example, with `weights_bitwidth=8`, each weight value is quantized on 8 bits.
- **activations_bitwidth** (`Optional[int]`, defaults to `8`) — Bitwidth of the activations quantization.
- **weights_only** (`bool`, defaults to `False`) — If set to `True`, only weights are quantized; otherwise, activations are quantized as well.
- **weights_param_method** (`str`, defaults to `"stats"`) — Strategy used to estimate the quantization parameters (scale, zero-point) for the weights. Two strategies are available:
  - `"stats"`: Use min-max to estimate the range to quantize on.
  - `"mse"`: Use the mean-square error between the unquantized and quantized weights to estimate the range to quantize on.
- **weights_symmetric** (`bool`, defaults to `True`) — Whether to use symmetric quantization on the weights.
- **scale_precision** (`str`, defaults to `"float_scale"`) — Constraints on the scale. Can either be `"float_scale"` (arbitrary scales) or `"power_of_two_scale"` (scales constrained to be a power of 2).
- **weights_quant_granularity** (`str`, defaults to `"per_tensor"`) — The granularity of the quantization of the weights. This parameter can either be:
  - `"per_tensor"`: A single scale (and possibly zero-point) is used for one weight matrix.
  - `"per_channel"`: Each column (outer dimension) of the weight matrix has its own scale (and possibly zero-point).
  - `"per_group"`: Each column of the weight matrix may have several scales, grouped by `weights_group_size`.
- **weights_group_size** (`Optional[int]`, defaults to `None`) — Group size to use for the weights in case `weights_quant_granularity="per_group"`. Defaults to `128` in that case, to `None` otherwise.
- **quantize_zero_point** (`bool`, defaults to `True`) — When set to `True`, the unquantized value 0.0 is exactly representable as a quantized value: the zero point. When set to `False`, a quantization range `[a, b]` is exactly representable (no rounding on `a` and `b`), but the unquantized value zero is not exactly representable.
- **activations_param_method** (`Optional[str]`, defaults to `"stats"`) — Strategy used to estimate the quantization parameters (scale, zero-point) for the activations. Two strategies are available:
  - `"stats"`: Use min-max to estimate the range to quantize on.
  - `"mse"`: Use the mean-square error between the unquantized and quantized activations to estimate the range to quantize on.
- **is_static** (`bool`, defaults to `False`) — Whether to apply static quantization (activation quantization parameters estimated ahead of time) or dynamic quantization (activation quantization parameters computed at runtime).
- **activations_symmetric** (`bool`, defaults to `False`) — Whether to use symmetric quantization on the activations.
- **activations_quant_granularity** (`str`, defaults to `"per_tensor"`) — The granularity of the quantization of the activations. This parameter can either be `"per_tensor"`, `"per_row"` or `"per_group"`. In case static quantization is used (`is_static=True`), only `"per_tensor"` may be used.
- **activations_group_size** (`Optional[int]`, defaults to `None`) — Group size to use for the activations in case `activations_quant_granularity="per_group"`. Defaults to `64` in that case, to `None` otherwise.
- **activations_equalization** (`Optional[str]`, defaults to `"cross_layer"`) — Whether to apply activation equalization (SmoothQuant). Possible options are:
  - `None`: No activation equalization.
  - `"layerwise"`: Apply SmoothQuant as described in https://arxiv.org/abs/2211.10438. The activation rescaling is added as a multiplication node that is not fused within a preceding layer.
  - `"cross_layer"`: Apply SmoothQuant, and fuse the activation rescaling within a preceding layer when possible (for example, nn.LayerNorm followed by nn.Linear). This is achieved through a graph capture of the model using torch.fx.
- **apply_weight_equalization** (`bool`, defaults to `False`) — Applies weight equalization across layers, following https://arxiv.org/abs/1906.04721. This parameter is useful for models whose activation function is linear or piecewise-linear (like ReLU, used in the OPT model), and allows reducing the quantization error of the weights by balancing scales across layers.
- **apply_bias_correction** (`bool`, defaults to `False`) — Applies bias correction to compensate for changes in activation bias caused by quantization.
- **apply_gptq** (`bool`, defaults to `False`) — Whether to apply the GPTQ algorithm for quantizing the weights.
- **gptq_act_order** (`Optional[bool]`, defaults to `None`) — Whether to use activations reordering (act-order, also known as desc-act) when `apply_gptq=True`. If `apply_gptq=True`, defaults to `False`.
- **layers_to_exclude** (`Optional[List]`, defaults to `None`) — Names of the layers that should not be quantized. This should only be the last part of the layer name; if the same name is repeated across multiple layers, they will all be excluded. If left to `None`, the last linear layer is automatically identified and excluded.
BrevitasQuantizationConfig is the configuration class handling all the Brevitas quantization parameters.
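As an illustration of what a few of these options mean (plain Python, not Brevitas internals): with `weights_param_method="stats"`, `weights_symmetric=True`, `scale_precision="float_scale"` and `weights_bitwidth=8`, the scale is estimated from the maximum absolute value, and each weight is rounded to an integer in [-128, 127]:

```python
# Conceptual sketch of per-tensor symmetric quantization with a min-max
# ("stats") scale; this is an illustration, not the Brevitas implementation.
def symmetric_minmax_quantize(weights, bitwidth=8):
    qmax = 2 ** (bitwidth - 1) - 1  # 127 for 8 bits
    # "stats" strategy: the range comes from the extreme values
    # (assumes a nonzero tensor); "float_scale" allows any float here.
    scale = max(abs(w) for w in weights) / qmax
    # Round each weight to the nearest representable integer, clamped
    # to the signed range [-qmax - 1, qmax].
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.27]
q, scale = symmetric_minmax_quantize(weights)  # scale = 1.27 / 127 = 0.01
deq = dequantize(q, scale)
```

With `weights_quant_granularity="per_channel"`, the same procedure would run once per column of the weight matrix instead of once per tensor; with `"per_group"`, once per group of `weights_group_size` values within a column, each group getting its own scale.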