Preparing the Model
Text Generation Inference (TGI) improves how the model is loaded and served in several ways.
Quantization
TGI supports bitsandbytes, GPT-Q, AWQ, Marlin, EETQ, EXL2, and fp8 quantization. To speed up inference with quantization, set the --quantize flag to bitsandbytes, gptq, awq, marlin, exl2, eetq, or fp8, depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to a model that has already been quantized with GPT-Q. Similarly, when using AWQ quantization, you need to point to an AWQ-quantized model. For more information about quantization, please refer to the quantization guide.
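As a rough illustration of the flag described above, the following sketch assembles a launcher command line and rejects quantization values that are not in the list. The helper name build_launcher_command and the model ID are hypothetical placeholders; only the text-generation-launcher binary and its --model-id and --quantize flags are taken from TGI itself.

```python
import shlex

# Values the text above lists for the --quantize flag.
SUPPORTED_QUANTIZATION = {
    "bitsandbytes", "gptq", "awq", "marlin", "exl2", "eetq", "fp8",
}


def build_launcher_command(model_id, quantize):
    """Return the launcher argument list for a quantized model.

    Hypothetical helper for illustration; it only builds the command
    and does not start the server.
    """
    if quantize not in SUPPORTED_QUANTIZATION:
        raise ValueError(f"unsupported quantization method: {quantize!r}")
    return [
        "text-generation-launcher",
        "--model-id", model_id,
        "--quantize", quantize,
    ]


# "your-org/your-model-GPTQ" is a placeholder, not a real repository.
command = build_launcher_command("your-org/your-model-GPTQ", "gptq")
print(shlex.join(command))
```

Validating the value before launching avoids waiting for the server to start only to fail on a misspelled method name.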
RoPE Scaling
RoPE scaling can be used to increase the sequence length of the model at inference time without necessarily fine-tuning it. To enable RoPE scaling, pass the --rope-scaling, --max-input-length, and --rope-factor flags when running through the CLI. --rope-scaling can take the values linear or dynamic. If your model is not fine-tuned to a longer sequence length, use dynamic. --rope-factor is the ratio between the intended max sequence length and the model’s original max sequence length. For example, extending a model with an original max sequence length of 2,048 tokens to 4,096 tokens gives a factor of 2. Make sure to pass --max-input-length to set the maximum input length for the extended context.
We recommend using dynamic RoPE scaling.
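The ratio described above can be computed directly. The sketch below derives the factor from two sequence lengths and assembles the three flags; the function names and the example lengths are hypothetical. Passing the full target length as --max-input-length is also an assumption, since a deployment may prefer a smaller value to leave room for generated tokens.

```python
def rope_factor(target_length, original_length):
    """Ratio of the intended max sequence length to the original one."""
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    if target_length < original_length:
        raise ValueError("target_length must not be shorter than original_length")
    return target_length / original_length


def rope_scaling_args(target_length, original_length, scaling="dynamic"):
    """Build the RoPE-related launcher flags (hypothetical helper)."""
    if scaling not in ("linear", "dynamic"):
        raise ValueError("scaling must be 'linear' or 'dynamic'")
    factor = rope_factor(target_length, original_length)
    return [
        "--rope-scaling", scaling,
        "--rope-factor", str(factor),
        # Assumption: use the full target length as the input limit.
        "--max-input-length", str(target_length),
    ]


# Example lengths are illustrative: a 2048-token model extended to 4096.
args = rope_scaling_args(4096, 2048)
print(" ".join(args))
```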
Safetensors
Safetensors is a fast and safe persistence format for deep learning models, and is required for tensor parallelism. TGI supports safetensors model loading under the hood. By default, given a repository with both safetensors and pytorch weights, TGI will always load the safetensors weights. If the repository contains only pytorch weights, TGI will convert them to safetensors format.
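The preference order can be pictured as a small decision over the files in a repository: load safetensors weights when they exist, otherwise convert pytorch weights. This sketch is not TGI's implementation. The function name choose_weights is hypothetical, and treating files ending in .bin as pytorch weights is an assumption made for illustration.

```python
def choose_weights(filenames):
    """Return ("load", files) or ("convert", files) for a repository listing.

    Illustrative only: prefers safetensors files, and falls back to
    converting pytorch files (assumed here to end in ".bin").
    """
    safetensors_files = sorted(f for f in filenames if f.endswith(".safetensors"))
    pytorch_files = sorted(f for f in filenames if f.endswith(".bin"))

    if safetensors_files:
        # Safetensors weights win even when pytorch weights are also present.
        return "load", safetensors_files
    if pytorch_files:
        # Only pytorch weights exist, so they would be converted first.
        return "convert", pytorch_files
    raise FileNotFoundError("no safetensors or pytorch weight files found")


action, files = choose_weights(
    ["config.json", "model.safetensors", "pytorch_model.bin"]
)
print(action, files)
```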