inference: false
language:
- en
license: other
model_type: llama
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- gptq
Meta's Llama 2 13B GPTQ
These files are GPTQ model files for Meta's Llama 2 13B.
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
Repositories available
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference
- Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions
Prompt template: None
### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:
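As a minimal sketch, the template line above can be filled in with ordinary string formatting. This assumes the `\n` sequences in the template stand for real newlines; the helper name `build_prompt` and the example strings are hypothetical and are not part of this repository.

```python
# Template copied from the line above; "\n" is assumed to mean a newline.
PROMPT_TEMPLATE = "### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:"


def build_prompt(system: str, instruction: str) -> str:
    """Fill the {system} and {instruction} placeholders of the template."""
    return PROMPT_TEMPLATE.format(system=system, instruction=instruction)


# Hypothetical example values, for illustration only.
prompt = build_prompt(
    system="You are a helpful assistant.",
    instruction="Explain GPTQ quantisation in one sentence.",
)
print(prompt)
```

The resulting string ends with `### Response:`, so the model's completion is expected to follow directly after it.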
Provided files
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
Each separate quant is in a different branch. See below for instructions on fetching from different branches.
Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
---|---|---|---|---|---|---|---|
main | 4 | 128 | False | 7.26 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
gptq-8bit-128g-actorder_True | 8 | 128 | True | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
gptq-8bit-64g-actorder_True | 8 | 64 | True | 13.95 GB | False | AutoGPTQ | 8-bit, with group size 64g and Act Order for maximum inference quality. Poor AutoGPTQ CUDA speed. |
gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
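To narrow down the options in the table above, a small filter can help. The sketch below copies the table rows into Python and selects branches by file size and ExLlama compatibility. The names `Quant`, `QUANTS` and `candidates` are hypothetical, and file size is used only as a rough proxy: actual VRAM usage also depends on context length and the loader.

```python
from typing import List, NamedTuple, Optional


class Quant(NamedTuple):
    branch: str
    bits: int
    group_size: Optional[int]  # None means no group size
    act_order: bool
    size_gb: float
    exllama: bool


# Rows copied from the Provided files table.
QUANTS = [
    Quant("main", 4, 128, False, 7.26, True),
    Quant("gptq-4bit-32g-actorder_True", 4, 32, True, 8.00, True),
    Quant("gptq-4bit-64g-actorder_True", 4, 64, True, 7.51, True),
    Quant("gptq-4bit-128g-actorder_True", 4, 128, True, 7.26, True),
    Quant("gptq-8bit-128g-actorder_True", 8, 128, True, 13.65, False),
    Quant("gptq-8bit-64g-actorder_True", 8, 64, True, 13.95, False),
    Quant("gptq-8bit-128g-actorder_False", 8, 128, False, 13.65, False),
    Quant("gptq-8bit--1g-actorder_True", 8, None, True, 13.36, False),
]


def candidates(max_size_gb: float, need_exllama: bool = False) -> List[str]:
    """Return branch names whose file size fits the budget, in table order."""
    return [
        q.branch
        for q in QUANTS
        if q.size_gb <= max_size_gb and (q.exllama or not need_exllama)
    ]


print(candidates(max_size_gb=7.6, need_exllama=True))
```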
How to download from branches
- In text-generation-webui, you can add :branch to the end of the download name, e.g. TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True
- With Git, you can clone a branch with:
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Llama-2-13B-GPTQ
- In Python Transformers code, the branch is the revision parameter; see below.
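The three options above all name the same two things: a repository and a branch. The following sketch shows one way to split a `repo:branch` download name and reuse the parts. The helper names are hypothetical, and falling back to `main` when no branch is given is an assumption based on the branch table. The optional `download` helper assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function, which is not mentioned elsewhere in this card.

```python
from typing import Tuple


def split_download_name(name: str) -> Tuple[str, str]:
    """Split 'repo:branch' into (repo_id, branch); assume 'main' if no branch."""
    repo_id, sep, branch = name.partition(":")
    return repo_id, (branch if sep and branch else "main")


def git_clone_command(name: str) -> str:
    """Build the git command for cloning a single branch of the repository."""
    repo_id, branch = split_download_name(name)
    return f"git clone --branch {branch} https://huggingface.co/{repo_id}"


def download(name: str, local_dir: str) -> str:
    """Fetch one branch with huggingface_hub (assumed installed; not called here)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    repo_id, branch = split_download_name(name)
    # The branch name is passed as the `revision` argument.
    return snapshot_download(repo_id=repo_id, revision=branch, local_dir=local_dir)


print(git_clone_command("TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True"))
```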
How to easily download and use this model in text-generation-webui.
Please make sure you're using the latest version of text-generation-webui.
It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
- Click the Model tab.
- Under Download custom model or LoRA, enter TheBloke/Llama-2-13B-GPTQ.
  - To download from a specific branch, enter for example TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True
  - See Provided Files above for the list of branches for each option.
- Click Download.
- The model will start downloading. Once it's finished it will say "Done".
- In the top left, click the refresh icon next to Model.
- In the Model dropdown, choose the model you just downloaded: Llama-2-13B-GPTQ
- The model will automatically load, and is now ready for use!
- If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
- Note that you do not need to set GPTQ parameters any more. These are set automatically from the file quantize_config.json.
- Once you're ready, click the Text Generation tab and enter a prompt to get started!
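To check which GPTQ parameters will be picked up automatically, the quantize_config.json mentioned in the steps above can be inspected directly. This is a rough sketch: the key names `bits`, `group_size` and `desc_act` are assumed to match the table columns, and the demo writes a made-up file that mirrors the `main` branch row instead of reading a real download.

```python
import json
import tempfile
from pathlib import Path


def read_quantize_config(model_dir: str) -> dict:
    """Load quantize_config.json from a downloaded model directory."""
    with open(Path(model_dir) / "quantize_config.json", encoding="utf-8") as fh:
        return json.load(fh)


# Demo only: create a sample file in a temporary directory.
# The values mirror the `main` row of the table; the key names are assumptions.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"bits": 4, "group_size": 128, "desc_act": False}
    (Path(tmp) / "quantize_config.json").write_text(json.dumps(sample), encoding="utf-8")
    cfg = read_quantize_config(tmp)

print(cfg["bits"], cfg["group_size"], cfg["desc_act"])
```

For a real download, pass the directory that contains the model files instead of the temporary one.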
How to use this GPTQ model from Python code
First make sure you have AutoGPTQ installed:
GITHUB_ACTIONS=true pip install auto-gptq
Then try the following example code:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, get_gptq_peft_model

MODEL_PATH_GPTQ = "Llama-2-13B-GPTQ"   # GPTQ base model
ADAPTER_DIR = "Llama-2-13B-GPTQ-Orca"  # PEFT adapter to load on top of it
DEV = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH_GPTQ, use_fast=True)

# Load the quantised base model.
model = AutoGPTQForCausalLM.from_quantized(
    MODEL_PATH_GPTQ,
    use_safetensors=True,
    trust_remote_code=False,
    use_triton=True,
    device=DEV,
    warmup_triton=False,
    trainable=True,
    inject_fused_attention=True,
    inject_fused_mlp=False,
)

# Attach the adapter for inference (train_mode=False), then switch to eval mode.
model = get_gptq_peft_model(
    model,
    model_id=ADAPTER_DIR,
    train_mode=False,
)
model.eval()
Compatibility
The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.