---
language:
- code
license: llama2
model_creator: Meta
model_name: CodeLlama 13B Instruct
inference: false
base_model:
- meta-llama/CodeLlama-13b-Instruct-hf
pipeline_tag: text-generation
tags:
- llama-2
- tensorrt-llm
- code-llama
prompt_template: >
[INST] Write code to solve the following coding problem that obeys the
constraints and passes the example test cases. Please wrap your code answer
using ```:
{prompt}
[/INST]
quantized_by: TheBloke
---
# CodeLlama 13B Instruct - GPTQ - TensorRT-LLM - RTX4090
- Model creator: [Meta](https://huggingface.co/meta-llama)
- Original model: [CodeLlama 13B Instruct](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf)
- Quantized model: [TheBloke CodeLlama 13B Instruct - GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ)
## Description
This repo contains TensorRT-LLM GPTQ model files for [Meta's CodeLlama 13B Instruct](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf),
built for a single RTX 4090 card with tensorrt_llm version 0.15.0.dev2024101500. It is a 4-bit GPTQ quantization based on the main branch of
the [TheBloke CodeLlama 13B Instruct - GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ) model.
## TensorRT commands
To build this model, the following commands were run from the base folder of the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM)
(see installation instructions in the repository for more information):
```shell
python examples/llama/convert_checkpoint.py \
--model_dir ./CodeLlama-13b-Instruct-hf \
--output_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
--dtype float16 \
--quant_ckpt_path ./CodeLlama-13B-Instruct-GPTQ/model.safetensors \
--use_weight_only \
--weight_only_precision int4_gptq \
--per_group
```
And then:
```shell
trtllm-build \
--checkpoint_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
--output_dir ./CodeLlama-13B-Instruct-GPTQ_TensorRT \
--gemm_plugin float16 \
--max_input_len 8192 \
--max_seq_len 8192
```
## Prompt template: CodeLlama
```
[INST] <<SYS>>
Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
<</SYS>>
{prompt}
[/INST]
```
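For illustration, the template can also be filled programmatically. The helper below is a minimal sketch; the function name and the sample problem are illustrative and not part of this repo:
```python
# Minimal sketch of filling the CodeLlama instruct template (names and problem are illustrative)
SYSTEM_INSTRUCTION = (
    "Write code to solve the following coding problem that obeys the constraints "
    "and passes the example test cases. Please wrap your code answer using ```:"
)

def build_prompt(problem: str) -> str:
    # Wrap the system instruction in <<SYS>> tags and the whole turn in [INST] ... [/INST]
    return f"[INST] <<SYS>>\n{SYSTEM_INSTRUCTION}\n<</SYS>>\n\n{problem} [/INST] "

print(build_prompt("Given a list of integers, return the sum of the even values."))
```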
## How to use this model from Python code
### Using TensorRT-LLM API
#### Install the necessary packages
```shell
pip3 install tensorrt_llm==0.15.0.dev2024101500 -U --pre --extra-index-url https://pypi.nvidia.com
```
Note that this command should not be run from inside a virtual environment; if you do want to use a venv, run it twice, once outside the venv and then again from within it.
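To check that the expected build is installed, you can print the package version as a quick sanity check:
```python
# The reported version should match 0.15.0.dev2024101500
import tensorrt_llm
print(tensorrt_llm.__version__)
```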
#### Use the TensorRT-LLM API
```python
from tensorrt_llm import LLM, SamplingParams
system_prompt = \
    "[INST] <<SYS>>\n" +\
    "Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:" +\
    "\n<</SYS>>\n\n"
# Put the coding problem to solve between the quotes below
user_prompt = \
    "" +\
    " [/INST] "
prompts = [
system_prompt + user_prompt,
]
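# Sampling parameters used for generation; adjust to taste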
sampling_params = SamplingParams(max_tokens=512, temperature=1.31, top_p=0.14, top_k=49, repetition_penalty=1.17)
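# Point the LLM API at the directory containing the TensorRT engine built above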
llm = LLM(model="./CodeLlama-13B-Instruct-GPTQ_TensorRT")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
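As a usage example, and reusing the `llm`, `system_prompt` and `sampling_params` objects from the snippet above, a concrete (purely illustrative) coding problem can be dropped into the empty `user_prompt` slot like this:
```python
# Illustrative only: a sample coding problem inserted into the template
problem = (
    "Given a list of integers, return the sum of the even numbers.\n"
    "Constraints: 1 <= len(nums) <= 10**5\n"
    "Example: sum_even([1, 2, 3, 4]) should return 6"
)
outputs = llm.generate([system_prompt + problem + " [/INST] "], sampling_params)
print(outputs[0].outputs[0].text)
```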
### Using Oobabooga's Text Generation WebUI
Follow the instructions described in https://github.com/oobabooga/text-generation-webui/pull/5715, but use version 0.15.0.dev2024101500 of tensorrt_llm instead of 0.10.0.