---
language:
- code
license: llama2
model_creator: Meta
model_name: CodeLlama 13B Instruct
inference: false
base_model:
- meta-llama/CodeLlama-13b-Instruct-hf
pipeline_tag: text-generation
tags:
- llama-2
- tensorrt-llm
- code-llama
prompt_template: >
  [INST] Write code to solve the following coding problem that obeys the constraints
  and passes the example test cases. Please wrap your code answer using ```:
  {prompt}
  [/INST]
quantized_by: TheBloke
---

# CodeLlama 13B Instruct - GPTQ - TensorRT-LLM - RTX4090

- Model creator: [Meta](https://huggingface.co/meta-llama)
- Original model: [CodeLlama 13B Instruct](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf)
- Quantized model: [TheBloke CodeLlama 13B Instruct - GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ)

## Description

This repo contains TensorRT-LLM GPTQ model files for [Meta's CodeLlama 13B Instruct](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf), built for a single RTX 4090 card using tensorrt_llm version 0.15.0.dev2024101500.

It is a 4-bit quantized version based on the main branch of the [TheBloke CodeLlama 13B Instruct - GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ) model.

## TensorRT commands

To build this model, the following commands were run from the base folder of the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM) (see the installation instructions in that repository for more information):

```shell
python examples/llama/convert_checkpoint.py \
    --model_dir ./CodeLlama-13b-Instruct-hf \
    --output_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
    --dtype float16 \
    --quant_ckpt_path ./CodeLlama-13B-Instruct-GPTQ/model.safetensors \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group
```

And then:

```shell
trtllm-build \
    --checkpoint_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
    --output_dir ./CodeLlama-13B-Instruct-GPTQ_TensorRT \
    --gemm_plugin float16 \
    --max_input_len 8192 \
    --max_seq_len 8192
```

## Prompt template: CodeLlama

```
[INST] <<SYS>>
Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
<</SYS>>

{prompt} [/INST]
```

## How to use this model from Python code

### Using TensorRT-LLM API

#### Install the necessary packages

```shell
pip3 install tensorrt_llm==0.15.0.dev2024101500 -U --pre --extra-index-url https://pypi.nvidia.com
```

Beware that this command should not be run from inside a virtual environment (alternatively, run it twice: once outside the venv and then again from within the venv).

#### Use the TensorRT-LLM API

```python
from tensorrt_llm import LLM, SamplingParams

system_prompt = \
    "[INST] <<SYS>>\n" + \
    "Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:" + \
    "\n<</SYS>>\n\n"

# Put your coding problem in the first string below.
user_prompt = \
    "" + \
    " [/INST] "

prompts = [
    system_prompt + user_prompt,
]

sampling_params = SamplingParams(max_tokens=512, temperature=1.31, top_p=0.14, top_k=49, repetition_penalty=1.17)

llm = LLM(model="./CodeLlama-13B-Instruct-GPTQ_TensorRT")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Using Oobabooga's Text Generation WebUI

Follow the instructions described here: https://github.com/oobabooga/text-generation-webui/pull/5715

Use version 0.15.0.dev2024101500 of tensorrt_llm instead of 0.10.0.
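
### Sanity-checking the built engine

As a quick way to verify the engine produced by the build commands above, the `examples/run.py` script from the same TensorRT-LLM checkout can be pointed at the engine directory. This is a minimal sketch, not part of the original build instructions; it assumes that `examples/run.py` and its `--engine_dir`, `--tokenizer_dir`, `--input_text` and `--max_output_len` options behave the same in your tensorrt_llm version, and it reuses the paths from the build commands:

```shell
# Hypothetical smoke test; paths follow the build commands above.
python examples/run.py \
    --engine_dir ./CodeLlama-13B-Instruct-GPTQ_TensorRT \
    --tokenizer_dir ./CodeLlama-13b-Instruct-hf \
    --max_output_len 200 \
    --input_text "[INST] Write a Python function that reverses a string. [/INST]"
```

If the build succeeded, this should print a short code completion for the prompt.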
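
### Building prompts and extracting code answers

Because the system prompt asks the model to wrap its answer in triple-backtick fences, a small post-processing step can be useful. The helper below is a hypothetical sketch (not part of this repo or of tensorrt_llm): `build_prompt` wraps an arbitrary coding problem in the prompt template documented above, and `extract_code` pulls the first fenced block out of the generated text:

```python
import re

# System instruction from the prompt template documented above.
SYSTEM_INSTRUCTION = (
    "Write code to solve the following coding problem that obeys the constraints "
    "and passes the example test cases. Please wrap your code answer using ```:"
)

def build_prompt(problem: str) -> str:
    """Wrap a coding problem in the CodeLlama instruct template."""
    return f"[INST] <<SYS>>\n{SYSTEM_INSTRUCTION}\n<</SYS>>\n\n{problem} [/INST] "

def extract_code(generated_text: str) -> str:
    """Return the contents of the first ``` fenced block, or the raw text if none is found."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", generated_text, flags=re.DOTALL)
    return match.group(1).strip() if match else generated_text.strip()
```

For example, `prompts = [build_prompt("Write a function that reverses a string.")]` can replace the hand-built prompt in the API example above, and `extract_code(output.outputs[0].text)` yields just the code from each result.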