---
language:
- code
license: llama2
model_creator: Meta
model_name: CodeLlama 13B Instruct
inference: false
base_model:
- meta-llama/CodeLlama-13b-Instruct-hf
pipeline_tag: text-generation
tags:
- llama-2
- tensorrt-llm
- code-llama
prompt_template: >
  [INST] Write code to solve the following coding problem that obeys the constraints
  and passes the example test cases. Please wrap your code answer using ```:
  {prompt}
  [/INST]
quantized_by: TheBloke
---

# CodeLlama 13B Instruct - GPTQ - TensorRT-LLM - RTX4090

- Model creator: [Meta](https://huggingface.co/meta-llama)
- Original model: [CodeLlama 13B Instruct](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf)
- Quantized model: [TheBloke CodeLlama 13B Instruct - GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ)

## Description

This repo contains TensorRT-LLM GPTQ model files for [Meta's CodeLlama 13B Instruct](https://huggingface.co/meta-llama/CodeLlama-13b-Instruct-hf), built for a single RTX 4090 card using tensorrt_llm version 0.15.0.dev2024101500.

It is a 4-bit quantized version based on the main branch of the [TheBloke CodeLlama 13B Instruct - GPTQ](https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GPTQ) model.

## TensorRT commands

To build this model, the following commands were run from the base folder of the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM) (see the installation instructions in that repository for more information):

```shell
python examples/llama/convert_checkpoint.py \
    --model_dir ./CodeLlama-13b-Instruct-hf \
    --output_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
    --dtype float16 \
    --quant_ckpt_path ./CodeLlama-13B-Instruct-GPTQ/model.safetensors \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group
```

And then:

```shell
trtllm-build \
    --checkpoint_dir ./CodeLlama-13b-Instruct-hf_checkpoint \
    --output_dir ./CodeLlama-13B-Instruct-GPTQ_TensorRT \
    --gemm_plugin float16 \
    --max_input_len 8192 \
    --max_seq_len 8192
```

## Prompt template: CodeLlama

```
[INST] <<SYS>>
Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
<</SYS>>

{prompt} [/INST]
```

## How to use this model from Python code

### Using TensorRT-LLM API

#### Install the necessary packages

```shell
pip3 install tensorrt_llm==0.15.0.dev2024101500 -U --pre --extra-index-url https://pypi.nvidia.com
```

Beware that this command should not be run from inside a virtual environment (alternatively, run it twice: once outside the venv and then again from within the venv).

#### Use the TensorRT-LLM API

```python
from tensorrt_llm import LLM, SamplingParams

system_prompt = \
    "[INST] <<SYS>>\n" + \
    "Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:" + \
    "\n<</SYS>>\n\n"

# Put your coding problem in the first string below.
user_prompt = \
    "" + \
    " [/INST] "

prompts = [
    system_prompt + user_prompt,
]

sampling_params = SamplingParams(max_tokens=512, temperature=1.31, top_p=0.14, top_k=49, repetition_penalty=1.17)

llm = LLM(model="./CodeLlama-13B-Instruct-GPTQ_TensorRT")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Using Oobabooga's Text Generation WebUI

Follow the instructions described here: https://github.com/oobabooga/text-generation-webui/pull/5715

Use version 0.15.0.dev2024101500 of tensorrt_llm instead of 0.10.0.
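
### Sanity-checking the built engine

As a quick way to verify the engine produced by the build commands above, the `examples/run.py` script from the same TensorRT-LLM checkout can be pointed at the engine directory. This is a minimal sketch, not part of the original build instructions; it assumes that `examples/run.py` and its `--engine_dir`, `--tokenizer_dir`, `--input_text` and `--max_output_len` options behave the same in your tensorrt_llm version, and it reuses the paths from the build commands:

```shell
# Hypothetical smoke test; paths follow the build commands above.
python examples/run.py \
    --engine_dir ./CodeLlama-13B-Instruct-GPTQ_TensorRT \
    --tokenizer_dir ./CodeLlama-13b-Instruct-hf \
    --max_output_len 200 \
    --input_text "[INST] Write a Python function that reverses a string. [/INST]"
```

If the build succeeded, this should print a short code completion for the prompt.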
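
### Building prompts and extracting code answers

Because the system prompt asks the model to wrap its answer in triple-backtick fences, a small post-processing step can be useful. The helper below is a hypothetical sketch (not part of this repo or of tensorrt_llm): `build_prompt` wraps an arbitrary coding problem in the prompt template documented above, and `extract_code` pulls the first fenced block out of the generated text:

```python
import re

# System instruction from the prompt template documented above.
SYSTEM_INSTRUCTION = (
    "Write code to solve the following coding problem that obeys the constraints "
    "and passes the example test cases. Please wrap your code answer using ```:"
)

def build_prompt(problem: str) -> str:
    """Wrap a coding problem in the CodeLlama instruct template."""
    return f"[INST] <<SYS>>\n{SYSTEM_INSTRUCTION}\n<</SYS>>\n\n{problem} [/INST] "

def extract_code(generated_text: str) -> str:
    """Return the contents of the first ``` fenced block, or the raw text if none is found."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", generated_text, flags=re.DOTALL)
    return match.group(1).strip() if match else generated_text.strip()
```

For example, `prompts = [build_prompt("Write a function that reverses a string.")]` can replace the hand-built prompt in the API example above, and `extract_code(output.outputs[0].text)` yields just the code from each result.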