Mistral-Small-Instruct CTranslate2 Model

This repository contains a CTranslate2 version of the Mistral-Small-Instruct model. The conversion process involved AWQ quantization followed by CTranslate2 format conversion.

Quantization Parameters

The following AWQ parameters were used: zero_point=true q_group_size=128 w_bit=4 version=gemv
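
For reference, these parameters map onto the AutoAWQ quant_config dictionary roughly as follows (a sketch; the exact casing of the version value may differ between AutoAWQ releases):

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "gemv",
}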

Quantization Process

The quantization was performed using the AutoAWQ library. AutoAWQ supports two quantization approaches (minimal sketches of both appear below):

  1. Without calibration data:

    • Quick process (a few minutes)
    • Uses standard quantization schema
    • Suitable for general use cases
  2. With calibration data:

    • Longer process (3-4 hours on RTX 4090)
    • Preserves full precision for task-specific weights
    • Slightly better performance for targeted tasks
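
As a rough illustration of the first approach, the sketch below quantizes the model with AutoAWQ using its built-in defaults rather than a custom calibration set. The model and output paths are placeholders, and keyword names may vary slightly between AutoAWQ releases.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-Instruct-2409"   # source model (placeholder)
quant_path = "Mistral-Small-Instruct-2409-AWQ"         # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize using AutoAWQ's defaults (no custom calibration dataset)
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)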

Calibration Details

This model was quantized with calibration data. Specifically, the cosmopedia-100k dataset was used, which is well suited to general question answering and instruction following.

Key parameters:

  • max_calib_seq_len: 8192 (enables long-form responses)
  • text_token_length: 2048 (minimum input token length during quantization)

While these parameters don't fundamentally alter the model's architecture, they tailor the quantization to specific input-output length patterns and topic domains.
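
The following sketch shows how the calibration pass for this model might look with AutoAWQ and cosmopedia-100k. The dataset repository id, the filtering step that approximates text_token_length, and the exact keyword names are assumptions rather than a record of the original command.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

model_path = "mistralai/Mistral-Small-Instruct-2409"   # source model (placeholder)
quant_path = "Mistral-Small-Instruct-2409-AWQ"         # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build calibration texts from cosmopedia-100k, keeping only samples of at
# least ~2048 tokens (an approximation of the text_token_length setting above).
dataset = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
calib_data = []
for sample in dataset:
    if len(tokenizer(sample["text"]).input_ids) >= 2048:
        calib_data.append(sample["text"])
    if len(calib_data) >= 128:  # cap the number of calibration samples
        break

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calib_data,
    max_calib_seq_len=8192,  # allows long calibration sequences, per the note above
)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)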

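After the quantized model is saved, the second step from the introduction, conversion to CTranslate2 format, can be sketched as follows. CTranslate2 4.4.0 or newer is required for AWQ models; the directory names are placeholders and the list of tokenizer files to copy is an assumption.

from ctranslate2.converters import TransformersConverter

# Convert the saved AWQ model to CTranslate2 format. The AWQ weights are kept
# as-is, so no quantization argument is passed to convert().
converter = TransformersConverter(
    "Mistral-Small-Instruct-2409-AWQ",  # local AWQ model directory (placeholder)
    copy_files=["tokenizer.json", "tokenizer_config.json"],
)
converter.convert("Mistral-Small-Instruct-2409-ct2-AWQ", force=True)
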
Requirements

torch 2.2.2
ctranslate2 4.4.0

  • NOTE: The soon-to-be-released ctranslate2 4.5.0 will support torch versions newer than 2.2.2. These instructions will be updated when it is released.
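
A typical environment for the sample script below can be set up with the command shown here; the transformers package is additionally required for tokenizer loading, and on Windows you may need the CUDA-enabled torch wheel from pytorch.org:

pip install torch==2.2.2 ctranslate2==4.4.0 transformers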

Sample Script

import os
import sys
import ctranslate2
import gc
import torch
from transformers import AutoTokenizer

system_message = "You are a helpful person who answers questions."
user_message = "Hello, how are you today? I'd like you to write me a funny poem that is a parody of Milton's Paradise Lost if you are familiar with that famous epic poem?"

model_dir = r"D:\Scripts\bench_chat\models\mistralai--Mistral-Small-Instruct-2409-AWQ-ct2-awq" # uses ~13.8 GB


def build_prompt_mistral_small():
    prompt = f"""<s>
[INST] {system_message}

{user_message}[/INST]"""
    
    return prompt


def main():
    model_name = os.path.basename(model_dir)

    print(f"\033[32mLoading the model: {model_name}...\033[0m")
    
    intra_threads = max(os.cpu_count() - 4, 4)

    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",
        # compute_type="int8_bfloat16",  # NOTE: do not set compute_type when using AWQ/CTranslate2 models
        intra_threads=intra_threads
    )
    
    tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)
    
    prompt = build_prompt_mistral_small()
    
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    
    print(f"\nRun 1 (Beam Size: {beam_size}):")
    
    results_batch = generator.generate_batch(
        [tokens],
        include_prompt_in_result=False,
        max_batch_size=4096,
        batch_type="tokens",
        beam_size=beam_size,
        num_hypotheses=1,
        max_length=512,
        sampling_temperature=0.0,
    )

    output = tokenizer.decode(results_batch[0].sequences_ids[0])
    
    print("\nGenerated response:")
    print(output)
    
    del generator
    del tokenizer
    torch.cuda.empty_cache()
    gc.collect()


if __name__ == "__main__":
    main()