---
license: apache-2.0
base_model:
- mistral-community/pixtral-12b
pipeline_tag: image-text-to-text
library_name: transformers
---

Converted from [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b) using BitsAndBytes with NF4 (4-bit) quantization, without double quantization (a sketch of the conversion settings is at the end of this card). Requires `bitsandbytes` to load.

Example usage for image captioning:

```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import time

# Load the pre-quantized model (the NF4 quantization config is stored in the checkpoint)
model_id = "SeanScripts/pixtral-12b-nf4"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    use_safetensors=True,
    device_map="cuda:0"
)

# Load the processor (tokenizer + image processor)
processor = AutoProcessor.from_pretrained(model_id)

# Caption a local image
IMAGES = [Image.open("test.png").convert("RGB")]
PROMPT = "[INST]Caption this image:\n[IMG][/INST]"

inputs = processor(images=IMAGES, text=PROMPT, return_tensors="pt").to("cuda")
prompt_tokens = len(inputs["input_ids"][0])
print(f"Prompt tokens: {prompt_tokens}")

# Generate and time the caption
t0 = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=512)
t1 = time.time()

total_time = t1 - t0
generated_tokens = len(generate_ids[0]) - prompt_tokens
time_per_token = generated_tokens / total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({time_per_token:.3f} tok/s)")

output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```

On a 4090, this generates roughly 10-12 tok/s without flash attention (see the note below) and uses about 10 GB of VRAM. The captions look good, though I haven't tested very many.

You can get a set of ComfyUI custom nodes for running this model here: [https://github.com/SeanScripts/ComfyUI-PixtralLlamaVision](https://github.com/SeanScripts/ComfyUI-PixtralLlamaVision)
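If you have `flash-attn` installed, you may be able to speed up generation by requesting flash attention when loading the model. This is a minimal, untested sketch, assuming a `transformers` version that supports the `attn_implementation` argument:

```python
from transformers import LlavaForConditionalGeneration

# Assumes flash-attn 2 is installed; omit attn_implementation (or use "sdpa") if not.
model = LlavaForConditionalGeneration.from_pretrained(
    "SeanScripts/pixtral-12b-nf4",
    use_safetensors=True,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
```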
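For reference, a conversion like this can be done with a standard `BitsAndBytesConfig` (NF4, no double quantization). The sketch below is an approximation of those settings rather than the exact script used: the compute dtype and output path are assumptions, and saving 4-bit weights requires a recent `transformers`/`bitsandbytes`.

```python
import torch
from transformers import LlavaForConditionalGeneration, BitsAndBytesConfig

# NF4 4-bit quantization without double quantization, matching this repo's description
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption; not stated on this card
)

# Quantize the original weights on load
model = LlavaForConditionalGeneration.from_pretrained(
    "mistral-community/pixtral-12b",
    quantization_config=bnb_config,
    device_map="cuda:0",
)

# Serialize the quantized weights (4-bit bitsandbytes serialization is supported in recent versions)
model.save_pretrained("pixtral-12b-nf4")
```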