---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---
## Model Information
Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built on LLaVA's architecture principles, it introduces a novel token compression method that reduces the number of image tokens from 729 to 81 (a 9x reduction), achieving best-in-class efficiency for edge devices while maintaining strong visual understanding capabilities.
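The token count follows from the patch geometry: SigLIP at 384×384 with 14×14 patches produces a 27×27 grid of patch embeddings (729 tokens), and compressing each group of 9 into a single token leaves 81. The 9x factor comes from the numbers above; how the groups are formed is an illustrative assumption in the sketch below.

```python
# Patch-count arithmetic behind the 729 -> 81 image-token compression.
# The 3x3 spatial grouping used here is an assumption for illustration.
image_size, patch_size = 384, 14
grid = image_size // patch_size           # 27 patches per side (SigLIP-400M @ 384)
tokens_in = grid * grid                   # 729 image tokens from the vision encoder
group = 3                                 # assumed 3x3 grouping of neighbouring patches
tokens_out = (grid // group) ** 2         # 81 tokens passed to the language model
print(tokens_in, tokens_out)              # 729 81
```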
Model Architecture: Omni-Vision consists of three key components:
- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
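As a rough sketch of how these components fit together, assuming a 9-to-1 grouping of patch embeddings before the MLP (the grouping scheme and layer sizes here are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

class OmniVisionProjector(nn.Module):
    """Illustrative projector: groups SigLIP patch embeddings 9-to-1 and maps
    them into the Qwen2.5-0.5B-Instruct hidden space (896 dims)."""

    def __init__(self, vision_dim=1152, lm_dim=896, group=9):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_embeds):                    # (batch, 729, vision_dim)
        b, n, d = patch_embeds.shape
        x = patch_embeds.reshape(b, n // self.group, d * self.group)  # (batch, 81, 9*d)
        return self.mlp(x)                              # (batch, 81, lm_dim)

# The 81 projected tokens are concatenated with the text token embeddings and
# fed to the language model for end-to-end visual-language generation.
image_tokens = OmniVisionProjector()(torch.randn(1, 729, 1152))
print(image_tokens.shape)                               # torch.Size([1, 81, 896])
```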
Feedback: Instructions on how to provide feedback or comments on the model can be found in the model README.
## Intended Use Cases
- Visual Question Answering (VQA) and Visual Reasoning: The model answers natural-language questions about an input image and reasons over its visual content (see the prompt sketch after this list).
- Image Captioning: The model extracts visual details, interprets the scene, and generates a concise natural-language description of the image.
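As an illustration of what a VQA request looks like, here is a minimal sketch using a generic multimodal chat-message convention; the exact prompt template Omni-Vision expects is not specified in this card, and the image path and question are placeholders:

```python
# Illustrative VQA-style input in a generic multimodal chat format.
# This is a sketch of the message structure, not Omni-Vision's exact template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.jpg"},                  # placeholder image
            {"type": "text", "text": "How many people are in this photo?"},
        ],
    }
]
# At inference time the image is encoded into 81 compressed tokens that are
# inserted alongside the question text before generation.
```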
## Benchmarks
| Benchmark        | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|------------------|---------------------|-----------|-------------|
| MM-VET           | 27.5                | 23.9      | 49.5        |
| ChartQA (Test)   | 59.2                | NA        | 73.5        |
| MMMU (Test)      | 41.8                | 28.6      | 41.1        |
| MMMU (Eval)      | 39.9                | 30.4      | 41.1        |
| ScienceQA (Eval) | 62.2                | 59.0      | NA          |
| ScienceQA (Test) | 64.5                | 59.0      | NA          |
| POPE             | 89.4                | 84.1      | NA          |
## How to use
This repository contains Omni-Vision in GGUF format for local inference with the Nexa SDK.
### Test in HuggingFace Space

### Run Locally

Install Nexa-SDK, then run:

`nexa run omnivision`
## Training
We developed Omni-Vision through a three-stage training pipeline:
Pretraining: The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
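A minimal sketch of this freezing scheme, assuming the three components are standard PyTorch modules (the function and argument names are illustrative):

```python
import torch.nn as nn

def freeze_for_pretraining(vision_encoder: nn.Module,
                           projector: nn.Module,
                           language_model: nn.Module) -> None:
    """Stage 1: train only the projection layer on image-caption pairs."""
    for module in (vision_encoder, language_model):
        for param in module.parameters():
            param.requires_grad = False   # keep SigLIP and Qwen2.5 frozen
    for param in projector.parameters():
        param.requires_grad = True        # only the MLP projector is updated
```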
Supervised Fine-tuning (SFT): We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model learns to generate more contextually appropriate responses.
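An illustrative shape for a single SFT example; the field names and content are hypothetical, since the card does not specify the dataset schema:

```python
# Hypothetical structure of one SFT training example: a chat history that
# pairs an image with a question and the target assistant response.
sft_example = {
    "image": "chart_001.png",   # placeholder image path
    "conversations": [
        {"role": "user",
         "content": "<image>\nWhat trend does this chart show?"},
        {"role": "assistant",
         "content": "Revenue rises steadily from 2019 to 2023, with the "
                    "largest jump between 2021 and 2022."},
    ],
}
```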
Direct Preference Optimization (DPO): The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that preserve high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targets essential improvements to model outputs without altering the model's core response characteristics.
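As a sketch of how such pairs feed a DPO update: the teacher-corrected output serves as "chosen" and the base-model output as "rejected", following the description above. The loss below is the standard DPO objective, not Nexa's exact training code, and the sequence log-probabilities are dummy values:

```python
import torch
import torch.nn.functional as F

# One preference pair: base-model response vs. minimally edited teacher correction.
pair = {
    "prompt":   "<image>\nWhat is the animal in the picture doing?",
    "rejected": "The cat is sleeping on a red sofa.",       # base-model output
    "chosen":   "The cat is sleeping on a blue armchair.",  # teacher-corrected output
}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: push the policy to prefer 'chosen' over 'rejected'
    relative to a frozen reference (pre-DPO) model."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

# Example with dummy sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```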