# Update README.md
Omnivision is a compact, sub-billion-parameter (968M) multimodal model for processing both visual and text inputs.

Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.

**Example Demo:**

Omnivision-generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/PTG3_n_p7_atBHCwRLOEE.png" alt="Example" style="width:700px;"/>

## Benchmarks
The figure below shows how Omnivision performs against nanoLLAVA, previously the world's smallest vision-language model; Omnivision outperforms it on all tasks.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>

We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate the performance of Omnivision.

| Benchmark | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ChartQA (Test) | 59.2 | NA | 73.5 |

## How to Use On Device

In the following, we demonstrate how to run Omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

**Step 2: Run Omnivision in your terminal**

```
nexa run omnivision
```
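If you want to script the same flow, the CLI can be launched from Python as well. The sketch below only assumes `nexa` is on your PATH; the wrapper function and error handling are illustrative and not part of Nexa-SDK.

```python
import subprocess

def launch_omnivision() -> None:
    """Start the documented `nexa run omnivision` CLI from Python."""
    try:
        # `nexa run omnivision` is the command shown above; everything else
        # here is plain Python, not Nexa-SDK API.
        subprocess.run(["nexa", "run", "omnivision"], check=True)
    except FileNotFoundError:
        print("Nexa-SDK CLI not found - complete Step 1 first.")
    except subprocess.CalledProcessError as err:
        print(f"nexa exited with status {err.returncode}")

if __name__ == "__main__":
    launch_omnivision()
```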
## Model Architecture

Omnivision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: aligns the vision encoder's image embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then aligned with the language model's token space by the projection layer and processed together with the text inputs.
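
To make this data flow concrete, here is a minimal, illustrative PyTorch sketch of the projection step. It is not the released implementation: the hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B-Instruct), the 729-token patch count, and the two-layer MLP are assumptions for illustration.

```python
import torch
import torch.nn as nn

VISION_DIM = 1152  # assumed SigLIP-400M embedding width
LM_DIM = 896       # assumed Qwen2.5-0.5B-Instruct hidden size

class Projector(nn.Module):
    """Maps image-patch embeddings into the language model's token space."""
    def __init__(self, vision_dim: int, lm_dim: int) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_embeddings)

# Stand-ins for the real encoders: one image's patch embeddings and an
# already-embedded text prompt.
image_embeddings = torch.randn(1, 729, VISION_DIM)  # assumed patch count
text_embeddings = torch.randn(1, 16, LM_DIM)

projector = Projector(VISION_DIM, LM_DIM)
image_tokens = projector(image_embeddings)  # shape: (1, 729, LM_DIM)

# The language model then attends over image tokens followed by text tokens.
lm_input = torch.cat([image_tokens, text_embeddings], dim=1)
print(lm_input.shape)  # torch.Size([1, 745, 896])
```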
## Training

We developed Omnivision through a three-stage training pipeline:

**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
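
As a rough sketch of this stage's setup (not the actual training code), the snippet below freezes the vision encoder and language model and leaves only the projection layer trainable; the optimizer and learning rate are assumptions, and the module arguments reuse the illustrative `Projector` from the architecture sketch.

```python
import torch

def pretraining_optimizer(vision_encoder, language_model, projector):
    """Freeze everything except the projection layer, as described above."""
    for module in (vision_encoder, language_model):
        for param in module.parameters():
            param.requires_grad = False
    # Only the projector's parameters are handed to the optimizer.
    trainable = [p for p in projector.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-3)  # assumed optimizer and LR

# Example: optimizer = pretraining_optimizer(siglip_encoder, qwen_lm, projector)
```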
**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images using the base model.
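
For reference, here is a minimal sketch of the standard DPO objective over chosen/rejected response log-probabilities. This is the generic formulation, not Omnivision's training code; `beta` and the toy inputs are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on summed log-probs of chosen/rejected responses."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with fake per-response log-probabilities for a batch of 4 pairs.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```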
## What's next?
We are continually improving Omnivision for better on-device performance. Stay tuned.

### Follow us

[Blogs](https://nexa.ai) | [Discord](https://discord.gg/nexa-ai) | [X (Twitter)](https://x.com/alanzhuly)