# Update README.md
Omnivision is a compact, sub-billion-parameter (968M) multimodal model for processing both visual and text inputs.

Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.

**Example Demo:**

Omnivision-generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/PTG3_n_p7_atBHCwRLOEE.png" alt="Example" style="width:700px;"/>

## Benchmarks
The figure below shows how Omnivision performs against nanoLLAVA, previously the world's smallest vision-language model; Omnivision outperforms it on all tasks.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>

We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate the performance of Omnivision.

| Benchmark | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ChartQA (Test) | 59.2 | NA | 73.5 |

## How to Use On Device

In the following, we demonstrate how to run Omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

**Step 2: Run Omnivision in your terminal**

```
nexa run omnivision
```
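If you want to script the same flow, the CLI can be launched from Python as well. The sketch below only assumes `nexa` is on your PATH; the wrapper function and error handling are illustrative and not part of Nexa-SDK.

```python
import subprocess

def launch_omnivision() -> None:
    """Start the documented `nexa run omnivision` CLI from Python."""
    try:
        # `nexa run omnivision` is the command shown above; everything else
        # here is plain Python, not Nexa-SDK API.
        subprocess.run(["nexa", "run", "omnivision"], check=True)
    except FileNotFoundError:
        print("Nexa-SDK CLI not found - complete Step 1 first.")
    except subprocess.CalledProcessError as err:
        print(f"nexa exited with status {err.returncode}")

if __name__ == "__main__":
    launch_omnivision()
```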
## Model Architecture

Omnivision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: aligns the vision encoder's image embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then aligned with the language model's token space by the projection layer and processed together with the text inputs.
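
To make this data flow concrete, here is a minimal, illustrative PyTorch sketch of the projection step. It is not the released implementation: the hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B-Instruct), the 729-token patch count, and the two-layer MLP are assumptions for illustration.

```python
import torch
import torch.nn as nn

VISION_DIM = 1152  # assumed SigLIP-400M embedding width
LM_DIM = 896       # assumed Qwen2.5-0.5B-Instruct hidden size

class Projector(nn.Module):
    """Maps image-patch embeddings into the language model's token space."""
    def __init__(self, vision_dim: int, lm_dim: int) -> None:
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_embeddings)

# Stand-ins for the real encoders: one image's patch embeddings and an
# already-embedded text prompt.
image_embeddings = torch.randn(1, 729, VISION_DIM)  # assumed patch count
text_embeddings = torch.randn(1, 16, LM_DIM)

projector = Projector(VISION_DIM, LM_DIM)
image_tokens = projector(image_embeddings)  # shape: (1, 729, LM_DIM)

# The language model then attends over image tokens followed by text tokens.
lm_input = torch.cat([image_tokens, text_embeddings], dim=1)
print(lm_input.shape)  # torch.Size([1, 745, 896])
```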
## Training

We developed Omnivision through a three-stage training pipeline:

**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
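
As a rough sketch of this stage's setup (not the actual training code), the snippet below freezes the vision encoder and language model and leaves only the projection layer trainable; the optimizer and learning rate are assumptions, and the module arguments reuse the illustrative `Projector` from the architecture sketch.

```python
import torch

def pretraining_optimizer(vision_encoder, language_model, projector):
    """Freeze everything except the projection layer, as described above."""
    for module in (vision_encoder, language_model):
        for param in module.parameters():
            param.requires_grad = False
    # Only the projector's parameters are handed to the optimizer.
    trainable = [p for p in projector.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-3)  # assumed optimizer and LR

# Example: optimizer = pretraining_optimizer(siglip_encoder, qwen_lm, projector)
```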
**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images using the base model.
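
For reference, here is a minimal sketch of the standard DPO objective over chosen/rejected response log-probabilities. This is the generic formulation, not Omnivision's training code; `beta` and the toy inputs are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on summed log-probs of chosen/rejected responses."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with fake per-response log-probabilities for a batch of 4 pairs.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```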
## What's next?
We are continually improving Omnivision for better on-device performance. Stay tuned.

### Follow us

[Blogs](https://nexa.ai) | [Discord](https://discord.gg/nexa-ai) | [X (Twitter)](https://x.com/alanzhuly)