Update README.md
## Introduction

Omnivision is a compact, sub-billion-parameter (968M) multimodal model for processing both visual and text inputs, optimized for edge devices. Built on LLaVA's architecture, it features:

- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
- **Minimal-Edit DPO**: Enhances response quality with minimal edits, preserving core model behavior.

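As a quick sanity check on the token-reduction claim above, here is a small arithmetic sketch. The README states only the 729 → 81 counts; the 27×27 patch grid below is our assumption about where the 729 base tokens come from:

```python
# Token counts stated in the README: 729 image tokens before compression, 81 after.
# The 27x27 grid is an assumption about the source of the 729 tokens.
base_tokens = 27 * 27          # assumed 27x27 vision-patch grid -> 729 tokens
compressed_tokens = 81         # token count after compression (from the README)
reduction_factor = base_tokens // compressed_tokens
print(base_tokens, compressed_tokens, reduction_factor)  # 729 81 9
```

The 9x factor in the bullet above is exactly this ratio of pre- to post-compression token counts.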
**Quick Links:**

1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
2. [Quickstart for local setup](#how-to-use-on-device)
3. Learn more in our [Blogs](https://nexa.ai)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)
## Intended Use Cases

Omnivision is designed for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.

**Example Demo:**
Omni-Vision generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro
## Benchmarks

Below we demonstrate a figure to show how Omnivision performs against nanoLLaVA. Across all tasks, Omnivision outperforms the previous world's smallest vision-language model.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>

We have conducted a series of experiments on benchmark datasets, including MM-VET and POPE.

| POPE | 89.4 | 84.1 | NA |
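To make the POPE row above concrete, a quick margin calculation with the values taken directly from the table (we assume the second column is the nanoLLaVA baseline):

```python
# POPE accuracy values from the table above; column attribution is assumed.
omnivision_pope = 89.4
nanollava_pope = 84.1
margin = round(omnivision_pope - nanollava_pope, 1)
print(margin)  # 5.3
```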
## How to Use On Device

In the following, we demonstrate how to run Omnivision locally on your device.
**Step 1: Install Nexa-SDK (local on-device inference framework)**
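As an illustration only, the snippet below assembles a CLI invocation for the SDK from Python. The `nexa run omnivision` command form is our assumption, not confirmed by this README; consult the Nexa-SDK documentation for the exact install and run commands:

```python
# Hypothetical helper that assembles a Nexa-SDK CLI invocation.
# The `nexa run <model>` command form is an assumption.
import subprocess

def omnivision_command(model: str = "omnivision") -> list:
    """Build the argv list for launching a model via the `nexa` CLI."""
    return ["nexa", "run", model]

# To actually launch it (requires Nexa-SDK installed):
# subprocess.run(omnivision_command(), check=True)
```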

We enhance the model's contextual understanding using image-based question-answering.

**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. Fine-tuning then targets essential improvements to model output without altering the model's core response characteristics.
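The chosen/rejected pairing described above can be sketched as follows; the field names (`prompt`, `base_response`, `teacher_edit`) are hypothetical stand-ins for whatever format the actual pipeline uses:

```python
# Minimal sketch of assembling DPO preference pairs: the base model's
# response is "rejected", the teacher's minimally edited correction is
# "chosen". All field names are hypothetical.
def build_dpo_pairs(samples):
    pairs = []
    for s in samples:
        pairs.append({
            "prompt": s["prompt"],
            "chosen": s["teacher_edit"],     # minimally edited correction
            "rejected": s["base_response"],  # original base-model output
        })
    return pairs

example = [{"prompt": "Describe the image.",
            "base_response": "A red car parked on grass.",
            "teacher_edit": "A red car parked on a gravel lot."}]
print(build_dpo_pairs(example)[0]["chosen"])  # A red car parked on a gravel lot.
```

Because the edits are minimal, each pair isolates the accuracy-critical difference, which is what keeps the fine-tuning from shifting the model's overall response style.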

## What's next?

We are continually improving Omnivision for better on-device performance. Stay tuned.

### Learn more in our blogs
[Blogs](https://nexa.ai)