alanzhuly committed
Commit
d658b03
1 Parent(s): 52e9f27

Update README.md

Files changed (1)
1. README.md +9 -11
README.md CHANGED
@@ -26,20 +26,20 @@ Omnivision is a compact, sub-billion (968M) multimodal model for processing both
  Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.
 
  **Example Demo:**
- Omni-Vision generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro
+ Omnivision-generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro
 
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/PTG3_n_p7_atBHCwRLOEE.png" alt="Example" style="width:700px;"/>
 
 
  ## Benchmarks
 
- Below we demonstrate a figure to show how Omnivision performs against nanollava. In all the tasks, omnivision outperforms the previous world's smallest vision-language model.
+ The figure below shows how Omnivision performs against nanoLLAVA; on every task, Omnivision outperforms the previous world's smallest vision-language model.
 
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>
 
- We have conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, POPE to evaluate the performance of omnivision.
+ We evaluated Omnivision on a series of benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE.
 
- | Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
+ | Benchmark | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B |
  |-------------------|----------------------|-----------|-------------|
  | MM-VET | 27.5 | 23.9 | 49.5 |
  | ChartQA (Test) | 59.2 | NA | 73.5 |
@@ -51,7 +51,7 @@ We have conducted a series of experiments on benchmark datasets, including MM-VE
 
 
  ## How to Use On Device
- In the following, we demonstrate how to run omnivision locally on your device.
+ The following steps show how to run Omnivision locally on your device.
 
  **Step 1: Install Nexa-SDK (local on-device inference framework)**
 
@@ -66,7 +66,7 @@ nexa run omnivision
  ```
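For scripted use, here is a minimal sketch that simply shells out to the documented command from Python. It assumes only that the `nexa` CLI from Step 1 is installed and on your PATH; no extra flags are shown because none are documented above.

```python
import shutil
import subprocess

# Minimal sketch: drive the documented `nexa run omnivision` command from
# Python. Assumes the Nexa-SDK CLI (Step 1) is installed and on PATH.
if shutil.which("nexa") is None:
    raise RuntimeError("`nexa` CLI not found; install Nexa-SDK first (Step 1)")

# Launches the interactive Omnivision session.
subprocess.run(["nexa", "run", "omnivision"], check=True)
```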
 
  ## Model Architecture
- Omni-Vision's architecture consists of three key components:
+ Omnivision's architecture consists of three key components:
 
  - Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
  - Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
@@ -76,7 +76,7 @@ The vision encoder first transforms input images into embeddings, which are then
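To make that encoder → projection → language-model handoff concrete, here is a minimal PyTorch sketch. The `Projector` module, its two-layer MLP shape, and the hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative hidden sizes (assumptions, not the model's actual config):
VISION_DIM, LLM_DIM = 1152, 896  # SigLIP-400M width / Qwen2.5-0.5B width

class Projector(nn.Module):
    """Hypothetical projection layer: maps patch embeddings from the vision
    encoder's space into the language model's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        return self.mlp(patch_embeds)

# A 384x384 input with 14x14 patches yields (384 // 14) ** 2 = 729 patches.
patches = torch.randn(1, (384 // 14) ** 2, VISION_DIM)  # stand-in encoder output
image_tokens = Projector(VISION_DIM, LLM_DIM)(patches)  # fed to the LLM with text tokens
print(image_tokens.shape)  # torch.Size([1, 729, 896])
```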
 
  ## Training
 
- We developed Omni-Vision through a three-stage training pipeline:
+ We developed Omnivision through a three-stage training pipeline:
 
  **Pretraining:**
  The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
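A rough sketch of how this stage-wise unfreezing, plus the DPO objective of the final stage, can be wired up. The submodule names (`model.projector`, etc.) and `beta` value are assumptions for illustration, and the loss shown is the standard DPO formulation rather than Omnivision's exact recipe.

```python
import torch.nn.functional as F

def set_stage(model, stage: str) -> None:
    """Freeze everything, then unfreeze only what the stage trains.
    Assumes the model exposes a `projector` submodule (hypothetical layout)."""
    for p in model.parameters():
        p.requires_grad = False
    params = (model.projector.parameters() if stage == "pretraining"
              else model.parameters())  # later stages unfreeze more
    for p in params:
        p.requires_grad = True

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective over summed token log-probs of each response:
    -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```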
@@ -90,7 +90,5 @@ The final stage implements DPO by first generating responses to images using the
  ## What's next?
  We are continually improving Omnivision for better on-device performance. Stay tuned.
 
- ### Learn more in our blogs
- [Blogs](https://nexa.ai)
- ### Join Discord Community
- [Discord](https://discord.gg/nexa-ai)
+ ### Follow us
+ [Blogs](https://nexa.ai) | [Discord](https://discord.gg/nexa-ai) | [X (Twitter)](https://x.com/alanzhuly)