---
license: apache-2.0
language:
- zh
- en
metrics:
- bleu
base_model:
- DeepGlint-AI/mlcd-vit-large-patch14-336
---


[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)  

## Performance on RoboVQA and OpenEQA


| Benchmark      | Metric / Category | MLCD <br> Embodied-7B | LLaVA <br> OneVision-7B | GPT-4v | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA        | BLEU1             | <span style="color:red">73.16</span>       | 38.12                   |            -              | 54.9      |
|                | BLEU2             | <span style="color:red">66.39</span>       | 33.56                   |            -              | 44.2      |
|                | BLEU3             | <span style="color:red">60.61</span>       | 31.76                   |            -              | 39.5      |
|                | BLEU4             | <span style="color:red">56.56</span>       | 30.97                   |            -              | 36.3      |
| OpenEQA        | Object State Recognition | <span style="color:red">71.83</span>   |          -               | 63.2   |            -              |
|                | Object Recognition        | <span style="color:red">49.46</span>  |          -               | 43.4   |            -              |
|                | Functional Reasoning      | 54.38                                 |          -               | <span style="color:red">57.4</span> |            -              |
|                | Spatial Understanding     | <span style="color:red">48.64</span>  |          -               | 33.6   |            -              |
|                | Attribute Recognition     | <span style="color:red">67.08</span>  |          -               | 57.2   |            -              |
|                | World Knowledge           | <span style="color:red">53.87</span>  |          -               | 50.7   |            -              |
|                | Object Localization       | <span style="color:red">43.06</span>  |          -               | 42.0   |            -              |

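The BLEU1 to BLEU4 rows above are n-gram overlap scores. As a rough illustration of how such numbers are typically derived, the sketch below computes cumulative sentence-level BLEU-n on a 0 to 100 scale. It assumes whitespace tokenization, a single reference, uniform n-gram weights, and no smoothing; the function names `ngrams` and `bleu` are hypothetical, and the actual RoboVQA evaluation script may differ (for example in tokenization, smoothing, or corpus-level aggregation).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Cumulative BLEU-max_n for one candidate/reference pair (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precision_sum += math.log(overlap / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * bp * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    prediction = "put the cup on table"      # made-up example answer
    ground_truth = "put the cup on the table"  # made-up example reference
    for n in range(1, 5):
        print(f"BLEU{n}: {bleu(prediction, ground_truth, n):.2f}")
```

Because each higher-order score multiplies in an additional n-gram precision, BLEU-n generally decreases as n grows, which matches the downward trend from BLEU1 to BLEU4 in every column of the table.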

## General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4v, and GPT-4o

| Dataset     | Split   | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4v   | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D        | test    | 79.9             | 81.4               | 78.2     | 94.2   |
| ChartQA     | test    | 83.0             | 80.0               | 78.5     | 85.7   |
| DocVQA      | test    | 91.6             | 87.5               | 88.4     | 92.8   |
| InfoVQA     | val     | 73.9             | 70.7               | -        | -      |
| InfoVQA     | test    | 70.0             | 68.8               | -        | -      |
| MMMU        | val     | 47.3             | 48.8               | 56.8     | 69.1   |
| MMStar      | test    | 58.5             | 61.7               | 57.1     | 63.9   |
| OCRBench    | -       | 749.0            | 697.0              | 656.0    | 805.0  |
| RealWorldQA | test    | 68.9             | 66.3               | 61.4     | 58.6   |
| SeedBench   | image   | 74.9             | 75.4               | 49.9     | 76.2   |
| MMBench     | en-dev  | 81.1             | 83.2               | 81.3     | 83.4   |
| MMBench     | en-test | 80.1             | 80.8               | 75.0     | -      |
| MME         | test    | 578/1603         | 418/1580           | 517/1409 | -      |


We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), and [Yin Xie](https://huggingface.co/Yin-Xie) for their significant contributions to the experimental validation in MLLMs.