---
license: apache-2.0
language:
- zh
- en
metrics:
- bleu
base_model:
- DeepGlint-AI/mlcd-vit-large-patch14-336
---
[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
## Performance in RoboVQA and OpenEQA
| Benchmark | Metric / Category | MLCD Embodied-7B | LLaVA OneVision-7B | GPT-4v | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA | BLEU1 | 73.16 | 38.12 | - | 54.9 |
| | BLEU2 | 66.39 | 33.56 | - | 44.2 |
| | BLEU3 | 60.61 | 31.76 | - | 39.5 |
| | BLEU4 | 56.56 | 30.97 | - | 36.3 |
| OpenEQA | Object State Recognition | 71.83 | - | 63.2 | - |
| | Object Recognition | 49.46 | - | 43.4 | - |
| | Functional Reasoning | 54.38 | - | 57.4 | - |
| | Spatial Understanding | 48.64 | - | 33.6 | - |
| | Attribute Recognition | 67.08 | - | 57.2 | - |
| | World Knowledge | 53.87 | - | 50.7 | - |
| | Object Localization | 43.06 | - | 42.0 | - |
## General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4v, and GPT-4o
| Dataset | Split | MLCD Embodied-7B | LLaVA OneVision-7B | GPT-4v | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |

We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), and [Yin Xie](https://huggingface.co/Yin-Xie) for their significant contributions to the experimental validation in MLLMs.