---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Pretrain
- lmms-lab/LLaVA-NeXT-Data
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Model

We used [**MLCD**](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) as the vision encoder in [LLaVA-NeXT](https://huggingface.co/lmms-lab/llava-next-qwen-32b).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)

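If you only want to probe the vision side, the sketch below shows one way to load the MLCD vision tower on its own and extract patch features with 🤗 Transformers. This is a minimal sketch rather than part of our training or evaluation code: the `AutoModel`/`AutoImageProcessor` loaders and the local `example.jpg` path are assumptions, so please check the [MLCD model card](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) for the exact interface.

```python
# Minimal sketch: extract image features with the MLCD vision tower alone.
# Assumes a transformers release that can resolve this checkpoint through
# AutoModel/AutoImageProcessor; "example.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

vision_tower = "DeepGlint-AI/mlcd-vit-large-patch14-336"
processor = AutoImageProcessor.from_pretrained(vision_tower)
model = AutoModel.from_pretrained(vision_tower)
model.eval()

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features, shape (batch, tokens, hidden); in the full model these
# are passed through the LLaVA-NeXT projector into the language model.
print(outputs.last_hidden_state.shape)
```
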
## Data

Our model was trained on publicly available data from the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) datasets.

## How to evaluate

```shell
pip install lmms-eval==0.2.0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m accelerate.commands.launch \
    --main_process_port=12581 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
    --tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd_llava_qwen2_7b \
    --output_path ./log
```

## Performance and Limitations

To demonstrate how MLCD performs inside a multimodal large language model (MLLM), we replaced the CLIP vision encoder in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with MLCD and used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. The resulting model outperforms the CLIP baseline on most of the benchmarks below, supporting the effectiveness of MLCD as a vision encoder for MLLMs.

| Vision Tower     | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:-----------------|:----------------------|:----------------------|
| LLM              | Qwen2.5-7B            | Qwen2.5-7B            |
| AI2D             | **76.98**             | 73.15                 |
| ScienceQA_img    | **78.09**             | 76.35                 |
| GQA              | **64.17**             | 63.31                 |
| InfoVQA_val      | **43.48**             | 38.88                 |
| MMBench_cn_dev   | **74.83**             | 72.51                 |
| MMBench_en_dev   | **76.37**             | 74.57                 |
| MME (cognition)  | **432**               | 384                   |
| MME (perception) | **1598**              | 1512                  |
| SeedBench        | **68.20**             | 66.80                 |
| SeedBench_img    | **73.75**             | 72.72                 |
| MMStar           | **50.98**             | 48.98                 |
| MMMU             | **44.30**             | 44.20                 |
| OCRBench         | **531.00**            | 525.00                |
| ChartQA          | **67.84**             | 66.52                 |
| DocVQA_val       | **76.46**             | 75.21                 |
| POPE             | 88.69                 | **88.83**             |
| TextVQA_val      | 61.69                 | **62.47**             |

### Limitations

Models trained on larger datasets generally perform better across a wider range of tasks. We are currently training such models and will release them soon.

## Acknowledgments

We would like to express our gratitude to [Yumeng Wang](https://huggingface.co/devymex) for his significant contributions to the experimental validation in MLLMs.