---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Pretrain
- lmms-lab/LLaVA-NeXT-Data
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Model

We used [**MLCD**](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) as the vision encoder in [LLaVA-NeXT](https://huggingface.co/lmms-lab/llava-next-qwen-32b).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)

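If you only want to probe the vision side, the sketch below shows one way to load the MLCD vision tower on its own and extract patch features with 🤗 Transformers. This is a minimal sketch rather than part of our training or evaluation code: the `AutoModel`/`AutoImageProcessor` loaders and the local `example.jpg` path are assumptions, so please check the [MLCD model card](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) for the exact interface.

```python
# Minimal sketch: extract image features with the MLCD vision tower alone.
# Assumes a transformers release that can resolve this checkpoint through
# AutoModel/AutoImageProcessor; "example.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

vision_tower = "DeepGlint-AI/mlcd-vit-large-patch14-336"
processor = AutoImageProcessor.from_pretrained(vision_tower)
model = AutoModel.from_pretrained(vision_tower)
model.eval()

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features, shape (batch, tokens, hidden); in the full model these
# are passed through the LLaVA-NeXT projector into the language model.
print(outputs.last_hidden_state.shape)
```
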
## Data

Our model was trained on publicly available data from the [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) datasets.

## How to evaluate

```shell
pip install lmms-eval==0.2.0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m accelerate.commands.launch \
    --main_process_port=12581 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
    --tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd_llava_qwen2_7b \
    --output_path ./log
```

## Performance and Limitations

To demonstrate how MLCD performs inside a multimodal large language model (MLLM), we replaced the CLIP vision encoder in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with MLCD and used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. The resulting model outperforms the CLIP baseline on most of the benchmarks below, supporting the effectiveness of MLCD as a vision encoder for MLLMs.

| Vision Tower     | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:-----------------|:----------------------|:----------------------|
| LLM              | Qwen2.5-7B            | Qwen2.5-7B            |
| AI2D             | **76.98**             | 73.15                 |
| ScienceQA_img    | **78.09**             | 76.35                 |
| GQA              | **64.17**             | 63.31                 |
| InfoVQA_val      | **43.48**             | 38.88                 |
| MMBench_cn_dev   | **74.83**             | 72.51                 |
| MMBench_en_dev   | **76.37**             | 74.57                 |
| MME (cognition)  | **432**               | 384                   |
| MME (perception) | **1598**              | 1512                  |
| SeedBench        | **68.20**             | 66.80                 |
| SeedBench_img    | **73.75**             | 72.72                 |
| MMStar           | **50.98**             | 48.98                 |
| MMMU             | **44.30**             | 44.20                 |
| OCRBench         | **531.00**            | 525.00                |
| ChartQA          | **67.84**             | 66.52                 |
| DocVQA_val       | **76.46**             | 75.21                 |
| POPE             | 88.69                 | **88.83**             |
| TextVQA_val      | 61.69                 | **62.47**             |

### Limitations

Models trained on larger datasets generally perform better across a wider range of tasks. We are currently training such models and will release them soon.

## Acknowledgments

We would like to express our gratitude to [Yumeng Wang](https://huggingface.co/devymex) for his significant contributions to the experimental validation in MLLMs.