MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Overview
We present MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark designed specifically for Large Vision-Language Models (LVLMs). MMIE includes an automated evaluation metric, powered by a fine-tuned InternVL2-4B, that assesses interleaved comprehension and generation capabilities across diverse fields.
The metric offers a reliable, streamlined way to score LVLMs on multimodal reasoning tasks. It is tailored to interleaved inputs and outputs and is designed to give consistent, low-bias evaluation results.
Key Features of the MMIE Evaluation Metric:
- Automated Scoring System: A fine-tuned InternVL2-4B serves as the scoring model, chosen for its strong performance and its support for multi-image input.
- Bias Mitigation: The model is fine-tuned to minimize biases and provide fair, objective scoring across all models tested.
- Multimodal Focus: Tailored to handle interleaved multimodal inputs and outputs, ensuring models are judged on their ability to integrate and reason with both text and images.
- Human-like Evaluation: The metric correlates highly with human annotations, surpassing alternative automated judges such as GPT-4o, especially on nuanced multimodal tasks.
- Scalable and Consistent: The metric is built to handle large-scale datasets and yields consistent, reproducible scores, which makes it well suited to model benchmarking and comparison.
Metric Details
Pipeline
To support a comprehensive and unbiased evaluation of various LVLMs, we propose an automated evaluation metric powered by InternVL2-4B. This model was selected for its strong performance in multimodal reasoning tasks and its support for multi-image inputs. We further fine-tuned it to mitigate potential biases and to provide accurate, consistent scoring.
The evaluation pipeline leverages the internally fine-tuned LVLM to assess models based on key dimensions such as text quality, image quality, text-image coherence, and stylistic consistency. This ensures models are rigorously tested on their multimodal reasoning capabilities.
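To make "interleaved" input concrete, here is a minimal sketch of how a mixed text-and-image response might be flattened into a single prompt for a multi-image scoring model. The `Segment` class, the `render_prompt` helper, and the `Image-k: <image>` placeholder format are illustrative assumptions, not the MMIE implementation; consult the repository for the actual input format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One piece of an interleaved response (hypothetical structure)."""
    kind: str     # either "text" or "image"
    content: str  # the text itself, or a path to the image file


def render_prompt(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Flatten interleaved segments into prompt text plus an ordered image list.

    Each image is replaced by a numbered placeholder so the scoring model
    can relate the surrounding text to a specific image. The placeholder
    format is an assumption made for this sketch.
    """
    parts: List[str] = []
    image_paths: List[str] = []
    for seg in segments:
        if seg.kind == "text":
            parts.append(seg.content)
        elif seg.kind == "image":
            image_paths.append(seg.content)
            parts.append(f"Image-{len(image_paths)}: <image>")
        else:
            raise ValueError(f"unknown segment kind: {seg.kind!r}")
    return "\n".join(parts), image_paths


# Illustrative example with made-up content.
response = [
    Segment("text", "Step 1: whisk the eggs."),
    Segment("image", "step1.png"),
    Segment("text", "Step 2: pour into the pan."),
    Segment("image", "step2.png"),
]
prompt, images = render_prompt(response)
```

Keeping the images in a separate ordered list preserves their position relative to the text, which is what dimensions such as text-image coherence depend on.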
Results
Note: In the results figure, higher values indicate better performance for Pearson and Cosine Similarity, while lower values are better for MSE and MAE.
The MMIE evaluation metric achieves the highest correlation with human annotations across all evaluated aspects of multimodal comprehension and generation. It consistently outperforms GPT-4o and other standard evaluation metrics, which supports its use for large-scale model benchmarking.
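For readers who want to run the same kind of comparison on their own data, the four agreement measures named in the note can be computed as in the sketch below. The function names and the example scores are illustrative assumptions; they are not the authors' evaluation script or data.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (higher is better)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between the two score vectors (higher is better)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def mse(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean squared error (lower is better)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mae(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean absolute error (lower is better)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def agreement(metric_scores: Sequence[float],
              human_scores: Sequence[float]) -> Dict[str, float]:
    """Compare automated scores against human annotations on the same samples."""
    if len(metric_scores) != len(human_scores) or not metric_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    return {
        "pearson": pearson(metric_scores, human_scores),
        "cosine": cosine_similarity(metric_scores, human_scores),
        "mse": mse(metric_scores, human_scores),
        "mae": mae(metric_scores, human_scores),
    }


# Made-up scores for illustration only.
human = [1.0, 2.0, 3.0, 4.0]
automated = [2.0, 3.0, 4.0, 5.0]
report = agreement(automated, human)
```

Pearson and cosine similarity capture whether the automated scores rank and scale samples the way humans do, while MSE and MAE capture absolute deviation. In the made-up example, a judge that is offset by a constant has a Pearson correlation of 1 but nonzero MSE and MAE, which is why reporting both kinds of measure is informative.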
Installation
To use our benchmark and evaluation metric, please refer to our GitHub repository.
Citation
If you find our benchmark useful in your research, please consider citing us:
@article{xia2024mmie,
title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
journal={arXiv preprint arXiv:2410.10139},
year={2024}
}
Model tree for MMIE/MMIE-Score
Base model
OpenGVLab/InternVL2-4B