MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Abstract
Interleaved multimodal comprehension and generation, which enables models to produce and interpret both images and text in arbitrary sequences, has become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased and lack reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.
Community
We introduce MMIE, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in LVLMs. With 20K+ examples covering 12 fields and 102 subfields, MMIE sets a new standard for testing the depth of multimodal understanding.
- Comprehensive Dataset: With 20,103 interleaved multimodal questions, MMIE provides a rich foundation for evaluating models across diverse domains.
- Ground Truth Reference: Each query includes a reliable reference, ensuring model outputs are measured accurately.
- Automated Scoring with MMIE-Score: Our scoring model achieves high correlation with human scores, surpassing prior automated metrics such as GPT-4o-based scoring for multimodal tasks (a usage sketch follows the links below).
- Bias Mitigation: Fine-tuned for fair assessment, enabling more objective model evaluations.
- Homepage: https://mmie-bench.github.io/
- Code: https://github.com/Lillianwei-h/MMIE
- MMIE-Score: https://huggingface.co/MMIE/MMIE-Score
- Hugging Face organization: https://huggingface.co/MMIE
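
For readers who want to try the released scoring model, below is a minimal sketch of loading MMIE-Score from the Hugging Face Hub with the `transformers` library. Only the repository ID `MMIE/MMIE-Score` comes from the links above; the model class, `trust_remote_code` assumption, prompt layout, and generation settings are illustrative assumptions, so consult the official code repository for the documented interface and the exact evaluation criteria.

```python
# Minimal sketch: querying the MMIE-Score model for a rubric-style judgment.
# Assumptions (not from the paper): the checkpoint loads as a causal LM with
# trust_remote_code=True and accepts a plain-text evaluation prompt. See the
# official repo (https://github.com/Lillianwei-h/MMIE) for the real interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MMIE/MMIE-Score"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Hypothetical prompt layout: the query, the ground-truth reference, and the
# candidate answer, followed by a request to score against the benchmark's
# evaluation criteria. The real template is defined in the MMIE codebase.
prompt = (
    "You are an evaluator for interleaved multimodal answers.\n"
    "Question: <question text>\n"
    "Reference: <ground-truth reference>\n"
    "Model answer: <model output>\n"
    "Score the answer according to the evaluation criteria and explain briefly."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Print only the newly generated judgment, not the echoed prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```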
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection (2024)
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark (2024)
- OmniBench: Towards The Future of Universal Omni-Language Models (2024)
- MMR: Evaluating Reading Ability of Large Multimodal Models (2024)
- SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models (2024)