---
title: README
emoji: 🏢
colorFrom: blue
colorTo: gray
sdk: gradio
pinned: false
---

# MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

[📖 Project] [📄 Paper] [💻 Code] [📝 Dataset] [🤖 Evaluation Model] [🏆 Leaderboard]

We introduce **MMIE**, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in LVLMs. With **20K+ examples** covering **12 fields** and **102 subfields**, **MMIE** sets a new standard for testing the depth of multimodal understanding.
### 🔑 Key Features

- **🗂 Comprehensive Dataset:** With **20,103 interleaved multimodal questions**, MMIE provides a rich foundation for evaluating models across diverse domains.
- **🔍 Ground Truth Reference:** Each query includes a reliable reference, ensuring model outputs are measured accurately.
- **⚙ Automated Scoring with MMIE-Score:** Our scoring model achieves **high human-score correlation**, surpassing previous metrics such as GPT-4o-based scoring for multimodal tasks.
- **🔎 Bias Mitigation:** Fine-tuned for fair assessments, enabling more objective model evaluations.

---

### 🔍 Key Insights

1. **🧠 In-depth Evaluation:** Covers **12 major fields** (mathematics, coding, literature, and more) with **102 subfields** for a comprehensive test across competencies.
2. **📈 Challenging the Best:** Even top models such as **GPT-4o + SDXL** peak at 65.47%, highlighting room for growth in LVLMs.
3. **🌐 Designed for Interleaved Tasks:** The benchmark supports evaluation across both text and image comprehension, in both **multiple-choice and open-ended** formats (see the loading sketch below).
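As a quick orientation, the sketch below shows one way to pull the benchmark with the Hugging Face `datasets` library and inspect an interleaved example. The repository id `MMIE/MMIE`, the split name, and the field names (`question`, `images`, `answer`) are assumptions made for illustration; consult the dataset card linked above for the actual schema.

```python
# Minimal sketch: loading MMIE and peeking at one interleaved example.
# NOTE: the repo id "MMIE/MMIE", the split name, and the field names below
# are assumptions for illustration -- check the dataset card for the real schema.
from datasets import load_dataset

dataset = load_dataset("MMIE/MMIE", split="test")  # hypothetical repo id / split

example = dataset[0]
# An interleaved query mixes text with image references; the ground-truth
# reference is what MMIE-Score (or another judge) grades model outputs against.
print(example.get("question"))  # interleaved question text
print(example.get("images"))    # accompanying image(s), if any
print(example.get("answer"))    # reference answer used for scoring
```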