---
title: README
emoji: 🏢
colorFrom: blue
colorTo: gray
sdk: gradio
pinned: false
---

# MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

[📖 Project] [📄 Paper] [💻 Code] [📝 Dataset] [🤖 Evaluation Model] [🏆 Leaderboard]

We introduce **MMIE**, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in LVLMs. With **20K+ examples** spanning **12 fields** and **102 subfields**, **MMIE** sets a new standard for testing the depth of multimodal understanding.
### 🔑 Key Features:

- **🗂 Comprehensive Dataset:** With **20,103 interleaved multimodal questions**, MMIE provides a rich foundation for evaluating models across diverse domains.
- **🔍 Ground Truth Reference:** Each query includes a reliable reference, ensuring model outputs are measured accurately.
- **⚙ Automated Scoring with MMIE-Score:** Our scoring model achieves **high human-score correlation**, surpassing previous GPT-4o-based metrics for multimodal tasks.
- **🔎 Bias Mitigation:** Fine-tuned for fair assessment, enabling more objective model evaluations.

---

### 🔍 Key Insights:

1. **🧠 In-depth Evaluation:** Covers **12 major fields** (mathematics, coding, literature, and more) with **102 subfields** for a comprehensive test across competencies.
2. **📈 Challenging the Best:** Even top models such as **GPT-4o + SDXL** peak at **65.47%**, highlighting substantial room for growth in LVLMs.
3. **🌐 Designed for Interleaved Tasks:** The benchmark supports evaluation of both text and image comprehension, in both **multiple-choice and open-ended** formats.

---

### 🔧 Dataset Details
MMIE is curated to evaluate models' comprehensive abilities in interleaved multimodal comprehension and generation. The dataset features diverse examples, categorized and distributed across the fields listed above, ensuring balanced coverage of interleaved input/output tasks and supporting accurate, detailed model evaluation.
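
A minimal sketch of loading and inspecting the benchmark with the 🤗 `datasets` library is shown below. The repository id `MMIE/MMIE`, the split name, and the example fields mentioned in the comments are assumptions; check the Dataset link above for the exact identifiers.

```python
# Minimal sketch: load MMIE from the Hugging Face Hub and inspect one example.
# Repo id and split name are assumptions -- see the Dataset link for the exact values.
from datasets import load_dataset

dataset = load_dataset("MMIE/MMIE", split="test")  # assumed repo id / split

print(len(dataset))       # number of interleaved questions in this split
print(dataset[0].keys())  # fields of a single example (e.g. question, images, reference)
```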