---
title: README
emoji: 🏢
colorFrom: blue
colorTo: gray
sdk: gradio
pinned: false
---
We introduce **MMIE**, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in LVLMs. With **20K+ examples** spanning **12 fields** and **102 subfields**, **MMIE** sets a new standard for probing the depth of multimodal understanding.
### 🔑 Key Features:
- **🗂 Comprehensive Dataset:** With **20,103 interleaved multimodal questions**, MMIE provides a rich foundation for evaluating models across diverse domains (see the loading sketch after this list).
- **🔍 Ground Truth Reference:** Each query includes a reliable reference, ensuring model outputs are measured accurately.
- **⚙ Automated Scoring with MMIE-Score:** Our scoring model achieves **high correlation with human scores**, surpassing GPT-4o-based judging and other prior metrics for multimodal tasks.
- **🔎 Bias Mitigation:** The scoring model is fine-tuned to mitigate scoring bias, enabling more objective model evaluations.
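
Below is a minimal sketch of how the benchmark data might be loaded and inspected with the 🤗 `datasets` library; the repository id `MMIE/MMIE`, the split name, and the fields mentioned in the comments are assumptions for illustration and may not match the released schema.

```python
# Minimal sketch: loading MMIE with the Hugging Face `datasets` library.
# NOTE: the repository id, split, and column names below are assumptions;
# check the dataset card for the actual schema.
from datasets import load_dataset

dataset = load_dataset("MMIE/MMIE", split="test")  # hypothetical repo id / split

print(len(dataset))      # expect ~20K interleaved examples
example = dataset[0]
print(example.keys())    # e.g. question text, interleaved images, reference answer
```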
---
### 🔍 Key Insights:
1. **🧠 In-depth Evaluation**: Covering **12 major fields** (mathematics, coding, literature, and more) with **102 subfields** for a comprehensive test across competencies.
2. **📈 Challenging the Best**: Even the strongest pipeline tested, **GPT-4o + SDXL**, peaks at 65.47%, leaving substantial room for improvement in LVLMs.
3. **🌐 Designed for Interleaved Tasks**: The benchmark evaluates interleaved text and image comprehension in both **multiple-choice and open-ended** formats.