We introduce MMIE, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in large vision-language models (LVLMs). With 20K+ examples spanning 12 fields and 102 subfields, MMIE is designed to probe the depth of multimodal understanding.
🔑 Key Features:
- 🗂 Comprehensive Dataset: With 20,103 interleaved multimodal questions, MMIE provides a rich foundation for evaluating models across diverse domains.
- 🔍 Ground Truth Reference: Each query includes a reliable reference, ensuring model outputs are measured accurately.
- ⚙ Automated Scoring with MMIE-Score: Our scoring model correlates strongly with human scores, surpassing previous automated evaluators such as GPT-4o on multimodal tasks.
- 🔎 Bias Mitigation: The scoring model is fine-tuned for fair assessment, enabling more objective model evaluations.
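
To make "human-score correlation" concrete, here is a minimal sketch of how agreement between an automated metric and human ratings could be measured with the Pearson coefficient. This is an illustration, not the MMIE-Score implementation: the `pearson` helper and every score below are invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for five model responses (NOT real MMIE data).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_a = [1.2, 1.9, 3.1, 4.2, 4.8]  # hypothetical metric that tracks humans closely
metric_b = [3.0, 1.0, 4.0, 2.0, 5.0]  # hypothetical metric that tracks humans loosely

corr_a = pearson(human_scores, metric_a)
corr_b = pearson(human_scores, metric_b)
print(f"metric_a vs human: {corr_a:.3f}")
print(f"metric_b vs human: {corr_b:.3f}")
```

A metric whose coefficient is closer to 1 ranks responses more like human annotators do, which is the sense in which one scoring method can surpass another.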
🔍 Key Insights:
- 🧠 In-depth Evaluation: MMIE covers 12 major fields (mathematics, coding, literature, and more) and 102 subfields, testing a broad range of competencies.
- 📈 Challenging the Best: Even top systems such as GPT-4o + SDXL peak at 65.47%, leaving substantial room for improvement in LVLMs.
- 🌐 Designed for Interleaved Tasks: The benchmark supports evaluation of both text and image comprehension, in multiple-choice and open-ended formats.
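
As a rough sketch of how results from the two question formats could be combined into per-field averages, consider the following. Everything here is assumed for illustration: the record keys (`field`, `format`, `reference`), the exact-match rule for multiple-choice answers, and the `toy_scorer` placeholder are hypothetical and do not reflect MMIE's actual data schema or MMIE-Score.

```python
from collections import defaultdict

def score_example(example, prediction, open_ended_scorer):
    """Return a score in [0, 1] for one prediction (hypothetical record layout)."""
    if example["format"] == "multiple_choice":
        # Assumed rule: case-insensitive exact match on the option letter.
        same = prediction.strip().upper() == example["reference"].strip().upper()
        return 1.0 if same else 0.0
    # Open-ended answers are delegated to a pluggable scorer.
    return open_ended_scorer(prediction, example["reference"])

def per_field_average(examples, predictions, open_ended_scorer):
    """Average the per-example scores within each field."""
    by_field = defaultdict(list)
    for example, prediction in zip(examples, predictions):
        score = score_example(example, prediction, open_ended_scorer)
        by_field[example["field"]].append(score)
    return {field: sum(vals) / len(vals) for field, vals in by_field.items()}

def toy_scorer(prediction, reference):
    """Placeholder scorer: exact string match. A real setup would use a learned metric."""
    return 1.0 if prediction == reference else 0.0

# Invented examples and predictions (NOT real MMIE data).
examples = [
    {"field": "mathematics", "format": "multiple_choice", "reference": "B"},
    {"field": "mathematics", "format": "multiple_choice", "reference": "C"},
    {"field": "literature", "format": "open_ended", "reference": "a red fox"},
]
predictions = ["b", "A", "a red fox"]

scores = per_field_average(examples, predictions, toy_scorer)
print(scores)
```

Keeping the open-ended scorer as a parameter lets the same aggregation run with any scoring function, from a simple string match to a model-based judge.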