---
title: README
emoji: π’
colorFrom: blue
colorTo: gray
sdk: gradio
pinned: false
---
# MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

[Project] [Paper] [Code] [Dataset] [Evaluation Model] [Leaderboard]
We introduce MMIE, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). With 20K+ examples spanning 12 fields and 102 subfields, MMIE sets a new standard for testing the depth of multimodal understanding.
## Key Features
- Comprehensive Dataset: 20,103 interleaved multimodal questions provide a rich foundation for evaluating models across diverse domains (see the loading sketch after this list).
- Ground Truth Reference: Each query includes a reliable reference, so model outputs can be measured accurately.
- Automated Scoring with MMIE-Score: Our scoring model achieves high correlation with human judgments, surpassing previous metrics such as GPT-4o-based scoring for multimodal tasks.
- Bias Mitigation: The scoring model is fine-tuned for fair assessment, enabling more objective model evaluations.
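
As a minimal sketch of how the benchmark data could be pulled with the Hugging Face `datasets` library: the repository id and split name below are assumptions for illustration, not confirmed identifiers; substitute the ids published on the dataset page.

```python
# Hypothetical loading sketch -- "MMIE/MMIE" and "test" are assumed names,
# not confirmed identifiers from the dataset page.
from datasets import load_dataset

dataset = load_dataset("MMIE/MMIE", split="test")

# Each record interleaves text and images; inspect one example to see its fields.
example = dataset[0]
print(example.keys())
```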
## Key Insights
- In-depth Evaluation: Covers 12 major fields (mathematics, coding, literature, and more) and 102 subfields for a comprehensive test across competencies.
- Challenging the Best: Even top pipelines such as GPT-4o + SDXL peak at 65.47%, highlighting substantial room for growth in LVLMs.
- Designed for Interleaved Tasks: The benchmark evaluates both text and image comprehension in multiple-choice and open-ended formats; a sketch of how per-field results might be aggregated follows this list.
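
The sketch below shows one way per-field results could be rolled up into a single table. The record keys (`field`, `question_type`, `score`) are hypothetical and shown only for illustration; MMIE's own evaluation model and leaderboard define the authoritative scoring pipeline.

```python
# Hypothetical aggregation sketch: multiple-choice items contribute 0/1
# correctness, open-ended items contribute a normalized score from the
# scoring model. Record keys are assumed, not taken from the MMIE toolkit.
from collections import defaultdict

def aggregate(results):
    """results: list of dicts with hypothetical keys 'field', 'question_type', 'score'."""
    per_field = defaultdict(list)
    for r in results:
        per_field[r["field"]].append(r["score"])
    # Average the scores collected for each field.
    return {field: sum(scores) / len(scores) for field, scores in per_field.items()}

demo = [
    {"field": "mathematics", "question_type": "multiple_choice", "score": 1.0},
    {"field": "mathematics", "question_type": "open_ended", "score": 0.72},
    {"field": "coding", "question_type": "multiple_choice", "score": 0.0},
]
print(aggregate(demo))  # {'mathematics': 0.86, 'coding': 0.0}
```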