---
title: README
emoji: π’
colorFrom: blue
colorTo: gray
sdk: gradio
pinned: false
---
# MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
[Project] [Paper] [Code] [Dataset] [Evaluation Model] [Leaderboard]
We introduce MMIE, a robust, knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in large vision-language models (LVLMs). With 20K+ examples covering 12 fields and 102 subfields, MMIE sets a new standard for testing the depth of multimodal understanding.
## Key Features
- Comprehensive Dataset: With 20,103 interleaved multimodal questions, MMIE provides a rich foundation for evaluating models across diverse domains.
- Ground Truth Reference: Each query includes a reliable reference answer, so model outputs can be measured accurately.
- Automated Scoring with MMIE-Score: Our fine-tuned scoring model correlates strongly with human judgments, surpassing prior automated metrics, including GPT-4o-based scoring, for multimodal tasks (a minimal correlation check is sketched after this list).
- Bias Mitigation: The scoring model is fine-tuned to reduce systematic bias, enabling more objective model evaluations.
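The exact correlation protocol used for MMIE-Score is described in the paper; the snippet below is only a minimal sketch of how agreement between an automated metric and human scores can be checked, using illustrative placeholder numbers rather than MMIE results.

```python
# Minimal sketch: measuring how well an automated metric tracks human scores.
# The score values below are illustrative placeholders, not MMIE data.
from scipy.stats import pearsonr, spearmanr

# Paired scores for the same set of model outputs (hypothetical numbers).
human_scores     = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]
automated_scores = [3.8, 2.9, 3.2, 4.7, 1.8, 4.4]

pearson_r, _ = pearsonr(human_scores, automated_scores)
spearman_rho, _ = spearmanr(human_scores, automated_scores)
print(f"Pearson r: {pearson_r:.3f}, Spearman rho: {spearman_rho:.3f}")
```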
## Key Insights
- In-depth Evaluation: Covers 12 major fields (mathematics, coding, literature, and more) and 102 subfields for a comprehensive test across competencies.
- Challenging the Best: Even the strongest pipeline tested, GPT-4o + SDXL, peaks at 65.47%, leaving substantial room for improvement in LVLMs.
- Designed for Interleaved Tasks: The benchmark evaluates combined text and image comprehension and generation in both multiple-choice and open-ended formats (an example record is sketched after this list).
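To make the interleaved format concrete, here is a hypothetical record layout. The field names and values are illustrative assumptions only, not the benchmark's actual schema; consult the dataset itself for the real structure.

```python
# Hypothetical sketch of an interleaved MMIE-style record.
# All field names and values are illustrative assumptions, not the real schema.
example_record = {
    "id": "demo-001",
    "field": "mathematics",          # one of the 12 major fields
    "subfield": "geometry",          # one of the 102 subfields
    "format": "open-ended",          # or "multiple-choice"
    # Interleaved content: text and image placeholders in reading order.
    "question": [
        {"type": "text", "value": "The figure below shows a right triangle with legs 3 and 4."},
        {"type": "image", "value": "images/demo-001.png"},
        {"type": "text", "value": "Draw the altitude to the hypotenuse and give its length."},
    ],
    "reference": [
        {"type": "image", "value": "images/demo-001-answer.png"},
        {"type": "text", "value": "The altitude has length 12/5 = 2.4."},
    ],
}

# A model under evaluation produces an interleaved answer (text and/or images),
# which is then compared against the reference.
print(example_record["format"], len(example_record["question"]))
```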
## Dataset Details
MMIE is curated to evaluate models' abilities in interleaved multimodal comprehension and generation. The dataset features diverse examples, categorized and distributed across fields as illustrated above, ensuring balanced coverage of interleaved input/output tasks and supporting accurate, detailed model evaluations. A minimal loading sketch follows.
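The sketch below shows one way to load the benchmark with the Hugging Face `datasets` library. The repository id `MMIE/MMIE` is an assumption for illustration; use the actual id from the Dataset link above.

```python
# Minimal sketch of loading the benchmark with the `datasets` library.
# "MMIE/MMIE" is a hypothetical Hub repo id -- replace it with the real one.
from datasets import load_dataset

dataset = load_dataset("MMIE/MMIE")

print(dataset)                    # available splits and column names
split_name = next(iter(dataset))  # e.g. "test"
print(dataset[split_name][0])     # first interleaved example in that split
```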