Claude-3.5 Evaluation Results on Open VLM Leaderboard
Claude3.5-Sonnet is the latest large multimodal model released by Anthropic and the first model in the Claude 3.5 series. According to the official blog post, it surpasses its predecessor Claude3-Opus, as well as competing models such as Gemini-1.5-Pro, in multimodal understanding. To verify this, we tested Claude3.5-Sonnet on the eight objective image-text multimodal benchmarks used in the Open VLM Leaderboard.
Dataset \ Model | GPT-4o-20240513 | Claude3.5-Sonnet | Gemini-1.5-Pro | GPT-4v-20240409 | Claude3-Opus |
---|---|---|---|---|---|
Overall Rank | 1 | 2 | 3 | 4 | 16 |
Avg. Score | 69.9 | 67.9 | 64.4 | 63.5 | 54.4 |
MMBench v1.1 | 82.2 | 78.5 | 73.9 | 79.8 | 59.1 |
MMStar | 63.9 | 62.2 | 59.1 | 56.0 | 45.7 |
MMMU_VAL | 69.2 | 65.9 | 60.6 | 61.7 | 54.9 |
MathVista_MINI | 61.3 | 61.6 | 57.7 | 54.7 | 45.8 |
HallusionBench Avg. | 55.0 | 49.9 | 45.6 | 43.9 | 37.8 |
AI2D_TEST | 84.6 | 80.2 | 79.1 | 78.6 | 70.6 |
OCRBench | 736 | 788 | 754 | 656 | 694 |
MMVet | 69.1 | 66 | 64 | 67.5 | 51.7 |
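
The Avg. Score row can be reproduced from the per-benchmark rows with a short script. The sketch below assumes that the average is a plain mean over the eight benchmarks and that OCRBench, which is reported on a larger scale than the other benchmarks, is divided by 10 before averaging. This normalization is our assumption rather than something stated in the text, although it does match the table. The names `MODELS`, `SCORES`, and `average_scores` are illustrative only.

```python
# Scores transcribed from the table above; column order matches the header row.
MODELS = ["GPT-4o-20240513", "Claude3.5-Sonnet", "Gemini-1.5-Pro",
          "GPT-4v-20240409", "Claude3-Opus"]

SCORES = {
    "MMBench v1.1":        [82.2, 78.5, 73.9, 79.8, 59.1],
    "MMStar":              [63.9, 62.2, 59.1, 56.0, 45.7],
    "MMMU_VAL":            [69.2, 65.9, 60.6, 61.7, 54.9],
    "MathVista_MINI":      [61.3, 61.6, 57.7, 54.7, 45.8],
    "HallusionBench Avg.": [55.0, 49.9, 45.6, 43.9, 37.8],
    "AI2D_TEST":           [84.6, 80.2, 79.1, 78.6, 70.6],
    "OCRBench":            [736, 788, 754, 656, 694],
    "MMVet":               [69.1, 66, 64, 67.5, 51.7],
}


def average_scores(scores, ocr_key="OCRBench"):
    """Per-model mean over all benchmarks, rounded to one decimal.

    Assumption: the OCRBench row is divided by 10 so that it is on the
    same scale as the other benchmarks before averaging.
    """
    totals = [0.0] * len(MODELS)
    for name, row in scores.items():
        divisor = 10.0 if name == ocr_key else 1.0
        for i, value in enumerate(row):
            totals[i] += value / divisor
    return [round(total / len(scores), 1) for total in totals]


averages = dict(zip(MODELS, average_scores(SCORES)))
for model, avg in averages.items():
    print(f"{model}: {avg}")
```

Under these assumptions the script yields 69.9, 67.9, 64.4, 63.5, and 54.4 for the five models, in agreement with the Avg. Score row.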
The evaluation results show that Claude3.5-Sonnet improves substantially over Claude3-Opus on objective benchmarks: its average score across all benchmarks rises by more than 13 points (from 54.4 to 67.9), and its overall rank climbs from 16th to 2nd. Specifically, Claude3.5-Sonnet ranks in the top two on six of the eight benchmarks and achieves the best results in multimodal mathematics (MathVista_MINI) and optical character recognition (OCRBench).
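The per-benchmark claims in the paragraph above can be checked against the table with a small ranking script. This is only a sketch: it ranks Claude3.5-Sonnet among the five models listed in the table, which may not coincide with its position on the full leaderboard, and the names `MODELS`, `SCORES`, and `rank_of` are illustrative.

```python
# Scores transcribed from the table; column order matches the header row.
MODELS = ["GPT-4o-20240513", "Claude3.5-Sonnet", "Gemini-1.5-Pro",
          "GPT-4v-20240409", "Claude3-Opus"]

SCORES = {
    "MMBench v1.1":        [82.2, 78.5, 73.9, 79.8, 59.1],
    "MMStar":              [63.9, 62.2, 59.1, 56.0, 45.7],
    "MMMU_VAL":            [69.2, 65.9, 60.6, 61.7, 54.9],
    "MathVista_MINI":      [61.3, 61.6, 57.7, 54.7, 45.8],
    "HallusionBench Avg.": [55.0, 49.9, 45.6, 43.9, 37.8],
    "AI2D_TEST":           [84.6, 80.2, 79.1, 78.6, 70.6],
    "OCRBench":            [736, 788, 754, 656, 694],
    "MMVet":               [69.1, 66, 64, 67.5, 51.7],
}


def rank_of(model, row):
    """1-based rank of `model` within one benchmark row (higher is better)."""
    target = row[MODELS.index(model)]
    return 1 + sum(1 for value in row if value > target)


# Rank of Claude3.5-Sonnet on each benchmark, among the five listed models only.
ranks = {bench: rank_of("Claude3.5-Sonnet", row) for bench, row in SCORES.items()}
top_two = [bench for bench, rank in ranks.items() if rank <= 2]
best = [bench for bench, rank in ranks.items() if rank == 1]

print(f"Top two on {len(top_two)} of {len(ranks)} benchmarks")
print("Best on:", ", ".join(best))
```

Among the five listed models, Claude3.5-Sonnet falls outside the top two only on MMBench v1.1 and MMVet, where it ranks third.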
Potential issues: API models such as GPT-4o and Claude3.5-Sonnet are released with officially reported results on several multimodal evaluation benchmarks. Because the vendors have not made their test scripts public, we were unable to reproduce some of the officially reported accuracies (for example, on AI2D). If you can reproduce significantly higher accuracy on any benchmark, please contact us so that we can update the results: [email protected].
For more detailed results, please refer to the Open VLM Leaderboard.