Claude-3.5 Evaluation Results on Open VLM Leaderboard

Community Article Published June 24, 2024

Claude3.5-Sonnet is the latest large multimodal model released by Anthropic and the first model in the Claude 3.5 series. According to the official blog, it surpasses earlier models such as its predecessor Claude3-Opus and Gemini-1.5-Pro in multimodal understanding. To verify this, we tested Claude3.5-Sonnet on eight objective image-text multimodal evaluation benchmarks in the Open VLM Leaderboard.

Dataset \ Model	GPT-4o-20240513	Claude3.5-Sonnet	Gemini-1.5-Pro	GPT-4v-20240409	Claude3-Opus
Overall Rank	1	2	3	4	16
Avg. Score	69.9	67.9	64.4	63.5	54.4
MMBench v1.1	82.2	78.5	73.9	79.8	59.1
MMStar	63.9	62.2	59.1	56.0	45.7
MMMU_VAL	69.2	65.9	60.6	61.7	54.9
MathVista_MINI	61.3	61.6	57.7	54.7	45.8
HallusionBench Avg.	55.0	49.9	45.6	43.9	37.8
AI2D_TEST	84.6	80.2	79.1	78.6	70.6
OCRBench	736	788	754	656	694
MMVet	69.1	66	64	67.5	51.7

The evaluation results show that the objective performance of Claude3.5-Sonnet has improved substantially over Claude3-Opus: the average score across all benchmarks rose by 13.5 points (from 54.4 to 67.9), and the overall ranking rose from 16th to 2nd. Specifically, Claude3.5-Sonnet ranked in the top two on six of the eight benchmarks, and achieved the best results in multimodal mathematics (MathVista_MINI) and optical character recognition (OCRBench).
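The summary figures above can be rechecked directly from the table. The following is a minimal sketch, not the leaderboard's own scoring code: the scores are copied from the table, and the helper names (`normalized`, `average_scores`, `ranks_for`) are made up for illustration. It assumes that OCRBench, which is reported on a 0-1000 scale, is divided by 10 before averaging; that assumption reproduces the "Avg. Score" row when rounded to one decimal.

```python
# Scores copied from the table above; each list follows the column order in MODELS.
MODELS = [
    "GPT-4o-20240513",
    "Claude3.5-Sonnet",
    "Gemini-1.5-Pro",
    "GPT-4v-20240409",
    "Claude3-Opus",
]

SCORES = {
    "MMBench v1.1":        [82.2, 78.5, 73.9, 79.8, 59.1],
    "MMStar":              [63.9, 62.2, 59.1, 56.0, 45.7],
    "MMMU_VAL":            [69.2, 65.9, 60.6, 61.7, 54.9],
    "MathVista_MINI":      [61.3, 61.6, 57.7, 54.7, 45.8],
    "HallusionBench Avg.": [55.0, 49.9, 45.6, 43.9, 37.8],
    "AI2D_TEST":           [84.6, 80.2, 79.1, 78.6, 70.6],
    "OCRBench":            [736, 788, 754, 656, 694],
    "MMVet":               [69.1, 66, 64, 67.5, 51.7],
}


def normalized(benchmark, value):
    """Put every benchmark on a 0-100 scale.

    Assumption: OCRBench is scored out of 1000, so it is divided by 10.
    """
    return value / 10 if benchmark == "OCRBench" else value


def average_scores():
    """Mean of the eight normalized benchmark scores for each model."""
    averages = {}
    for i, model in enumerate(MODELS):
        values = [normalized(name, row[i]) for name, row in SCORES.items()]
        averages[model] = sum(values) / len(values)
    return averages


def ranks_for(model):
    """Rank of `model` on each benchmark, among the five models listed only."""
    i = MODELS.index(model)
    return {
        name: 1 + sum(other > row[i] for other in row)
        for name, row in SCORES.items()
    }


averages = average_scores()
gain = averages["Claude3.5-Sonnet"] - averages["Claude3-Opus"]
ranks = ranks_for("Claude3.5-Sonnet")
top_two = [name for name, rank in ranks.items() if rank <= 2]
best = [name for name, rank in ranks.items() if rank == 1]

for model in MODELS:
    print(f"{model}: {averages[model]:.1f}")
print(f"Gain over Claude3-Opus: {gain:.1f} points")
print(f"Top-two benchmarks ({len(top_two)}): {top_two}")
print(f"Best-result benchmarks: {best}")
```

Note that the ranks computed here are relative to the five models in the table, not to every model on the leaderboard.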

Potential issues: API models such as GPT-4o and Claude3.5-Sonnet are released with officially reported performance on several multimodal evaluation benchmarks. Because the vendors have not made their test scripts public, we could not reproduce some of the officially reported accuracies (for example, on AI2D). If you can reproduce significantly higher accuracy on any benchmark, please contact us so we can update the results: [email protected].

For more detailed performance, please refer to the Open VLM Leaderboard.
