RealWorldQA, What's New?
This is a short blog post that introduces the RealWorldQA benchmark.
What is RealWorldQA?
RealWorldQA is a benchmark contributed by xAI, designed to evaluate the real-world spatial understanding capabilities of multimodal AI models. It assesses how well these models comprehend physical environments. The benchmark consists of over 700 images, each accompanied by a question and a verifiable answer. The images are drawn from real-world scenarios, including many captured from vehicles. The goal is to advance AI models' understanding of our physical world.
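If you want to browse the data yourself, the sketch below loads it with the Hugging Face `datasets` library. The dataset ID, split, and field names are assumptions; verify them against the dataset card of the official release.

```python
# Minimal sketch for browsing RealWorldQA with Hugging Face `datasets`.
# Dataset ID, split, and field names are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("xai-org/RealworldQA", split="test")  # assumed ID and split
sample = ds[0]
print(sample["question"])  # question text (candidate choices are embedded)
print(sample["answer"])    # verifiable ground-truth answer
sample["image"].show()     # PIL image of the real-world scene
```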
Statistics & Info
| Name | Type | #Questions | Data Quality* (10% manually verified) | Fine-grained Classes |
| --- | --- | --- | --- | --- |
| RealWorldQA | MCQ | 765 | > 97% | No |
TL;DR: **RealWorldQA** is a benchmark that requires VLMs to:
- Recognize details in high-resolution images (1080p, etc.).
- Perform reasoning based on the recognition results (which may require commonsense knowledge).
*Data Quality: We manually verify 10% of the samples, checking whether each sample is correct and unambiguous. Most samples (>97%) in RealWorldQA are good and clear.
A few cases I found ambiguous:
- Question: Where is the dog in relation to the door?
- Choices: A. The dog is behind the door; B. The dog is next to the door; C. The dog is in front of the door.
- Answer: A
- Why ambiguous: The dog is actually between two doors.
- Question: How far from the camera is the rightmost vehicle?
- Choices: A. 15 meters; B. 35 meters; C. 55 meters.
- Answer: C
- Why ambiguous: Is the rightmost car that far?
Performance
Questions in RealWorldQA have 2 to 4 candidate choices (the majority have 3), so the expected Top-1 accuracy of random guessing is 37.7%.
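For intuition, here is how that expectation is computed: a question with k choices is answered correctly by a random guesser with probability 1/k. The split of questions by choice count is not published in this post, so the numbers in the snippet below are hypothetical, chosen only to reproduce the ~37.7% figure.

```python
# Expected Top-1 accuracy of random guessing on an MCQ benchmark:
# each question with k choices is answered correctly with probability 1/k.
from collections import Counter

# Hypothetical split of the 765 questions by number of choices,
# picked only so the result matches the reported ~37.7%.
choice_counts = Counter({2: 200, 3: 565})

total = sum(choice_counts.values())
expected_acc = sum(n / k for k, n in choice_counts.items()) / total
print(f"Expected random-guess accuracy: {expected_acc:.1%}")  # ~37.7%
```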
We perform the evaluation using VLMEvalKit and list the performance of representative VLMs (proprietary and open-source) below:
| Proprietary Models | Acc | Proprietary Models | Acc |
| --- | --- | --- | --- |
| GPT-4v (0409, low-res) | 61.4 | GPT-4v (0409, high-res) | 68.0 |
| GeminiPro-V (1.0) | 60.4 | QwenVLMax | 61.3 |

| Open-Source Models | Acc | Open-Source Models | Acc |
| --- | --- | --- | --- |
| InternLM-XComposer2 | 63.8 | InternVL-Chat-V1.5 | 65.6 |
| IDEFICS2-8B | 60.8 | LLaVA-NeXT (Yi-34B) | 66.0 |
| LLaVA-v1.5 (7B) | 54.8 | LLaVA-v1.5 (13B) | 55.3 |
Grok-v1.5 is not included since it's not publicly available.
Among the evaluated VLMs, GPT-4v (0409, high-res) achieves the best performance and significantly outperforms its low-res version (recall that RealWorldQA requires fine-grained recognition in high-resolution images). Meanwhile, top open-source VLMs also display competitive performance.
Hard Cases
We select the subset of questions that none of the Top-3 VLMs (GPT-4v (0409, high-res), InternVL-Chat-V1.5, and LLaVA-NeXT (Yi-34B)) answer correctly. The subset includes 101 samples. We visualize several random samples from the subset below, followed by a sketch of how such a subset can be derived.
- Question: Is the car closest to us driving in the same direction as us or in the opposite direction from us?
- Choices: A. Same direction; B. Opposite direction.
- Answer: B
- Requirement: 1. Locate the closest car and determine its direction; 2. Locate the lane we are in and infer our own direction.
- Question: In which direction is the one-way sign in this scene facing?
- Choices: A. Left; B. Right
- Answer: B
- Requirement: Localize the one-way sign and determine which way it faces.
- Question: Are there some STOP signs?
- Choices: A. Yes; B. No
- Answer: A
- Requirement: Localize the STOP sign (which is extremely small).
- Question: How many arrows are pointing right?
- Choices: A. 2; B. 3; C. 4
- Answer: B
- Requirement: Find all arrows on the road sign and recognize their directions.
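As mentioned above, here is a minimal sketch of how such a hard subset could be derived from per-model prediction files. The file names and column names (`index`, `prediction`, `answer`) are hypothetical placeholders; adapt them to your evaluation outputs.

```python
# Derive the "hard subset": questions that every Top-3 model gets wrong.
# File and column names below are hypothetical placeholders.
import pandas as pd

model_files = [
    "gpt4v_0409_highres.csv",
    "internvl_chat_v1_5.csv",
    "llava_next_yi_34b.csv",
]

wrong_sets = []
for path in model_files:
    df = pd.read_csv(path)
    # Collect indices of questions this model answered incorrectly.
    wrong_sets.append(set(df.loc[df["prediction"] != df["answer"], "index"]))

# Keep only the questions that all three models got wrong.
hard_subset = set.intersection(*wrong_sets)
print(f"{len(hard_subset)} hard samples")  # 101 for RealWorldQA
```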
Takeaway
- RealWorldQA is a benchmark that requires VLMs to: 1. Recognize details in high-resolution images (1080p, etc.); 2. Perform **reasoning based on recognition results** (may require commonsense knowledge).
- Performance numbers: Random guess: 37.7%; best proprietary VLM evaluated: GPT-4v (0409, high-res), 68.0; best open-source VLM evaluated: LLaVA-NeXT (Yi-34B), 66.0.
- You can use VLMEvalKit to evaluate your own VLM on RealWorldQA (a launch sketch follows below). Full evaluation results are available at the Open VLM Leaderboard.
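Below is a minimal launch sketch, assuming you run it from the root of a VLMEvalKit checkout. The model identifier is illustrative; the supported names depend on your VLMEvalKit version, so check the repo's documentation.

```python
# Minimal sketch: launch a RealWorldQA evaluation via VLMEvalKit's run.py.
# Run from the root of a VLMEvalKit checkout; the model name is illustrative.
import subprocess

subprocess.run(
    ["python", "run.py", "--data", "RealWorldQA", "--model", "llava_v1.5_7b"],
    check=True,  # raise if the evaluation script fails
)
```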