RealWorldQA, What's New?
This is a short blog post that introduces the RealWorldQA benchmark.
What is RealWorldQA?
RealWorldQA is a benchmark contributed by xAI, designed to evaluate the real-world spatial understanding capabilities of multimodal AI models. It assesses how well these models comprehend physical environments. The benchmark consists of over 700 images, each accompanied by a question and a verifiable answer. The images are drawn from real-world scenarios, including many captured from vehicles. The goal is to advance AI models' understanding of our physical world.
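If you want to browse the data yourself, the sketch below loads it with the Hugging Face `datasets` library. The dataset ID, split, and field names are assumptions; verify them against the dataset card of the official release.

```python
# Minimal sketch for browsing RealWorldQA with Hugging Face `datasets`.
# Dataset ID, split, and field names are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("xai-org/RealworldQA", split="test")  # assumed ID and split
sample = ds[0]
print(sample["question"])  # question text (candidate choices are embedded)
print(sample["answer"])    # verifiable ground-truth answer
sample["image"].show()     # PIL image of the real-world scene
```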
Statistics & Info
| Name | Type | #Questions | Data Quality* (10% manually verified) | Fine-grained Classes |
| --- | --- | --- | --- | --- |
| RealWorldQA | MCQ | 765 | > 97% | No |
TL;DR: **RealWorldQA** is a benchmark that requires VLMs to:
- Recognize details in high-resolution images (1080p, etc.).
- Perform reasoning based on the recognition results (which may require commonsense knowledge).
*Data Quality: We manually verify 10% of the samples, checking whether each sample is correct and unambiguous. Most samples (>97%) in RealWorldQA are good and clear.
A few cases I found ambiguous:
- Question: Where is the dog in relation to the door?
- Choices: A. The dog is behind the door; B. The dog is next to the door; C. The dog is in front of the door.
- Answer: A
- Why ambiguous: The dog is actually between two doors.
- Question: How far from the camera is the rightmost vehicle?
- Choices: A. 15 meters; B. 35 meters; C. 55 meters.
- Answer: C
- Why ambiguous: Is the rightmost car that far?
Performance
Questions in RealWorldQA have 2 to 4 candidate choices (the majority have 3), so the expected Top-1 accuracy of random guessing is 37.7%.
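For intuition, here is how that expectation is computed: a question with k choices is answered correctly by a random guesser with probability 1/k. The split of questions by choice count is not published in this post, so the numbers in the snippet below are hypothetical, chosen only to reproduce the ~37.7% figure.

```python
# Expected Top-1 accuracy of random guessing on an MCQ benchmark:
# each question with k choices is answered correctly with probability 1/k.
from collections import Counter

# Hypothetical split of the 765 questions by number of choices,
# picked only so the result matches the reported ~37.7%.
choice_counts = Counter({2: 200, 3: 565})

total = sum(choice_counts.values())
expected_acc = sum(n / k for k, n in choice_counts.items()) / total
print(f"Expected random-guess accuracy: {expected_acc:.1%}")  # ~37.7%
```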
We perform the evaluation using VLMEvalKit and list the performance of representative VLMs (proprietary and open-source) below:
| Proprietary Models | Acc | Proprietary Models | Acc |
| --- | --- | --- | --- |
| GPT-4v (0409, low-res) | 61.4 | GPT-4v (0409, high-res) | 68.0 |
| GeminiPro-V (1.0) | 60.4 | QwenVLMax | 61.3 |

| Open-Source Models | Acc | Open-Source Models | Acc |
| --- | --- | --- | --- |
| InternLM-XComposer2 | 63.8 | InternVL-Chat-V1.5 | 65.6 |
| IDEFICS2-8B | 60.8 | LLaVA-NeXT (Yi-34B) | 66.0 |
| LLaVA-v1.5 (7B) | 54.8 | LLaVA-v1.5 (13B) | 55.3 |
Grok-v1.5 is not included since it's not publicly available.
Among the evaluated VLMs, GPT-4v (0409, high-res) achieves the best performance and significantly outperforms its low-res version (recall that RealWorldQA requires fine-grained recognition in high-resolution images). Meanwhile, top open-source VLMs also display competitive performance.
Hard Cases
We select the subset of questions that none of the Top-3 VLMs (GPT-4v (0409, high-res), InternVL-Chat-V1.5, and LLaVA-NeXT (Yi-34B)) answer correctly. The subset includes 101 samples. We visualize several random samples from the subset below, followed by a sketch of how such a subset can be derived.
- Question: Is the car closest to us driving in the same direction as us or in the opposite direction from us?
- Choices: A. Same direction; B. Opposite direction.
- Answer: B
- Requirement: 1. Locate the closest car and determine its direction; 2. Locate the lane we are in and infer our own direction.
- Question: In which direction is the one-way sign in this scene facing?
- Choices: A. Left; B. Right
- Answer: B
- Requirement: Localize the one-way sign and determine which way it faces.
- Question: Are there some STOP signs?
- Choices: A. Yes; B. No
- Answer: A
- Requirement: Localize the STOP sign (which is extremely small).
- Question: How many arrows are pointing right?
- Choices: A. 2; B. 3; C. 4
- Answer: B
- Requirement: Find all arrows on the road sign and recognize their directions.
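As mentioned above, here is a minimal sketch of how such a hard subset could be derived from per-model prediction files. The file names and column names (`index`, `prediction`, `answer`) are hypothetical placeholders; adapt them to your evaluation outputs.

```python
# Derive the "hard subset": questions that every Top-3 model gets wrong.
# File and column names below are hypothetical placeholders.
import pandas as pd

model_files = [
    "gpt4v_0409_highres.csv",
    "internvl_chat_v1_5.csv",
    "llava_next_yi_34b.csv",
]

wrong_sets = []
for path in model_files:
    df = pd.read_csv(path)
    # Collect indices of questions this model answered incorrectly.
    wrong_sets.append(set(df.loc[df["prediction"] != df["answer"], "index"]))

# Keep only the questions that all three models got wrong.
hard_subset = set.intersection(*wrong_sets)
print(f"{len(hard_subset)} hard samples")  # 101 for RealWorldQA
```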
Takeaway
- RealWorldQA is a benchmark that requires VLMs to: 1. Recognize details in high-resolution images (1080p, etc.); 2. Perform **reasoning based on recognition results** (may require commonsense knowledge).
- Performance numbers: Random guess: 37.7%; best proprietary VLM evaluated: GPT-4v (0409, high-res), 68.0; best open-source VLM evaluated: LLaVA-NeXT (Yi-34B), 66.0.
- You can use VLMEvalKit to evaluate your own VLM on RealWorldQA (a launch sketch follows below). Full evaluation results are available at the Open VLM Leaderboard.
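Below is a minimal launch sketch, assuming you run it from the root of a VLMEvalKit checkout. The model identifier is illustrative; the supported names depend on your VLMEvalKit version, so check the repo's documentation.

```python
# Minimal sketch: launch a RealWorldQA evaluation via VLMEvalKit's run.py.
# Run from the root of a VLMEvalKit checkout; the model name is illustrative.
import subprocess

subprocess.run(
    ["python", "run.py", "--data", "RealWorldQA", "--model", "llava_v1.5_7b"],
    check=True,  # raise if the evaluation script fails
)
```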