arxiv:2409.09269

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Published on Sep 14

Submitted by amanchadha on Sep 17

Authors: Neelabh Sinha, Aman Chadha

Abstract

Visual Question-Answering (VQA) has become a key use case in several applications to aid user experience, particularly since Vision-Language Models (VLMs) began achieving good results in zero-shot inference. However, evaluating different VLMs against an application's requirements with a standardized framework remains challenging in practical settings. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, which achieves a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and it can also be extended to other vision-language tasks.
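As a rough illustration of what agreement between an automatic metric and human judgments can mean, the sketch below computes a Pearson correlation between two lists of scores. The abstract does not state which correlation statistic GoEval's 56.71% refers to, so Pearson is an assumption here, and the scores are made-up toy values rather than results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: Pearson is used only for illustration; the paper's
    abstract does not say which correlation statistic it reports.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): 1 = answer judged correct, 0 = judged incorrect.
metric_scores = [1, 0, 1, 1, 0, 0]
human_labels = [1, 0, 1, 0, 0, 1]

agreement = pearson(metric_scores, human_labels)
print(f"correlation with human judgments: {agreement:.4f}")
```

A rank-based statistic such as Kendall's tau or Spearman's rho would be an equally plausible choice when the scores are ordinal.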


Community

amanchadha (paper author, paper submitter), Sep 17, edited Sep 17

The paper introduces a comprehensive framework and a novel dataset to evaluate Vision-Language Models (VLMs) for Visual Question-Answering (VQA) tasks across task types, application domains, and knowledge types, along with GoEval, a new multimodal evaluation metric for aligning model outputs with human judgments.
Framework & Dataset: The paper proposes a new dataset and framework to evaluate VLMs for VQA tasks, categorizing tasks by type, domain, and knowledge requirements.
Novel Evaluation Metric: It introduces GoEval, a multimodal evaluation metric using GPT-4o, which aligns more closely with human judgments than traditional metrics.
Model Analysis: The study evaluates 10 VLMs and shows that no single model performs best across all tasks, which makes task-specific model selection necessary; proprietary models come out ahead in most cases.
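The idea of task-specific model selection can be sketched as follows: given per-example scores tagged with a category (a task type, domain, or knowledge type), average them per model and pick the top model in each category. This is only a minimal sketch, not the paper's procedure; the function name, model names, categories, and scores are all hypothetical placeholders.

```python
from collections import defaultdict


def best_model_per_category(records):
    """Return {category: (model, mean_score)} for the top model per category.

    `records` is an iterable of (model, category, score) tuples.
    Hypothetical helper for illustration; not taken from the paper.
    """
    # Accumulate [sum of scores, count] for each (category, model) pair.
    totals = defaultdict(lambda: [0.0, 0])
    for model, category, score in records:
        entry = totals[(category, model)]
        entry[0] += score
        entry[1] += 1

    # Keep the model with the highest mean score in each category.
    best = {}
    for (category, model), (total, count) in totals.items():
        mean = total / count
        if category not in best or mean > best[category][1]:
            best[category] = (model, mean)
    return best


# Toy records with placeholder names and made-up scores.
records = [
    ("model_a", "chart", 0.8),
    ("model_a", "chart", 0.6),
    ("model_b", "chart", 0.5),
    ("model_a", "document", 0.4),
    ("model_b", "document", 0.9),
]

for category, (model, mean) in sorted(best_model_per_category(records).items()):
    print(f"{category}: {model} (mean score {mean:.2f})")
```

In practice the choice would also weigh resource constraints such as cost and latency, which this sketch ignores.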

