Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Abstract
Visual Question-Answering (VQA) has become a key use case in several applications to aid user experience, particularly after Vision-Language Models (VLMs) achieved good results in zero-shot inference. However, evaluating different VLMs against an application's requirements with a standardized framework remains challenging in practical settings. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts, while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.
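To make the "correlation with human judgments" figure concrete, here is a minimal sketch of how agreement between an automatic metric and human ratings could be measured. The abstract does not say which correlation coefficient was used, so Pearson correlation is an assumption here, and the score lists are made-up placeholders rather than data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-answer verdicts (1 = correct, 0 = incorrect).
# These are illustrative values only, not results from the paper.
metric_scores = [1, 0, 1, 1, 0, 0]
human_scores = [1, 0, 1, 0, 0, 1]

r = pearson(metric_scores, human_scores)
print(f"correlation with human judgments: {100 * r:.2f}%")
```

A rank-based coefficient such as Kendall's tau or Spearman's rho would be an equally plausible choice; the reported number is only comparable once the coefficient is known.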
Community
The paper introduces a comprehensive framework and a novel dataset to evaluate Vision-Language Models (VLMs) for Visual Question-Answering (VQA) tasks across task types, application domains, and knowledge types, along with GoEval, a new multimodal evaluation metric for aligning model outputs with human judgments.
Framework & Dataset: The paper proposes a new dataset and framework to evaluate VLMs for VQA tasks, categorizing tasks by type, domain, and knowledge requirements.
Novel Evaluation Metric: It introduces GoEval, a multimodal evaluation metric using GPT-4o, which aligns more closely with human judgments than traditional metrics.
Model Analysis: The study evaluates 10 VLMs and shows that no single model performs best across all tasks, which makes task-specific model selection necessary; proprietary models outperform open-source models in most cases.
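The selection idea summarized above can be sketched as picking, for each annotated category, the model with the highest mean score. This is a rough illustration under stated assumptions, not the paper's procedure: the record layout, the model names (`model_a`, `model_b`), the category labels, and the scores are all hypothetical.

```python
from collections import defaultdict

# Hypothetical evaluation records: (model, aspect, category, score).
# "aspect" is one of the three annotation axes named in the abstract
# (task type, application domain, knowledge type); every value below
# is a placeholder, not a result from the paper.
records = [
    ("model_a", "task_type", "chart_understanding", 0.60),
    ("model_a", "task_type", "chart_understanding", 0.70),
    ("model_b", "task_type", "chart_understanding", 0.80),
    ("model_a", "knowledge_type", "commonsense", 0.90),
    ("model_b", "knowledge_type", "commonsense", 0.50),
]


def best_model_per_category(rows):
    """Return {(aspect, category): (model, mean_score)} for the top model."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count]
    for model, aspect, category, score in rows:
        entry = totals[(aspect, category, model)]
        entry[0] += score
        entry[1] += 1

    best = {}
    for (aspect, category, model), (total, count) in totals.items():
        mean = total / count
        key = (aspect, category)
        if key not in best or mean > best[key][1]:
            best[key] = (model, mean)
    return best


best = best_model_per_category(records)
for (aspect, category), (model, mean) in sorted(best.items()):
    print(f"{aspect}/{category}: {model} (mean score {mean:.2f})")
```

In practice the choice would also weigh resource constraints such as cost and latency, which the abstract mentions but this sketch leaves out.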