Finding tasks that are hard for vision models but easy for humans: MAD magazine 'fold-ins'
Hi everyone, and thank you, WildVision team, for making this awesome tool available. I selfishly hope you and Hugging Face can cope with the scaling needs as people notice this exists and demand and usage surge.
After sharing a review and a few usage examples on my website (https://talkingtochatbots.com/vision), I wanted to share another one that came to mind as I was flipping through a very old (1980) MAD magazine... MAD fold-ins (https://en.wikipedia.org/wiki/Mad_Fold-in) are visual riddles whose answer you can try to guess by inspecting the image and then reveal by simply folding the magazine page.
Surprisingly, this turned out to be a very hard task for every vision model. After many tries in the Arena, none of the bots could even get close to solving the very simple case I'm sharing here. Even in the ChatGPT Plus app, after some intense "encouraging prompting" ('I pay $23 a month,' 'Don't be lazy,' 'Even a toddler chatbot would solve this...'), it tried to build a Python script to simulate the folding but was still unsuccessful (see screen capture). Just a fun example that might encourage people to test, and developers to improve, their models...
Thank you for your interest in our work. We've noticed the increase in usage and are actively working on scaling our resources to meet the demand.
The example you provided is quite innovative! Indeed, we aim to help the community identify the limitations of these models, thereby enabling continuous improvement.
The MAD magazine 'fold-ins' are really interesting. As a human, I managed to solve the question by first identifying the subsections of the image separated by the fold lines, and then reading the right side of the left-most subsection together with the left side of the right-most subsection. I think a model needs to really understand the layout of the subsections in the image in order to solve the question.
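Out of curiosity, that folding step is easy to reproduce in code. Below is a minimal sketch with Pillow, assuming the two fold lines split the page into three roughly equal-width panels; a real fold-in marks the exact fold positions, so you would want those coordinates instead:

```python
# Minimal sketch: physically "fold" a fold-in image by hiding the middle panel
# and butting the outer panels together. Assumes three roughly equal-width panels;
# a real page would need the actual fold-line x-coordinates.
from PIL import Image

def simulate_fold_in(path: str) -> Image.Image:
    img = Image.open(path)
    w, h = img.size
    third = w // 3
    left_panel = img.crop((0, 0, third, h))        # left-most subsection (kept)
    right_panel = img.crop((w - third, 0, w, h))   # right-most subsection (kept)
    folded = Image.new(img.mode, (2 * third, h))
    folded.paste(left_panel, (0, 0))
    folded.paste(right_panel, (third, 0))          # join the two outer panels
    return folded

# Hypothetical usage:
# simulate_fold_in("mad_fold_in_1980.jpg").save("mad_folded.jpg")
```

This only mimics the physical fold, of course; the hard part for the models seems to be reasoning about the panel layout without doing the fold at all.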
I think this magazine image, like web screenshots and posters, is a multipanel image. At this point, I'd like to promote my own work, which analyzes Large Vision-Language Models (LVLMs) on multipanel images.
Website: https://sites.google.com/view/multipanelvqa/home
Paper: https://arxiv.org/abs/2401.15847
Data: https://huggingface.co/datasets/yfan1997/MultipanelVQA_real-world/
Our work introduces Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark that specifically challenges models to comprehend multipanel images. The benchmark comprises 6,600 questions and answers about multipanel images. As pointed out by @reddgr, we also find that while these questions are straightforward for average humans, who answer them with nearly perfect accuracy, they pose significant challenges to the state-of-the-art LVLMs we tested. Additionally, we provide a comprehensive error analysis using synthetically curated multipanel images designed to isolate and evaluate the impact of different factors on model performance, revealing the sensitivity of LVLMs to various interferences in multipanel images, such as adjacent subfigures and layout complexity.
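If anyone wants to run these multipanel questions against their own model, here is a minimal sketch for loading the real-world data with the `datasets` library; the exact splits and field names are assumptions, so please check the dataset card for the actual schema:

```python
# Minimal sketch: pull the MultipanelVQA real-world data from the Hugging Face Hub.
# Split names and fields below are assumptions; inspect the printed DatasetDict
# and the dataset card to see the actual structure.
from datasets import load_dataset

ds = load_dataset("yfan1997/MultipanelVQA_real-world")
print(ds)                  # shows the available splits and their sizes

first_split = next(iter(ds.values()))
example = first_split[0]
print(example.keys())      # e.g., image, question, answer fields (names may differ)
```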