VideoGameBunny: Towards vision assistants for video games
Abstract
Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities are limited in the video game domain: they struggle with scene understanding, hallucinate, and produce inaccurate descriptions of video game content, especially in the case of open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny and specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements for 136,974 of the images. Our experiments show that our high-quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVA-1.6-34b, which has more than 4x the number of parameters. Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/
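For readers who want to explore the released data, the sketch below shows one way it could be browsed with the Hugging Face `datasets` library. The repository ID and column names are assumptions for illustration only; the actual dataset identifiers are listed on the project page (https://videogamebunny.github.io/).

```python
# Minimal sketch (not the paper's official loading code): browsing the
# released image-instruction pairs with the Hugging Face `datasets` library.
# The repository ID and column names below are placeholder assumptions.
from datasets import load_dataset

# Hypothetical repository ID -- replace with the ID from the project page.
ds = load_dataset("VideoGameBunny/image-instruction-pairs", split="train")

sample = ds[0]
print(sample.keys())  # inspect the actual column names first

# Hypothetical column names: a game screenshot plus an instruction/response
# pair (caption, question-answer pair, or JSON scene description).
image = sample.get("image")
instruction = sample.get("instruction")
response = sample.get("response")
print(instruction, "->", response)
```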
Community
This paper introduces VideoGameBunny, a specialized large multimodal model trained on a video game image dataset, outperforming larger models in game-related tasks despite its smaller size.
Code and Dataset are available here: https://videogamebunny.github.io/
Hi @taesiri, congrats on your work!
It would be cool to link the dataset to this paper; see here for how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper
Cheers,
Niels
Open-source @ HF
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding (2024)
- AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding (2024)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (2024)
- Towards Event-oriented Long Video Understanding (2024)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (2024)