arxiv:2409.20018

Visual Context Window Extension: A New Perspective for Long Video Understanding

Published on Sep 30 · Submitted by hcwei on Oct 2
Authors:

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive performance on short video understanding tasks but face significant challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding capabilities in modeling long texts. Existing work attempts to address this issue by introducing long video-text pairs during training; however, these approaches require substantial computational and data resources. In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. We first conduct an in-depth analysis of why pretrained LMMs struggle to understand lengthy video content, identifying that discrepancies between the visual and language modalities lead to different context windows for visual and language tokens, making it difficult to directly extend the visual tokens to match the language context window. Based on this, we propose to adapt LMMs for long video understanding tasks by extending the visual context window, eliminating the need for retraining on large-scale long video datasets. To further mitigate the significant memory consumption caused by long sequences, we introduce a progressive pooling inference strategy that selectively adjusts the spatial resolution of frame embeddings, reducing the number of visual tokens while retaining important spatial information. Across multiple long video understanding benchmarks, our method consistently improves performance as the number of video frames increases. On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B. Additionally, in the 256-frame setting, our method reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
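The paper does not include code on this page, but the core idea of fitting a long visual token sequence into the visual context window seen during training can be illustrated with a position-interpolation-style rescaling. The snippet below is a minimal sketch under the assumption of a RoPE-style model whose visual token positions can simply be rescaled; `scaled_visual_positions` and `trained_visual_window` are hypothetical names for illustration, not the paper's actual API.

```python
import torch

def scaled_visual_positions(num_visual_tokens: int,
                            trained_visual_window: int) -> torch.Tensor:
    """Rescale positions of a long visual token sequence so they fall
    within the visual context window the LMM saw during training
    (analogous to position interpolation for text context windows)."""
    positions = torch.arange(num_visual_tokens, dtype=torch.float32)
    if num_visual_tokens <= trained_visual_window:
        return positions  # short videos need no rescaling
    scale = trained_visual_window / num_visual_tokens
    return positions * scale  # compress position indices into the trained window
```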

Community


In this paper, we address the challenge of long video understanding from the perspective of context windows, avoiding the resource consumption associated with retraining on long video datasets.

  • By decomposing the effective context window of LMMs into separate visual and language context windows, we propose visual context window extension. This approach allows LMMs trained on short videos to be applied to long video understanding tasks without fine-tuning.
  • Additionally, we introduce a progressive pooling strategy to mitigate the memory consumption caused by long sequences (see the sketch below this list).
  • On the MLVU benchmark, our method outperforms GPT-4o, even though our model size is only 7B.
  • In a 256-frame setting, this strategy reduces memory usage by approximately 45% compared to the baseline, without introducing any performance loss.
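The paper's exact policy for deciding which frame embeddings keep full spatial resolution is not reproduced here; the snippet below is only a hedged sketch of the general idea, pooling the embeddings of later frames to a coarser spatial grid so the total number of visual tokens shrinks. `keep_full` and `pooled_size` are illustrative parameters, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def progressive_pool(frame_embeddings: torch.Tensor,
                     keep_full: int = 32,
                     pooled_size: int = 2) -> torch.Tensor:
    """frame_embeddings: (T, C, H, W) patch embeddings for T frames.
    Keep the first `keep_full` frames at full spatial resolution and
    average-pool the rest to pooled_size x pooled_size, then flatten
    everything into one visual token sequence of shape (N, C)."""
    tokens = []
    for t, frame in enumerate(frame_embeddings):          # frame: (C, H, W)
        if t >= keep_full:
            frame = F.adaptive_avg_pool2d(frame, pooled_size)
        tokens.append(frame.flatten(1).transpose(0, 1))   # (H*W or p*p, C)
    return torch.cat(tokens, dim=0)
```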

We hope this work advances research in long video understanding and provides insights for the design of future long-video models.


