Grounded-VideoLLM Model Card

Grounded-VideoLLM is a Video-LLM built for fine-grained temporal grounding. It excels at grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, and it also shows strong potential as a versatile video assistant for general video understanding.

Model details

Model date:

Grounded-VideoLLM-Phi3.5-Vision-Instruct-4B was trained in Oct. 2024.

Grounded-VideoLLM-LLaVA-Next-Llama3-8B was trained in Oct. 2024.

Paper or resources for more information: Paper, Code

Citation

If you find our project useful, please star our repo and cite our paper as follows:

@article{wang2024grounded,
  title={Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models},
  author={Wang, Haibo and Xu, Zhiyang and Cheng, Yu and Diao, Shizhe and Zhou, Yufan and Cao, Yixin and Wang, Qifan and Ge, Weifeng and Huang, Lifu},
  journal={arXiv preprint arXiv:2410.03290},
  year={2024}
}