---
license: apache-2.0
language:
- en
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:
- video temporal grounding
- dense video caption
- video highlight detection
---

# TRACE: Temporal Grounding Video LLM via Causal Event Modeling

If our project helps you, please give us a star ⭐ on GitHub and cite our paper!
## 📰 News

- **[2024.11.01]** 🔥 We are excited to announce the release of [trace-uni](https://huggingface.co/Yongxin-Guo/trace-uni), enhanced with additional general video understanding data from a subset of [LLaVA-Video-178k](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). Our results indicate that trace-uni outperforms trace on both VTG tasks and general video understanding tasks.
- **[2024.10.19]** 🔥 We release [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval), which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
- **[2024.10.10]** 🔥 Our [code](https://github.com/gyxxyg/TRACE) and [paper](https://arxiv.org/abs/2410.05643) are released!
- **[2024.10.10]** 🔥 Our **checkpoints** are available now!

## Overview

In this work:

- We model videos as a series of events and propose a causal event modeling framework to capture videos' inherent structure.
- We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions.
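To make the causal event modeling idea concrete, here is a minimal sketch of how a video can be represented as an ordered sequence of events, each carrying a timestamp span, a salient score, and a caption, mirroring the sequential decoding order described above. All names here (`Event`, `render_video`, the serialization format) are illustrative assumptions, not the actual TRACE code:

```python
from dataclasses import dataclass

# Hypothetical sketch of the event structure described above: each event
# carries a timestamp span, a salient score, and a textual caption, and
# the video is an ordered (causal) sequence of such events.
@dataclass
class Event:
    start: float          # event start time (seconds)
    end: float            # event end time (seconds)
    salient_score: float  # saliency of the event
    caption: str          # textual description of the event

def render_video(events: list[Event]) -> str:
    """Serialize a video as a causally ordered event sequence, following
    the timestamps -> salient score -> caption decoding order."""
    parts = []
    for e in sorted(events, key=lambda e: e.start):
        parts.append(f"<{e.start:.1f}-{e.end:.1f}> [{e.salient_score:.2f}] {e.caption}")
    return " ".join(parts)

events = [
    Event(12.0, 20.5, 0.80, "crack eggs into a bowl"),
    Event(0.0, 11.5, 0.35, "gather ingredients on the counter"),
]
print(render_video(events))
```

Ordering events by start time before serialization is what makes the modeling causal: each decoded event is conditioned only on the events that precede it.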
## Model Zoo

| Checkpoints | Description | URL |
| ----------- | ----------- | ----------- |
| Initialization | Weights initialized from VideoLLaMA2 | [trace-init](https://huggingface.co/Yongxin-Guo/trace-init) |
| Stage-1 | Model checkpoints trained after stage-1 | [trace-stage1](https://huggingface.co/Yongxin-Guo/trace-stage1) |
| Stage-2 | Model checkpoints trained after stage-2 | [trace](https://huggingface.co/Yongxin-Guo/trace) |
| FT-Charades | Fine-tuned on the Charades-STA dataset | [trace-ft-charades](https://huggingface.co/Yongxin-Guo/trace-ft-charades) |
| FT-Youcook2 | Fine-tuned on the Youcook2 dataset | [trace-ft-youcook2](https://huggingface.co/Yongxin-Guo/trace-ft-youcook2) |
| FT-QVHighlights | Fine-tuned on the QVHighlights dataset | [trace-ft-qvhighlights](https://huggingface.co/Yongxin-Guo/trace-ft-qvhighlights) |
| TRACE-retrieval | Forces the predicted timestamps to align with input timestamps | [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval) |
| TRACE-uni | Incorporates additional general video understanding data from a subset of [LLaVA-Video-178k](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | [trace-uni](https://huggingface.co/Yongxin-Guo/trace-uni) |

#### Results

| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
| --- | --- | --- | --- | --- |
| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |

| Charades-STA (Zero-Shot) | R@0.3 | R@0.5 | R@0.7 | mIOU |
| --- | --- | --- | --- | --- |
| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |

| QVHighlights (Zero-Shot) | mAP | Hit@1 |
| --- | --- | --- |
| TRACE | 26.8 | 42.7 |
| TRACE-retrieval | 27.9 | 44.3 |

| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
| --- | --- | --- | --- | --- |
| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |

| ActivityNet-MR | R@0.3 | R@0.5 | R@0.7 | mIOU |
| --- | --- | --- | --- | --- |
| TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |