If our project helps you, please give us a star ⭐ on GitHub and cite our paper!
📰 News
- [2024.11.01] 🔥 We are excited to announce the release of trace-uni, which is trained with additional general video understanding data from a subset of LLaVA-Video-178k. Our results indicate that trace-uni outperforms trace on both VTG tasks and general video understanding tasks.
- [2024.10.19] 🔥 We release trace-retrieval, which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
- [2024.10.10] 🔥 Our code and paper are released!
- [2024.10.10] 🔥 Our checkpoints are available now!
Overview
In this work:
- We model videos as a series of events and propose a causal event modeling framework to capture the inherent structure of videos.
- We present a novel task-interleaved video LLM, TRACE, tailored to implement the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions.
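To make the event structure concrete, here is a minimal sketch of how a video could be represented as an ordered series of events, each flattened into a timestamp segment, a salient-score segment, and a caption segment. The `Event` class, the segment tags, and the text format are illustrative assumptions, not the actual TRACE tokenization or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One video event (hypothetical container, not part of the TRACE code)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    score: float   # salient score
    caption: str   # textual description


def encode_events(events: List[Event]) -> List[Tuple[str, str]]:
    """Flatten events into (task, payload) segments, ordered by start time.

    Each event contributes three segments in a fixed order:
    time -> score -> caption, so later events follow earlier ones.
    """
    segments = []
    for ev in sorted(events, key=lambda e: e.start):
        segments.append(("time", f"{ev.start:.1f},{ev.end:.1f}"))
        segments.append(("score", f"{ev.score:.1f}"))
        segments.append(("caption", ev.caption))
    return segments


def decode_events(segments: List[Tuple[str, str]]) -> List[Event]:
    """Rebuild events from the interleaved segment sequence."""
    events = []
    for i in range(0, len(segments), 3):
        (t_tag, t), (s_tag, s), (c_tag, c) = segments[i:i + 3]
        if (t_tag, s_tag, c_tag) != ("time", "score", "caption"):
            raise ValueError("segments are not in time/score/caption order")
        start, end = (float(x) for x in t.split(","))
        events.append(Event(start, end, float(s), c))
    return events
```

In this sketch the three tasks are interleaved per event, so decoding an event only depends on the segments that precede it.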
Model Zoo
| Checkpoints | Description | URL |
|---|---|---|
| Initialization | Weights initialized from VideoLLaMA2 | trace-init |
| Stage-1 | Model checkpoints trained after stage-1 | trace-stage1 |
| Stage-2 | Model checkpoints trained after stage-2 | trace |
| FT-Charades | Fine-tuned on Charades-STA dataset | trace-ft-charades |
| FT-Youcook2 | Fine-tuned on Youcook2 dataset | trace-ft-youcook2 |
| FT-QVHighlights | Fine-tuned on QVHighlights dataset | trace-ft-qvhighlights |
| TRACE-retrieval | Forces the predicted timestamps to align with the input frame timestamps | trace-retrieval |
| TRACE-uni | Incorporates additional general video understanding data from a subset of LLaVA-Video-178k | trace-uni |
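The TRACE-retrieval checkpoint is described as forcing predicted timestamps to align with the input frame timestamps. This README does not say how that alignment is done. One simple way to picture it is nearest-neighbor snapping, sketched below; the function name `snap_to_frames` and the tie-breaking rule are assumptions for illustration only.

```python
from bisect import bisect_left
from typing import Sequence


def snap_to_frames(pred: float, frame_times: Sequence[float]) -> float:
    """Return the sampled-frame timestamp closest to `pred`.

    `frame_times` must be sorted in ascending order. Ties go to the
    earlier frame (an arbitrary choice made for this sketch).
    """
    if not frame_times:
        raise ValueError("frame_times must not be empty")
    i = bisect_left(frame_times, pred)
    if i == 0:
        return frame_times[0]      # before the first frame
    if i == len(frame_times):
        return frame_times[-1]     # after the last frame
    before, after = frame_times[i - 1], frame_times[i]
    return before if pred - before <= after - pred else after
```

For example, with frames sampled every 2 seconds, a predicted timestamp of 2.9 s would be mapped to the frame at 2.0 s.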
Results
| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |
| TRACE-uni | 8.6 | 2.9 | 2.3 | 22.4 |
| Charades-STA (Zero-Shot) | 0.3 | 0.5 | 0.7 | mIOU |
|---|---|---|---|---|
| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |
| TRACE-uni | 63.7 | 43.7 | 21.0 | 41.5 |
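For readers unfamiliar with moment-retrieval metrics, the sketch below shows how such numbers are typically computed. It assumes the 0.3 / 0.5 / 0.7 columns report the percentage of queries whose top prediction reaches that temporal IoU with the ground truth, which is the common convention but is not stated in this README. The function names are hypothetical and this is not the project's evaluation script.

```python
from typing import Dict, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Span],
    gts: Sequence[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Recall at each IoU threshold and mean IoU, both in percent."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    out = {f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / n
           for t in thresholds}
    out["mIoU"] = 100.0 * sum(ious) / n
    return out
```

Under this convention, a prediction of (5, 15) against a ground truth of (10, 20) has an IoU of 5 / 15, so it counts as a hit at 0.3 but not at 0.5 or 0.7.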
| QVHighlights (Zero-Shot) | mAP | Hit@1 |
|---|---|---|
| TRACE | 26.8 | 42.7 |
| TRACE-retrieval | 27.9 | 44.3 |
| TRACE-uni | 27.5 | 43.9 |
| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |
| TRACE-uni | 29.2 | 6.9 | 6.4 | 40.4 |
| ActivityNet-MR | 0.3 | 0.5 | 0.7 | mIOU |
|---|---|---|---|---|
| TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |
| TRACE-uni | 53.2 | 38.2 | 24.7 | 39.4 |
| MVBench | Avg | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TRACE | 48.1 | 61.2 | 56.5 | 72.5 | 46.5 | 61.0 | 48.0 | 69.5 | 40.0 | 22.0 | 31.0 | 86.5 | 37.5 | 37.0 | 51.0 | 45.0 | 40.5 | 39.0 | 31.0 | 43.5 | 44.5 |
| TRACE-uni | 53.8 | 68.1 | 58.5 | 72.5 | 41.5 | 73.5 | 55.1 | 71.5 | 40.5 | 25.0 | 53.0 | 88.5 | 63.5 | 38.5 | 51.0 | 52.5 | 49.0 | 59.5 | 33.5 | 49.5 | 32.5 |
| VideoMME (w/o subtitle) | Short | Medium | Long | Avg |
|---|---|---|---|---|
| TRACE | 49.5 | 42.5 | 39.3 | 43.8 |
| TRACE-uni | 58.2 | 48.1 | 42.3 | 49.6 |
Bibliography
If you find this repository helpful for your project, please consider citing:
@misc{guo2024tracetemporalgroundingvideo,
      title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling},
      author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen},
      year={2024},
      eprint={2410.05643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.05643},
}