omni-research committed
Commit a551ac4
1 Parent(s): 49a0f83

Update README.md

Files changed (1)
  1. README.md +47 -54
README.md CHANGED
@@ -1,54 +1,47 @@
- ---
- license: apache-2.0
- tags:
- - video LLM
- ---
-
-
- # Tarsier Model Card
- ## Model details
- **Model type:**
- Tarsier-34b is an open-source large-scale video-language models, which is designed to generate high-quality video descriptions, together with good capability of general video understanding (SOTA results on 6 open benchmarks).
-
- **Model date:**
- Tarsier-34b was trained in June 2024.
-
- **Paper or resources for more information:**
- - github repo: https://github.com/bytedance/tarsier
- - paper link: https://arxiv.org/abs/2407.00634
-
- ## License
- NousResearch/Nous-Hermes-2-Yi-34B license.
-
- **Where to send questions or comments about the model:**
- https://github.com/bytedance/tarsier/issues
-
- ## Intended use
- **Primary intended uses:**
- The primary use of Tarsier is research on large multimodal models, especially video description.
-
- **Primary intended users:**
- The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
-
- ## Training dataset
- Tarsier tasks a two-stage training strategy.
- 1. Stage-1: Multi-task Pre-training
-
- In stage-1, we trained our model across:
- - 10M diverse public datasets, such as video captioning, video question answering, action recognition, multi-image understanding, and text generation.
- - 3.5M in-house data, including 2.4M high-quality video caption data similar to WebVid and 1.1M videos with object-tracking (processed on videos from Webvid and HD-VILA by object tracking tool: [DEVA](https://github.com/hkchengrex/Tracking-Anything-with-DEVA))
- 2. Stage-2: Multi-grained Instruction Tuning
-
- In stage-2, we use 500K of in-house instruction tuning data, including:
- - Movie clips featuring multiple shots, subjects, or events, and had annotators provide descriptions varying in length and detail, from brief motion summaries to comprehensive narratives of visual details.
- - A dataset rich in camera motions, including zooming, translating, panning, and rotating.
- - Video-aware creative writing, such as poems, dialogues, speeches.
-
- ## Evaluation dataset
- - A challenging video desription dataset: [DREAM-1K](https://huggingface.co/datasets/omni-research/DREAM-1K)
- - Multi-choice VQA: [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [NeXT-QA](https://github.com/doc-doc/NExT-QA) and [Egoschema](https://drive.google.com/drive/folders/1SS0VVz8rML1e5gWq7D7VtP1oxE2UtmhQ)
- - Open-ended VQA: [MSVD-QA](https://opendatalab.com/OpenDataLab/MSVD), [MSR-VTT-QA](https://opendatalab.com/OpenDataLab/MSR-VTT), [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) and [TGIF-QA](https://opendatalab.com/OpenDataLab/TGIF-QA)
- - Video Caption: [MSVD-Caption](https://opendatalab.com/OpenDataLab/MSVD), [MSRVTT-Caption](https://opendatalab.com/OpenDataLab/MSR-VTT), [VATEX](https://eric-xw.github.io/vatex-website/about.html)
-
- ## How to Use
- see https://github.com/bytedance/tarsier?tab=readme-ov-file#usage
+ ---
+ license: apache-2.0
+ tags:
+ - video LLM
+ ---
+
+
+ # Tarsier Model Card
+ ## Model details
+ **Model type:**
+ Tarsier-34b is an open-source, large-scale video-language model designed to generate high-quality video descriptions, with strong general video understanding capability (SOTA results on 6 open benchmarks).
+
+ **Model date:**
+ Tarsier-34b was trained in June 2024.
+
+ **Paper or resources for more information:**
+ - GitHub repo: https://github.com/bytedance/tarsier
+ - Paper: https://arxiv.org/abs/2407.00634
+
+ ## License
+ NousResearch/Nous-Hermes-2-Yi-34B license.
+
+ **Where to send questions or comments about the model:**
+ https://github.com/bytedance/tarsier/issues
+
+ ## Intended use
+ **Primary intended uses:**
+ The primary use of Tarsier is research on large multimodal models, especially video description.
+
+ **Primary intended users:**
+ The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+
+ ## Training dataset
+ Tarsier adopts a two-stage training strategy:
+ - Stage-1: Multi-task Pre-training on 13M samples
+ - Stage-2: Multi-grained Instruction Tuning on 500K samples
+
+ In both stages, we freeze the ViT and train all parameters of the projection layer and the LLM.
+
+ ## Evaluation dataset
+ - A challenging video description dataset: [DREAM-1K](https://huggingface.co/datasets/omni-research/DREAM-1K)
+ - Multi-choice VQA: [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [NExT-QA](https://github.com/doc-doc/NExT-QA) and [EgoSchema](https://drive.google.com/drive/folders/1SS0VVz8rML1e5gWq7D7VtP1oxE2UtmhQ)
+ - Open-ended VQA: [MSVD-QA](https://opendatalab.com/OpenDataLab/MSVD), [MSR-VTT-QA](https://opendatalab.com/OpenDataLab/MSR-VTT), [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) and [TGIF-QA](https://opendatalab.com/OpenDataLab/TGIF-QA)
+ - Video Caption: [MSVD-Caption](https://opendatalab.com/OpenDataLab/MSVD), [MSRVTT-Caption](https://opendatalab.com/OpenDataLab/MSR-VTT), [VATEX](https://eric-xw.github.io/vatex-website/about.html)
+
+ ## How to Use
+ See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage
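
The updated "Training dataset" section states that the ViT is frozen while the projection layer and the LLM are trained in both stages. As a point of reference only, below is a minimal PyTorch sketch of that freezing scheme. The submodule names (`vision_tower`, `multi_modal_projector`, `language_model`) follow the LLaVA-style layout used by `transformers` and are an assumption; the actual training code lives in the GitHub repo linked above.

```python
# Minimal sketch of the two-stage freezing scheme: ViT frozen,
# projector + LLM trainable. Module names are assumed (LLaVA-style layout);
# see https://github.com/bytedance/tarsier for the real training code.
import torch


def set_trainable(model: torch.nn.Module) -> list[torch.nn.Parameter]:
    # Freeze the vision encoder (ViT).
    for p in model.vision_tower.parameters():
        p.requires_grad = False

    # Train every parameter of the projection layer and the LLM.
    trainable = []
    for module in (model.multi_modal_projector, model.language_model):
        for p in module.parameters():
            p.requires_grad = True
            trainable.append(p)
    return trainable


# Hypothetical usage: the optimizer only sees the trainable parameters.
# optimizer = torch.optim.AdamW(set_trainable(model), lr=1e-5)
```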
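For "How to Use", the linked GitHub README is authoritative. Purely as an illustrative sketch, assuming the `omni-research/Tarsier-34b` checkpoint can be loaded through the generic LLaVA classes in `transformers` (an assumption; Tarsier ships its own modeling and preprocessing code), loading the weights would look roughly like this:

```python
# Hypothetical loading sketch; the repo's own inference code
# (https://github.com/bytedance/tarsier) is the reference implementation,
# and its video-frame preprocessing is not reproduced here.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "omni-research/Tarsier-34b"

# Assumption: the checkpoint is loadable through the generic LLaVA classes.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Downstream steps (sampling video frames, building the prompt, and calling
# model.generate) should follow the repo's quick-start scripts rather than
# anything shown here.
```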