metadata

tags:
  - vision-language model
  - llama
  - video understanding

LLaMA-VID Model Card

Model details

LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token.

Model type: LLaMA-VID is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. The repository is built on LLaVA.

Model date: llama-vid-7b-full-224-long-video was trained in November 2023.

License

Where to send questions or comments about the model: https://github.com/dvlab-research/LLaMA-VID/issues

Intended use

Primary intended uses: The primary use of LLaMA-VID is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training data

The model is trained on image data from the LLaVA-1.5 dataset and video data from the WebVid and ActivityNet datasets, including:

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following samples.
A mixture of 450K academic-task-oriented VQA samples.
40K ShareGPT samples.
232K video-caption pairs sampled from the WebVid 2.5M dataset.
98K videos from ActivityNet with QA pairs from Video-ChatGPT.
15K video QA pairs from our Long VideoQA dataset.
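
To make the relative size of each component easier to see, the counts listed above can be tallied in a few lines of Python. This is only an illustrative sketch: the numbers are copied from the list, while the names `MIXTURE_K` and `totals_by_source` and the grouping into "llava-1.5" and "video" are assumptions made for this example, not part of the LLaMA-VID codebase. The entries also count different units (pairs, samples, videos), so the total is a rough nominal figure.

```python
# Nominal counts in thousands, copied from the training data list.
# The source labels ("llava-1.5", "video") are an assumed grouping
# used for illustration only.
MIXTURE_K = [
    ("LAION/CC/SBU image-text pairs (BLIP captions)", "llava-1.5", 558),
    ("GPT-generated multimodal instruction data", "llava-1.5", 158),
    ("Academic-task-oriented VQA mixture", "llava-1.5", 450),
    ("ShareGPT data", "llava-1.5", 40),
    ("WebVid 2.5M video-caption pairs", "video", 232),
    ("ActivityNet videos with Video-ChatGPT QA", "video", 98),
    ("Long VideoQA pairs", "video", 15),
]


def totals_by_source(mixture):
    """Sum the nominal counts (in thousands) for each source label."""
    totals = {}
    for _name, source, count_k in mixture:
        totals[source] = totals.get(source, 0) + count_k
    return totals


totals = totals_by_source(MIXTURE_K)
grand_total = sum(totals.values())

for source, count_k in totals.items():
    share = count_k / grand_total
    print(f"{source}: {count_k}K ({share:.1%})")
print(f"total: {grand_total}K")
```

Under this grouping, the LLaVA-1.5 portion accounts for 1206K entries and the video portion for 345K, for a nominal total of 1551K.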