---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: video-text-to-text
---

# Model Card for LLaVA-Video-LLaMA-3

Please see my GitHub repo [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) for more details on fine-tuning the LLaVA model with Llama-3 as the foundation LLM.

## Updates
- [6/4/2024] The codebase supports fine-tuning on video data for video understanding tasks.
- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest llama-3, phi-3, and mistral-v0.1-7b models.

## Model Details
- Video Frame Sampling: Because we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per image) and the context window of LLaMA-3 is 8k tokens, the video frame sampling rate is set to `max(30, num_frames//10)`.
- Template: We follow the LLaVA-v1 template for constructing the conversation.
- Architecture: LLaVA architecture (visual encoder + MLP adapter + LLM backbone).

## How to Use

First, install llava via
```
pip install git+https://github.com/Victorwz/LLaVA-Unified.git
```

You can then load the model and perform inference as follows:
```python
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from PIL import Image
import requests
import cv2
import torch
import base64
import io

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Video-Llama-3", None, model_name, False, False, device=device)

# prepare video input
url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

def read_video(video_url):
    # download the video to a temporary local file
    response = requests.get(video_url, stream=True)
    if response.status_code != 200:
        raise RuntimeError("Failed to download video")
    with open("tmp_video.mp4", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)

    # decode every frame and keep it as a base64-encoded JPEG
    video = cv2.VideoCapture("tmp_video.mp4")
    base64Frames = []
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
    video.release()
    print(len(base64Frames), "frames read.")
    return base64Frames

video_frames = read_video(video_url=url)

# sample roughly 10 frames; the interval is at least 1 so short clips do not break range()
image_tensors = []
sampling_interval = max(1, len(video_frames) // 10)
for i in range(0, len(video_frames), sampling_interval):
    rawbytes = base64.b64decode(video_frames[i])
    image = Image.open(io.BytesIO(rawbytes)).convert("RGB")
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].half().to(device)
    image_tensors.append(image_tensor)

# prepare inputs for the model: one <image> placeholder per sampled frame, then the question
text = "\n".join(['<image>' for i in range(len(image_tensors))]) + '\n' + "Why is this video funny"
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# -200 is the image token index that replaces each <image> placeholder
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(outputs[0])
```

The model's response looks like:
```
The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
```

# Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
Please refer to our [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) git repo for fine-tuning data preparation and scripts. The data loading function and the fastchat conversation template were changed because Llama-3 uses a different tokenizer.

## Citation

```bibtex
@misc{wang2024llavavideollama3,
  title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```
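
## Frame Sampling Sketch

As a rough check of the frame-sampling rule given under Model Details, the sketch below estimates how many visual tokens a video contributes. It rests on two assumptions not stated in the card: that `max(30, num_frames//10)` is the stride (in frames) between sampled frames, and that the "8k" context window means 8192 tokens. The helper names are hypothetical and are not part of the LLaVA-Unified codebase.

```python
CONTEXT_WINDOW = 8192    # assumption: "8k" context of LLaMA-3 taken as 8192 tokens
TOKENS_PER_FRAME = 576   # CLIP-ViT-L-336px tokens per image, from Model Details


def sampling_interval(num_frames: int) -> int:
    """Stride between sampled frames (assumed reading of the card's formula)."""
    return max(30, num_frames // 10)


def sampled_indices(num_frames: int) -> list[int]:
    """Indices of the frames kept when stepping through the video at that stride."""
    return list(range(0, num_frames, sampling_interval(num_frames)))


def visual_tokens(num_frames: int) -> int:
    """Visual tokens contributed by the sampled frames."""
    return len(sampled_indices(num_frames)) * TOKENS_PER_FRAME


if __name__ == "__main__":
    for n in (60, 300, 1005):
        kept = len(sampled_indices(n))
        print(f"{n} frames -> {kept} sampled, {visual_tokens(n)} visual tokens")
```

Under these assumptions at most 11 frames are kept (for example, 1005 frames with a stride of 100), so the visual tokens stay below the context window and leave room for the text prompt and the generated answer.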