--- license: mit pipeline_tag: visual-question-answering --- # InternVL2-4B [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376) ## Introduction We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of instruction-tuned models, ranging from 2 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-4B model. Compared to the state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It demonstrates competitive performance on par with proprietary commercial models across various capabilities, including document and chart comprehension, infographics QA, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities. InternVL 2.0 is trained with an 8k context window and utilizes training data consisting of long texts, multiple images, and videos, significantly improving its ability to handle these types of inputs compared to InternVL 1.5. For more details, please refer to our blog and GitHub. ## Model Details InternVL2 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. InternVL2-4B consists of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct). ## Performance | Benchmark | PaliGemma-3B | Phi-3-Vision | Mini-InternVL-4B-1.5 | InternVL2-4B | | :--------------------------: | :----------: | :----------: | :------------------: | :----------: | | Model Size | 2.9B | 4.2B | 4.2B | 4.2B | | | | | | | | DocVQAtest | - | - | 87.7 | 89.2 | | ChartQAtest | - | 81.4 | 81.0 | 81.5 | | InfoVQAtest | - | - | 64.6 | 67.0 | | TextVQAval | 68.1 | 70.9 | 72.5 | 74.4 | | OCRBench | 614 | 639 | 638 | 788 | | MMEsum | 1686.1 | 1508.0 | 2053.6 | 2064.1 | | RealWorldQA | 55.2 | 58.8 | 60.1 | 60.7 | | AI2Dtest | 68.3 | 76.7 | 76.9 | 78.9 | | MMMUval | 34.9 | 40.4 | 43.3 | 47.0 | | MMBench-ENtest | 71.0 | 73.6 | 76.2 | 78.6 | | MMBench-CNtest | 63.6 | - | 70.3 | 73.9 | | CCBenchdev | 29.6 | 24.1 | 58.8 | 66.5 | | MMVetGPT-4-0613 | 33.1 | - | 46.7 | 55.7 | | SEED-Image | 69.6 | 70.9 | 72.5 | 73.7 | | HallBenchavg | 32.2 | 39.0 | 42.8 | 41.9 | | MathVistatestmini | 28.7 | 44.5 | 53.7 | 58.6 | - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit. - Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results. - It is important to mention that the MMVet scores we report are evaluated using GPT-4-0613 as the judge model. Different versions of GPT-4 can lead to significant variations in the scores for this dataset. For instance, using GPT-4-Turbo would result in significantly lower scores. ## Quick Start We provide an example code to run InternVL2-4B using `transformers`. > Please use transformers==4.37.2 to ensure the model works normally. ```python import torch import torchvision.transforms as T from PIL import Image from torchvision.transforms.functional import InterpolationMode from transformers import AutoModel, AutoTokenizer IMAGENET_MEAN = (0.485, 0.456, 0.406) IMAGENET_STD = (0.229, 0.224, 0.225) def build_transform(input_size): MEAN, STD = IMAGENET_MEAN, IMAGENET_STD transform = T.Compose([ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img), T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC), T.ToTensor(), T.Normalize(mean=MEAN, std=STD) ]) return transform def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size): best_ratio_diff = float('inf') best_ratio = (1, 1) area = width * height for ratio in target_ratios: target_aspect_ratio = ratio[0] / ratio[1] ratio_diff = abs(aspect_ratio - target_aspect_ratio) if ratio_diff < best_ratio_diff: best_ratio_diff = ratio_diff best_ratio = ratio elif ratio_diff == best_ratio_diff: if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: best_ratio = ratio return best_ratio def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False): orig_width, orig_height = image.size aspect_ratio = orig_width / orig_height # calculate the existing image aspect ratio target_ratios = set( (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if i * j <= max_num and i * j >= min_num) target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) # find the closest aspect ratio to the target target_aspect_ratio = find_closest_aspect_ratio( aspect_ratio, target_ratios, orig_width, orig_height, image_size) # calculate the target width and height target_width = image_size * target_aspect_ratio[0] target_height = image_size * target_aspect_ratio[1] blocks = target_aspect_ratio[0] * target_aspect_ratio[1] # resize the image resized_img = image.resize((target_width, target_height)) processed_images = [] for i in range(blocks): box = ( (i % (target_width // image_size)) * image_size, (i // (target_width // image_size)) * image_size, ((i % (target_width // image_size)) + 1) * image_size, ((i // (target_width // image_size)) + 1) * image_size ) # split the image split_img = resized_img.crop(box) processed_images.append(split_img) assert len(processed_images) == blocks if use_thumbnail and len(processed_images) != 1: thumbnail_img = image.resize((image_size, image_size)) processed_images.append(thumbnail_img) return processed_images def load_image(image_file, input_size=448, max_num=6): image = Image.open(image_file).convert('RGB') transform = build_transform(input_size=input_size) images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) pixel_values = [transform(image) for image in images] pixel_values = torch.stack(pixel_values) return pixel_values path = 'OpenGVLab/InternVL2-4B' model = AutoModel.from_pretrained( path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True).eval().cuda() tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True) # set the max number of tiles in `max_num` pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda() generation_config = dict( num_beams=1, max_new_tokens=1024, do_sample=False, ) # pure-text conversation (纯文本对话) question = 'Hello, who are you?' response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True) print(f'User: {question}') print(f'Assistant: {response}') question = 'Can you tell me a story?' response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True) print(f'User: {question}') print(f'Assistant: {response}') # single-image single-round conversation (单图单轮对话) question = '\nPlease describe the image shortly.' response = model.chat(tokenizer, pixel_values, question, generation_config) print(f'User: {question}') print(f'Assistant: {response}') # single-image multi-round conversation (单图多轮对话) question = '\nPlease describe the image in detail.' response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True) print(f'User: {question}') print(f'Assistant: {response}') question = 'Please write a poem according to the image.' response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True) print(f'User: {question}') print(f'Assistant: {response}') # multi-image multi-round conversation (多图多轮对话) pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda() pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda() pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0) num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)] question = 'Image-1: \nImage-2: \nDescribe the two images in detail.' response, history = model.chat(tokenizer, pixel_values, question, generation_config, num_patches_list=num_patches_list, history=None, return_history=True) print(f'User: {question}') print(f'Assistant: {response}') question = 'What are the similarities and differences between these two images.' response, history = model.chat(tokenizer, pixel_values, question, generation_config, num_patches_list=num_patches_list, history=history, return_history=True) print(f'User: {question}') print(f'Assistant: {response}') # batch inference, single image per sample (单图批处理) pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda() pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda() num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)] pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0) questions = ['\nDescribe the image in detail.'] * len(num_patches_list) responses = model.batch_chat(tokenizer, pixel_values, num_patches_list=num_patches_list, questions=questions, generation_config=generation_config) for question, response in zip(questions, responses): print(f'User: {question}') print(f'Assistant: {response}') ``` ## License This project is released under the MIT license. ## Citation If you find this project useful in your research, please consider citing: ```BibTeX @article{chen2023internvl, title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks}, author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng}, journal={arXiv preprint arXiv:2312.14238}, year={2023} } @article{chen2024far, title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites}, author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others}, journal={arXiv preprint arXiv:2404.16821}, year={2024} } ```