---
license: mit
pipeline_tag: visual-question-answering
---

# InternVL2-4B

[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)

[\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Chinese Explainer\]](https://zhuanlan.zhihu.com/p/675877376)

## Introduction

We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of instruction-tuned models, ranging from 2 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-4B model.

InternVL 2.0 surpasses most state-of-the-art open-source multimodal large language models and is competitive with proprietary commercial models across a range of capabilities, including document and chart comprehension, infographics QA, scene text understanding and OCR, scientific and mathematical problem solving, and cultural understanding and integrated multimodal tasks.

InternVL 2.0 is trained with an 8k context window on data that includes long texts, multiple images, and videos, which significantly improves its handling of these input types compared to InternVL 1.5. For more details, please refer to our [blog](https://internvl.github.io/blog/) and GitHub.

## Model Details

InternVL2 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. InternVL2-4B consists of [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px), an MLP projector, and [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct).

## Performance

|          Benchmark           | PaliGemma-3B | Phi-3-Vision | Mini-InternVL-4B-1.5 | InternVL2-4B |
| :--------------------------: | :----------: | :----------: | :------------------: | :----------: |
|          Model Size          |     2.9B     |     4.2B     |         4.2B         |     4.2B     |
|                              |              |              |                      |              |
|    DocVQA<sub>test</sub>     |      -       |      -       |         87.7         |     89.2     |
|    ChartQA<sub>test</sub>    |      -       |     81.4     |         81.0         |     81.5     |
|    InfoVQA<sub>test</sub>    |      -       |      -       |         64.6         |     67.0     |
|    TextVQA<sub>val</sub>     |     68.1     |     70.9     |         72.5         |     74.4     |
|           OCRBench           |     614      |     639      |         638          |     788      |
|      MME<sub>sum</sub>       |    1686.1    |    1508.0    |        2053.6        |    2064.1    |
|         RealWorldQA          |     55.2     |     58.8     |         60.1         |     60.7     |
|     AI2D<sub>test</sub>      |     68.3     |     76.7     |         76.9         |     78.9     |
|      MMMU<sub>val</sub>      |     34.9     |     40.4     |         43.3         |     47.0     |
|  MMBench-EN<sub>test</sub>   |     71.0     |     73.6     |         76.2         |     78.6     |
|  MMBench-CN<sub>test</sub>   |     63.6     |      -       |         70.3         |     73.9     |
|    CCBench<sub>dev</sub>     |     29.6     |     24.1     |         58.8         |     66.5     |
|  MMVet<sub>GPT-4-0613</sub>  |     33.1     |      -       |         46.7         |     55.7     |
|          SEED-Image          |     69.6     |     70.9     |         72.5         |     73.7     |
|   HallBench<sub>avg</sub>    |     32.2     |     39.0     |         42.8         |     41.9     |
| MathVista<sub>testmini</sub> |     28.7     |     44.5     |         53.7         |     58.6     |

- We use both the InternVL and VLMEvalKit repositories for model evaluation. DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were evaluated with the InternVL repository; MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated with VLMEvalKit.

- Evaluating the same model with different toolkits, such as InternVL and VLMEvalKit, can produce slightly different results, which is normal. Code version updates and differences in environment and hardware can also cause minor discrepancies.

- The MMVet scores we report use GPT-4-0613 as the judge model. Different versions of GPT-4 can change the scores on this dataset significantly; for instance, using GPT-4-Turbo yields significantly lower scores.

## Quick Start

We provide example code to run InternVL2-4B using `transformers`.

> Please use transformers==4.37.2 to ensure the model works normally.

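
Before loading the model, it can help to confirm that the pinned version is the one actually installed. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `check_pin` are illustrative and not part of `transformers` or this repository.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED_TRANSFORMERS = "4.37.2"  # version pinned in the note above


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pin(package, pinned):
    """Return a human-readable status comparing the installed version to the pin."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed; expected {pinned}"
    if found != pinned:
        return f"{package} {found} is installed; expected {pinned}"
    return f"{package} {pinned} matches the pin"


print(check_pin("transformers", PINNED_TRANSFORMERS))
```

If the check reports a mismatch, installing the pinned release with `pip install transformers==4.37.2` should bring the environment in line with the note above.
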
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    # convert to RGB, resize to a square tile, and normalize with ImageNet statistics
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # on a tie, prefer the larger grid only if the image is big enough to fill it
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    # calculate the existing image aspect ratio
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate (columns, rows) grids with min_num <= columns * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    # append a thumbnail of the whole image when it was split into more than one tile
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


path = 'OpenGVLab/InternVL2-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()

generation_config = dict(
    num_beams=1,
    max_new_tokens=1024,
    do_sample=False,
)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# multi-image multi-round conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}')
    print(f'Assistant: {response}')
```
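
To get a feel for how many tiles `load_image` produces for a given image size (the first dimension of `pixel_values`, which is what `num_patches_list` records), the grid selection can be reproduced without loading the model or any image. The sketch below mirrors the logic of `find_closest_aspect_ratio` and `dynamic_preprocess` above in plain Python. The helper names `candidate_grids`, `choose_grid`, and `tile_count` are illustrative and not part of the repository, and candidates are built as a list rather than a set so the iteration order is deterministic.

```python
def candidate_grids(min_num=1, max_num=6):
    """Enumerate (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = [
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    ]
    # smaller grids first, as in the sorted() call in dynamic_preprocess
    return sorted(grids, key=lambda g: g[0] * g[1])


def choose_grid(width, height, min_num=1, max_num=6, image_size=448):
    """Pick the grid whose aspect ratio is closest to the image's aspect ratio."""
    aspect_ratio = width / height
    best_diff = float("inf")
    best = (1, 1)
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff:
            # tie: take the larger grid only if the image covers more than half of it
            if width * height > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best


def tile_count(width, height, max_num=6, use_thumbnail=True):
    """Number of tiles produced, including the thumbnail added for multi-tile images."""
    cols, rows = choose_grid(width, height, max_num=max_num)
    tiles = cols * rows
    if use_thumbnail and tiles != 1:
        tiles += 1
    return tiles


for size in [(448, 448), (896, 896), (1600, 1200)]:
    print(size, choose_grid(*size), tile_count(*size))
# (448, 448) (1, 1) 1
# (896, 896) (2, 2) 5
# (1600, 1200) (3, 2) 7
```

With `max_num=6`, a 1600x1200 image is resized to a 3x2 grid of 448-pixel tiles and gains one thumbnail, so it contributes 7 entries along the first dimension of `pixel_values`. Lowering `max_num` reduces the tile count at the cost of resolution.
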

## License

This project is released under the MIT license.

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
```