Upload folder using huggingface_hub

README.md CHANGED
@@ -29,23 +29,23 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes
| :--------------------------: | :-------------: | :------------: | :-----------: | :------------------: |
| Model Size | - | - | 40B | 76B |
| | | | | |
-| DocVQA<sub>test</sub> | 87.2 | 86.5 | 93.9 |
+| DocVQA<sub>test</sub> | 87.2 | 86.5 | 93.9 | 94.1 |
| ChartQA<sub>test</sub> | 78.1 | 81.3 | 86.2 | 88.4 |
| InfoVQA<sub>test</sub> | - | 72.7 | 78.7 | 82.0 |
| TextVQA<sub>val</sub> | - | 73.5 | 83.0 | 84.4 |
-| OCRBench | 678 | 754 | 837 |
+| OCRBench | 678 | 754 | 837 | 839 |
| MME<sub>sum</sub> | 2070.2 | 2110.6 | 2315.0 | 2414.7 |
-| RealWorldQA | 68.0 | 67.5 | 71.8 |
+| RealWorldQA | 68.0 | 67.5 | 71.8 | 72.2 |
| AI2D<sub>test</sub> | 89.4 | 80.3 | 87.1 | 87.6 |
-| MMMU<sub>val</sub> |
+| MMMU<sub>val</sub> | 63.1 / 61.7 | 58.5 / 60.6 | 53.9 / 55.2 | 55.2 / 58.2 |
| MMBench-EN<sub>test</sub> | 81.0 | 73.9 | 86.8 | 86.5 |
| MMBench-CN<sub>test</sub> | 80.2 | 73.8 | 86.5 | 86.3 |
| CCBench<sub>dev</sub> | 57.3 | 28.4 | 80.6 | 81.0 |
| MMVet<sub>GPT-4-0613</sub> | - | - | 68.5 | 69.8 |
-| MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 65.5 |
+| MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 65.5 | 65.7 |
| SEED-Image | - | - | 78.2 | 78.2 |
-| HallBench<sub>avg</sub> | 43.9 | 45.6 | 56.9 |
+| HallBench<sub>avg</sub> | 43.9 | 45.6 | 56.9 | 55.2 |
-| MathVista<sub>testmini</sub> | 58.1 | 57.7 | 63.7 |
+| MathVista<sub>testmini</sub> | 58.1 | 57.7 | 63.7 | 65.5 |

- We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.

@@ -59,7 +59,7 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes
| :------------------: | :----: | :------: | :--------------: | :-----------: | :------------------: |
| Model Size | - | 34B | 34B | 40B | 76B |
| | | | | | |
-| MVBench | - | - | - | 72.5 |
+| MVBench | - | - | - | 72.5 | 69.6 |
| Video-MME<br>wo subs | 59.9 | 59.0 | 52.0 | TODO | TODO |
| Video-MME<br>w/ subs | 63.3 | 59.4 | 54.9 | TODO | TODO |

@@ -76,6 +76,7 @@ We also welcome you to experience the InternVL2 series models in our [online dem
> Please use transformers==4.37.2 to ensure the model works normally.

```python
+import math
import numpy as np
import torch
import torchvision.transforms as T
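
The pin above matters: the card asks for transformers==4.37.2 before running the quick-start code. A minimal guard that fails fast on a mismatched environment might look like the following (an illustrative sketch, not part of the card's quick-start block):

```python
import transformers

# The card asks for transformers==4.37.2; stop early if the installed version differs.
assert transformers.__version__ == '4.37.2', (
    f'expected transformers==4.37.2, found {transformers.__version__}')
```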
@@ -163,17 +164,44 @@ def load_image(image_file, input_size=448, max_num=6):
    return pixel_values


+def split_model(model_name):
+    device_map = {}
+    world_size = torch.cuda.device_count()
+    num_layers = {'InternVL2-8B': 32, 'InternVL2-26B': 48,
+                  'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
+    # Since the first GPU will be used for ViT, treat it as half a GPU.
+    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
+    num_layers_per_gpu = [num_layers_per_gpu] * world_size
+    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
+    layer_cnt = 0
+    for i, num_layer in enumerate(num_layers_per_gpu):
+        for j in range(num_layer):
+            device_map[f'language_model.model.layers.{layer_cnt}'] = i
+            layer_cnt += 1
+    device_map['vision_model'] = 0
+    device_map['mlp1'] = 0
+    device_map['language_model.model.tok_embeddings'] = 0
+    device_map['language_model.model.embed_tokens'] = 0
+    device_map['language_model.output'] = 0
+    device_map['language_model.model.norm'] = 0
+    device_map['language_model.lm_head'] = 0
+    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
+
+    return device_map
+
+
path = 'OpenGVLab/InternVL2-Llama3-76B'
-
-
-
+device_map = split_model('InternVL2-Llama3-76B')
+print(device_map)
+# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
+# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
+    load_in_8bit=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
-    device_map=
-
+    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
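
The split_model helper introduced in this hunk builds a Hugging Face device_map by hand: the decoder layers are spread over all visible GPUs, while GPU 0 keeps the vision encoder, the mlp1 projector, the embeddings, the final norm, the output head, and the last decoder layer, and is therefore treated as only half a GPU when layers are divided. A stand-alone sketch of that arithmetic for a hypothetical 8-GPU node (the GPU count is an assumption for illustration, not something the card fixes):

```python
import math

# Reproduce the layer-split arithmetic from split_model: 80 decoder layers, 8 GPUs.
num_layers, world_size = 80, 8                # InternVL2-Llama3-76B has 80 decoder layers
per_gpu = [math.ceil(num_layers / (world_size - 0.5))] * world_size
per_gpu[0] = math.ceil(per_gpu[0] * 0.5)      # GPU 0 also hosts the ViT, so it takes ~half a share
print(per_gpu)  # [6, 11, 11, 11, 11, 11, 11, 11] -> layers 0-5 on GPU 0, 6-16 on GPU 1, ...
```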
@@ -317,6 +345,10 @@ print(f'User: {question}')
print(f'Assistant: {response}')
```

+## Finetune
+
+SWIFT from the ModelScope community supports fine-tuning (image/video) of InternVL; please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
+
## Deployment

### LMDeploy
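
This hunk adds the Finetune pointer and only shows the Deployment headings as context, so the LMDeploy example itself is not part of the diff. For orientation, serving an InternVL2 checkpoint through LMDeploy's pipeline API generally follows the pattern below; the session_len and tensor-parallel degree tp are placeholder values rather than settings taken from the card:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Placeholder engine settings; size `tp` to the GPUs that are actually available.
pipe = pipeline('OpenGVLab/InternVL2-Llama3-76B',
                backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

image = load_image('./examples/image1.jpg')
response = pipe(('describe this image', image))
print(response.text)
```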
@@ -374,23 +406,23 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes
| :--------------------------: | :-------------: | :------------: | :-----------: | :------------------: |
| Model Size | - | - | 40B | 76B |
| | | | | |
-| DocVQA<sub>test</sub> | 87.2 | 86.5 | 93.9 |
+| DocVQA<sub>test</sub> | 87.2 | 86.5 | 93.9 | 94.1 |
-| ChartQA<sub>test</sub> | 78.1 | 81.3 | 86.2 |
+| ChartQA<sub>test</sub> | 78.1 | 81.3 | 86.2 | 88.4 |
-| InfoVQA<sub>test</sub> | - | 72.7 | 78.7 |
+| InfoVQA<sub>test</sub> | - | 72.7 | 78.7 | 82.0 |
-| TextVQA<sub>val</sub> | - | 73.5 | 83.0 |
+| TextVQA<sub>val</sub> | - | 73.5 | 83.0 | 84.4 |
-| OCRBench | 678 | 754 | 837 |
+| OCRBench | 678 | 754 | 837 | 839 |
-| MME<sub>sum</sub> | 2070.2 | 2110.6 | 2315.0 |
+| MME<sub>sum</sub> | 2070.2 | 2110.6 | 2315.0 | 2414.7 |
-| RealWorldQA | 68.0 | 67.5 | 71.8 |
+| RealWorldQA | 68.0 | 67.5 | 71.8 | 72.2 |
-| AI2D<sub>test</sub> | 89.4 | 80.3 | 87.1 |
+| AI2D<sub>test</sub> | 89.4 | 80.3 | 87.1 | 87.6 |
-| MMMU<sub>val</sub> |
+| MMMU<sub>val</sub> | 63.1 / 61.7 | 58.5 / 60.6 | 53.9 / 55.2 | 55.2 / 58.2 |
-| MMBench-EN<sub>test</sub> | 81.0 | 73.9 | 86.8 |
+| MMBench-EN<sub>test</sub> | 81.0 | 73.9 | 86.8 | 86.5 |
-| MMBench-CN<sub>test</sub> | 80.2 | 73.8 | 86.5 |
+| MMBench-CN<sub>test</sub> | 80.2 | 73.8 | 86.5 | 86.3 |
-| CCBench<sub>dev</sub> | 57.3 | 28.4 | 80.6 |
+| CCBench<sub>dev</sub> | 57.3 | 28.4 | 80.6 | 81.0 |
-| MMVet<sub>GPT-4-0613</sub> | - | - | 68.5 |
+| MMVet<sub>GPT-4-0613</sub> | - | - | 68.5 | 69.8 |
-| MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 65.5 |
+| MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 65.5 | 65.7 |
-| SEED-Image | - | - | 78.2 |
+| SEED-Image | - | - | 78.2 | 78.2 |
-| HallBench<sub>avg</sub> | 43.9 | 45.6 | 56.9 |
+| HallBench<sub>avg</sub> | 43.9 | 45.6 | 56.9 | 55.2 |
-| MathVista<sub>testmini</sub> | 58.1 | 57.7 | 63.7 |
+| MathVista<sub>testmini</sub> | 58.1 | 57.7 | 63.7 | 65.5 |

- We simultaneously use the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository, while MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.

@@ -404,7 +436,7 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes
| :------------------: | :----: | :------: | :--------------: | :-----------: | :------------------: |
| Model Size | - | 34B | 34B | 40B | 76B |
| | | | | | |
-| MVBench | - | - | - | 72.5 |
+| MVBench | - | - | - | 72.5 | 69.6 |
| Video-MME<br>wo subs | 59.9 | 59.0 | 52.0 | TODO | TODO |
| Video-MME<br>w/ subs | 63.3 | 59.4 | 54.9 | TODO | TODO |

@@ -422,6 +454,10 @@ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes

For example code, please [click here](#quick-start).

+## Finetune
+
+SWIFT from the ModelScope community supports fine-tuning (image/video) of InternVL; please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
+
## Deployment

### LMDeploy