czczup committed on
Commit
58ffd82
1 Parent(s): 465a220

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +14 -5
README.md CHANGED
@@ -11,7 +11,7 @@ pipeline_tag: visual-question-answering
11
 
12
  ## Introduction
13
 
14
- We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of instruction-tuned models, ranging from 2 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-4B model.
15
 
16
  Compared to the state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It demonstrates competitive performance on par with proprietary commercial models across various capabilities, including document and chart comprehension, infographics QA, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities.
17
 
@@ -23,6 +23,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
23
 
24
  ## Performance
25
 
 
 
26
  | Benchmark | PaliGemma-3B | Phi-3-Vision | Mini-InternVL-4B-1.5 | InternVL2-4B |
27
  | :--------------------------: | :----------: | :----------: | :------------------: | :----------: |
28
  | Model Size | 2.9B | 4.2B | 4.2B | 4.2B |
@@ -53,6 +55,10 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
53
 
54
  Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
55
 
 
 
 
 
56
  ## Quick Start
57
 
58
  We provide an example code to run InternVL2-4B using `transformers`.
@@ -261,9 +267,10 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
261
 
262
  video_path = './examples/red-panda.mp4'
263
  # pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
264
- pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=2)
265
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
266
- video_prefix = '\n'.join([f'Frame{i+1}: <image>' for i in range(len(num_patches_list))]) + '\n'question = video_prefix + 'What is the red panda doing?'
 
267
  # Frame1: <image>\nFrame2: <image>\n...\nFrame31: <image>\n{question}
268
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
269
  num_patches_list=num_patches_list,
@@ -291,7 +298,7 @@ pip install lmdeploy
291
 
292
  You can run batch inference locally with the following python code:
293
 
294
- > This model is not yet supported by LMDeploy.
295
 
296
  ```python
297
  from lmdeploy.vl import load_image
@@ -332,7 +339,7 @@ If you find this project useful in your research, please consider citing:
332
 
333
  ## Introduction
334
 
335
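- We are pleased to announce the release of InternVL 2.0, the latest version of the InternVL series of multimodal large language models. InternVL 2.0 offers a variety of instruction-tuned models, with parameter counts ranging from 2 billion to 108 billion. This repository contains the instruction-tuned InternVL2-4B model.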
336
 
337
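  Compared with the most advanced open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It is competitive with closed-source commercial models across a range of capabilities, including document and chart understanding, infographic question answering, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities.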
338
 
@@ -344,6 +351,8 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
344
 
345
  ## Performance Evaluation
346
 
 
 
347
  | Benchmark | PaliGemma-3B | Phi-3-Vision | Mini-InternVL-4B-1.5 | InternVL2-4B |
348
  | :--------------------------: | :----------: | :----------: | :------------------: | :----------: |
349
  | Model Size | 2.9B | 4.2B | 4.2B | 4.2B |
 
11
 
12
  ## Introduction
13
 
14
+ We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of **instruction-tuned models**, ranging from 2 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-4B model.
15
 
16
  Compared to the state-of-the-art open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It demonstrates competitive performance on par with proprietary commercial models across various capabilities, including document and chart comprehension, infographics QA, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities.
17
 
 
23
 
24
  ## Performance
25
 
26
+ ### Image Benchmarks
27
+
28
  | Benchmark | PaliGemma-3B | Phi-3-Vision | Mini-InternVL-4B-1.5 | InternVL2-4B |
29
  | :--------------------------: | :----------: | :----------: | :------------------: | :----------: |
30
  | Model Size | 2.9B | 4.2B | 4.2B | 4.2B |
 
55
 
56
  Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
57
 
58
+ ### Video Benchmarks
59
+
60
+ TBD
61
+
62
  ## Quick Start
63
 
64
  We provide an example code to run InternVL2-4B using `transformers`.
 
267
 
268
  video_path = './examples/red-panda.mp4'
269
  # pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
270
+ pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
271
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
272
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
273
+ question = video_prefix + 'What is the red panda doing?'
274
  # Frame1: <image>\nFrame2: <image>\n...\nFrame31: <image>\n{question}
275
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
276
  num_patches_list=num_patches_list,
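
The prompt construction in the added lines above can be sketched in isolation. This is a minimal illustration, not the README's own code: the helper name `build_video_question` is hypothetical, and the stand-in `num_patches_list` assumes `load_video(..., num_segments=8, max_num=1)` returns one entry per sampled frame.

```python
def build_video_question(num_patches_list, text):
    """Prefix `text` with one 'FrameN: <image>' line per sampled frame.

    Hypothetical helper for illustration; the README inlines these two steps.
    """
    # One placeholder line per frame, each terminated by a newline.
    video_prefix = ''.join(
        f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))
    )
    return video_prefix + text


# Stand-in for the list returned by load_video (assumed: one entry per frame).
num_patches_list = [1] * 8
question = build_video_question(num_patches_list, 'What is the red panda doing?')
print(question)
```

The number of `<image>` placeholders equals `len(num_patches_list)`, so the prompt stays in step with the frames actually loaded whenever `num_segments` changes.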
 
298
 
299
  You can run batch inference locally with the following python code:
300
 
301
+ > Warning: This model is not yet supported by LMDeploy.
302
 
303
  ```python
304
  from lmdeploy.vl import load_image
 
339
 
340
  ## Introduction
341
 
342
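+ We are pleased to announce the release of InternVL 2.0, the latest version of the InternVL series of multimodal large language models. InternVL 2.0 offers a variety of **instruction-tuned** models, with parameter counts ranging from 2 billion to 108 billion. This repository contains the instruction-tuned InternVL2-4B model.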
343
 
344
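  Compared with the most advanced open-source multimodal large language models, InternVL 2.0 surpasses most open-source models. It is competitive with closed-source commercial models across a range of capabilities, including document and chart understanding, infographic question answering, scene text understanding and OCR tasks, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal capabilities.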
345
 
 
351
 
352
  ## Performance Evaluation
353
 
354
+ ### Image Benchmarks
355
+
356
  | Benchmark | PaliGemma-3B | Phi-3-Vision | Mini-InternVL-4B-1.5 | InternVL2-4B |
357
  | :--------------------------: | :----------: | :----------: | :------------------: | :----------: |
358
  | Model Size | 2.9B | 4.2B | 4.2B | 4.2B |