czczup committed on
Commit cf1cb63
1 Parent(s): b50544d

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +28 -91
  2. modeling_intern_vit.py +6 -12
README.md CHANGED
@@ -62,6 +62,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
 | MathVista<sub>testmini</sub> | 28.7 | 44.5 | 53.7 | 58.6 |
 | OpenCompass<sub>avg</sub> | 46.6 | 53.6 | 56.2 | 60.6 |
 
+ - For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
+
 - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
 
 - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
@@ -300,7 +302,7 @@ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast
 
 # set the max number of tiles in `max_num`
 pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
- generation_config = dict(max_new_tokens=1024, do_sample=False)
+ generation_config = dict(max_new_tokens=1024, do_sample=True)
 
 # pure-text conversation (纯文本对话)
 question = 'Hello, who are you?'
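The switch from `do_sample=False` to `do_sample=True` turns greedy decoding into sampling. The dict is handed by `model.chat` to the underlying Transformers `generate` call, so the usual sampling arguments can typically be added alongside it; a minimal sketch with illustrative values (the `temperature` and `top_p` numbers are assumptions, not part of this commit):

```python
# Sketch: generation_config is forwarded to the HF generate() call inside model.chat,
# so standard sampling knobs can sit next to do_sample=True.
# The values below are illustrative assumptions, not taken from the commit.
generation_config = dict(
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,   # assumed value: lower means more deterministic output
    top_p=0.9,         # assumed value: nucleus sampling cutoff
)
```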
@@ -452,7 +454,7 @@ for new_text in streamer:
 
 ## Finetune
 
- SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of InternVL, please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
+ Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTurner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
 
 ## Deployment
 
@@ -461,7 +463,7 @@ SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of I
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 
 ```sh
- pip install lmdeploy
+ pip install lmdeploy==0.5.3
 ```
 
 LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
@@ -469,16 +471,12 @@ LMDeploy abstracts the complex inference process of multi-modal Vision-Language
 #### A 'Hello, world' example
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
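The updated examples drop the explicit system prompt and chat template, relying on the template bundled with the model. If a custom `meta_instruction` is still needed, the removed lines suggest that `ChatTemplateConfig` can be combined with the new Turbomind backend in the same way; a minimal sketch under that assumption (the prompt text is a placeholder):

```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL2-4B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# Reuse the chat template name from the removed lines and override only the system prompt.
chat_template_config = ChatTemplateConfig('internvl-phi3')
chat_template_config.meta_instruction = 'You are InternVL, a helpful multimodal assistant.'  # placeholder prompt
pipe = pipeline(model,
                chat_template_config=chat_template_config,
                backend_config=TurbomindEngineConfig(session_len=8192))

response = pipe(('describe this image', image))
print(response.text)
```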
@@ -492,16 +490,12 @@ When dealing with multiple images, you can put them all in one list. Keep in min
 > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -519,15 +513,11 @@ print(response.text)
 Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -543,15 +533,11 @@ print(response)
 There are two ways to do the multi-turn conversations with the pipeline. One is to construct messages according to the format of OpenAI and use above introduced method, the other is to use the `pipeline.chat` interface.
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig, GenerationConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
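The paragraph in the hunk above names two ways to run multi-turn conversations, but only the `pipeline.chat` route appears in the diff context. A brief sketch of the other route, passing OpenAI-format messages straight to the pipeline; the message schema follows the standard OpenAI vision format and is an assumption here, not something shown in this commit:

```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig

pipe = pipeline('OpenGVLab/InternVL2-4B',
                backend_config=TurbomindEngineConfig(session_len=8192))
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)

# OpenAI-style messages: content is a list of text and image_url parts.
messages = [dict(role='user', content=[
    dict(type='text', text='describe this image'),
    dict(type='image_url',
         image_url=dict(url='https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')),
])]
response = pipe(messages, gen_config=gen_config)

# Append the assistant reply and a follow-up question to continue the same conversation.
messages.append(dict(role='assistant', content=response.text))
messages.append(dict(role='user', content='What is the woman doing?'))
response = pipe(messages, gen_config=gen_config)
print(response.text)
```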
@@ -563,20 +549,10 @@ print(sess.response.text)
 
 #### Service
 
- To deploy InternVL2 as an API, please configure the chat template config first. Create the following JSON file `chat_template.json`.
-
- ```json
- {
-     "model_name":"internlm2-phi3",
-     "meta_instruction":"我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
-     "stop_words":["<|end|>"]
- }
- ```
-
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below are an example of service startup:
 
 ```shell
- lmdeploy serve api_server OpenGVLab/InternVL2-4B --backend pytorch --server-port 23333 --chat-template chat_template.json
+ lmdeploy serve api_server OpenGVLab/InternVL2-4B --backend turbomind --server-port 23333
 ```
 
 To use the OpenAI-style interface, you need to install OpenAI:
@@ -613,14 +589,6 @@ response = client.chat.completions.create(
 print(response)
 ```
 
- ### vLLM
-
- TODO
-
- ### Ollama
-
- TODO
-
 ## License
 
 This project is released under the MIT license.
@@ -693,6 +661,8 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
 | MathVista<sub>testmini</sub> | 28.7 | 44.5 | 53.7 | 58.6 |
 | OpenCompass<sub>avg</sub> | 46.6 | 53.6 | 56.2 | 60.6 |
 
+ - 关于更多的细节以及评测复现,请看我们的[评测指南](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html)。
+
 - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说,DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
 
 - 对于MMMU,我们报告了原始分数(左侧:InternVL系列模型使用InternVL代码库评测,其他模型的分数来自其技术报告或网页)和VLMEvalKit分数(右侧:从OpenCompass排行榜收集)。
@@ -751,7 +721,7 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
 
 ## 微调
 
- 来自ModelScope社区的SWIFT已经支持对InternVL进行微调(图像/视频),详情请查看[此链接](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md)
+ 许多仓库现在都支持 InternVL 系列模型的微调,包括 [InternVL](https://github.com/OpenGVLab/InternVL)、[SWIFT](https://github.com/modelscope/ms-swift)、[XTurner](https://github.com/InternLM/xtuner) 等。请参阅它们的文档以获取更多微调细节。
 
 ## 部署
 
@@ -760,7 +730,7 @@ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模
 LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型(LLM)的工具包。
 
 ```sh
- pip install lmdeploy
+ pip install lmdeploy==0.5.3
 ```
 
 LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为一个易于使用的管道,类似于大语言模型(LLM)的推理管道。
@@ -768,16 +738,12 @@ LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为
 #### 一个“你好,世界”示例
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
@@ -789,16 +755,12 @@ print(response.text)
 在处理多张图像时,可以将它们全部放入一个列表中。请注意,多张图像会导致输入 token 数量增加,因此通常需要增加上下文窗口的大小。
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -806,6 +768,7 @@ image_urls=[
 ]
 
 images = [load_image(img_url) for img_url in image_urls]
+ # Numbering images improves multi-image conversations
 response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 print(response.text)
 ```
@@ -815,15 +778,11 @@ print(response.text)
 使用批量Prompt进行推理非常简单;只需将它们放在一个列表结构中:
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -839,15 +798,11 @@ print(response)
 使用管道进行多轮对话有两种方法。一种是根据 OpenAI 的格式构建消息并使用上述方法,另一种是使用 `pipeline.chat` 接口。
 
 ```python
- from lmdeploy import pipeline, PytorchEngineConfig, ChatTemplateConfig, GenerationConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-4B'
- system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
- chat_template_config = ChatTemplateConfig('internvl-phi3')
- chat_template_config.meta_instruction = system_prompt
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=PytorchEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
@@ -859,20 +814,10 @@ print(sess.response.text)
 
 #### API部署
 
- 为了将InternVL2部署成API,请先配置聊天模板配置文件。创建如下的 JSON 文件 `chat_template.json`。
-
- ```json
- {
-     "model_name":"internlm2-phi3",
-     "meta_instruction":"我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
-     "stop_words":["<|end|>"]
- }
- ```
-
 LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例:
 
 ```shell
- lmdeploy serve api_server OpenGVLab/InternVL2-4B --backend pytorch --server-port 23333 --chat-template chat_template.json
+ lmdeploy serve api_server OpenGVLab/InternVL2-4B --backend turbomind --server-port 23333
 ```
 
 为了使用OpenAI风格的API接口,您需要安装OpenAI:
@@ -909,14 +854,6 @@ response = client.chat.completions.create(
 print(response)
 ```
 
- ### vLLM
-
- TODO
-
- ### Ollama
-
- TODO
-
 ## 开源许可证
 
 该项目采用 MIT 许可证发布。
 
modeling_intern_vit.py CHANGED
@@ -20,18 +20,12 @@ from transformers.utils import logging
 from .configuration_intern_vit import InternVisionConfig
 
 try:
-     try:  # v1
-         from flash_attn.flash_attn_interface import \
-             flash_attn_unpadded_qkvpacked_func
-     except:  # v2
-         from flash_attn.flash_attn_interface import \
-             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
-
     from flash_attn.bert_padding import pad_input, unpad_input
-
+     from flash_attn.flash_attn_interface import \
+         flash_attn_varlen_qkvpacked_func
     has_flash_attn = True
 except:
-     print('FlashAttention is not installed.')
+     print('FlashAttention2 is not installed.')
     has_flash_attn = False
 
 logger = logging.get_logger(__name__)
@@ -74,7 +68,7 @@ class FlashAttention(nn.Module):
             max_s = seqlen
             cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
                                       device=qkv.device)
-             output = flash_attn_unpadded_qkvpacked_func(
+             output = flash_attn_varlen_qkvpacked_func(
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
@@ -84,7 +78,7 @@
             x = rearrange(qkv, 'b s three h d -> b s (three h d)')
             x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
             x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
-             output_unpad = flash_attn_unpadded_qkvpacked_func(
+             output_unpad = flash_attn_varlen_qkvpacked_func(
                 x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
@@ -93,7 +87,7 @@
                                'b s (h d) -> b s h d', h=nheads)
         else:
             assert max_s is not None
-             output = flash_attn_unpadded_qkvpacked_func(
+             output = flash_attn_varlen_qkvpacked_func(
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
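The import change above drops the flash-attn v1 fallback and uses the v2 entry point `flash_attn_varlen_qkvpacked_func` directly. For environments that might still ship flash-attn v1, a guarded import in the spirit of the removed lines could look like this sketch (an assumption for older installs, not part of the commit):

```python
# Sketch: tolerate both flash-attn releases by aliasing the v1 name to the v2 one.
# Reconstructed from the removed lines; the committed code assumes flash-attn v2 only.
try:
    from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func  # v2
except ImportError:
    from flash_attn.flash_attn_interface import (
        flash_attn_unpadded_qkvpacked_func as flash_attn_varlen_qkvpacked_func,  # v1 name
    )
```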
 