openbmb/MiniCPM-V-2_6

Use lazy import for flash_attn #12

Opened by HwwwH on Aug 12

base: refs/heads/main ← from: refs/pr/12

This PR is in draft mode
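The PR title refers to importing `flash_attn` lazily, i.e. only when a code path actually needs it, so the package can stay optional. The only file this page lists as changed is README.md, so what follows is a generic sketch of the pattern, not the PR's code. `is_available`, `lazy_import` and `attention_backend` are hypothetical names, and the returned backend strings are assumed to mirror the `attn_implementation` values used by transformers.

```python
import importlib
import importlib.util
from functools import lru_cache
from types import ModuleType


def is_available(name: str) -> bool:
    """Return True if top-level package `name` is installed, without importing it."""
    return importlib.util.find_spec(name) is not None


@lru_cache(maxsize=None)
def lazy_import(name: str) -> ModuleType:
    """Import `name` on first use and cache the module object.

    Failed lookups raise ImportError and are not cached, so code paths
    that never call this function never require the package.
    """
    if not is_available(name):
        raise ImportError(f"{name} is required for this code path but is not installed")
    return importlib.import_module(name)


def attention_backend() -> str:
    # Hypothetical dispatch: prefer flash_attn when installed, else fall back.
    if is_available("flash_attn"):
        lazy_import("flash_attn")  # the heavy import happens only here
        return "flash_attention_2"
    return "sdpa"
```

Because `importlib.util.find_spec` locates a top-level package without executing it, the availability check stays cheap; the real import runs once, on first use, and `lru_cache` hands back the same module object afterwards.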

Files changed (1)

README.md +3 -14

```diff
@@ -1,17 +1,7 @@
 ---
-pipeline_tag: image-text-to-text
+pipeline_tag: visual-question-answering
 datasets:
 - openbmb/RLAIF-V-Dataset
-library_name: transformers
-language:
-- multilingual
-tags:
-- minicpm-v
-- vision
-- ocr
-- multi-image
-- video
-- custom_code
 ---
 <h1>A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone</h1>
@@ -73,8 +63,7 @@ Note: For proprietary models, we calculate token density based on the image enco
 <summary>Click to view video results on Video-MME and Video-ChatGPT.</summary>
 <div align="center">
-<!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/_T1mw5yhqNCqVdYRTQOGu.png) -->
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/jmrjoRr8SFLkrstjDmpaV.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/_T1mw5yhqNCqVdYRTQOGu.png)
 </div>
@@ -281,7 +270,7 @@ def encode_video(video_path):
     print('num frames:', len(frames))
     return frames
-video_path ="video_test.mp4"
+video_path="video_test.mp4"
 frames = encode_video(video_path)
 question = "Describe the video"
 msgs = [
```
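The first hunk trims the model card's YAML front matter. As an illustration only, the sketch below feeds both sides of that hunk to a deliberately minimal, line-based reader (`top_level_keys` is a hypothetical helper and is not a YAML parser) and reports which top-level keys the change removes and which it rewrites.

```python
# Front matter as it appears on each side of the first hunk (copied from the diff).
OLD_FRONT_MATTER = """\
---
pipeline_tag: image-text-to-text
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-v
- vision
- ocr
- multi-image
- video
- custom_code
---
"""

NEW_FRONT_MATTER = """\
---
pipeline_tag: visual-question-answering
datasets:
- openbmb/RLAIF-V-Dataset
---
"""


def top_level_keys(front_matter: str) -> dict[str, str]:
    """Map each top-level key to its inline value ('' for block lists).

    Skips blank lines, the '---' fences and '- item' list entries; this only
    works because the front matter above uses flat keys and unindented lists.
    """
    keys: dict[str, str] = {}
    for line in front_matter.splitlines():
        if not line or line == "---" or line.startswith(("-", " ")):
            continue
        key, _, value = line.partition(":")
        keys[key.strip()] = value.strip()
    return keys


old = top_level_keys(OLD_FRONT_MATTER)
new = top_level_keys(NEW_FRONT_MATTER)
removed = sorted(old.keys() - new.keys())
changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
print("removed:", removed)  # removed: ['language', 'library_name', 'tags']
print("changed:", changed)  # changed: {'pipeline_tag': ('image-text-to-text', 'visual-question-answering')}
```

For real model cards with nested or indented YAML, a proper YAML parser is the safer choice.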