microsoft
/

OmniParser

Image-Text-to-Text

visual-question-answering

Inference Endpoints

Model card Files Files and versions Community

adamlu1 commited on 5 days ago

Commit

7f4732c

•

1 Parent(s): e106ecb

update demo

Files changed (1) hide show

README.md +4 -1

README.md CHANGED Viewed

@@ -3,7 +3,7 @@ library_name: transformers
 license: mit
 pipeline_tag: image-text-to-text
 ---
-📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/)]
 # Model Summary
 OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent.
@@ -21,4 +21,7 @@ This model hub includes a finetuned version of YOLOv8 and a finetuned BLIP-2 mod
 - While OmniParser only converts screenshot image into texts, it can be used to construct an GUI agent based on LLMs that is actionable. When developing and operating the agent using OmniParser, the developers need to be responsible and follow common safety standard.
 - For OmniPaser-BLIP2, it may incorrectly infer the gender or other sensitive attribute (e.g., race, religion etc.) of individuals in icon images. Inference of sensitive attributes may rely upon stereotypes and generalizations rather than information about specific individuals and are more likely to be incorrect for marginalized people. Incorrect inferences may result in significant physical or psychological injury or restrict, infringe upon or undermine the ability to realize an individual’s human rights. We do not recommend use of OmniParser in any workplace-like use case scenario.

 license: mit
 pipeline_tag: image-text-to-text
 ---
+📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/)] [[Demo](https://huggingface.co/spaces/microsoft/OmniParser/)]
 # Model Summary
 OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent.
 - While OmniParser only converts screenshot image into texts, it can be used to construct an GUI agent based on LLMs that is actionable. When developing and operating the agent using OmniParser, the developers need to be responsible and follow common safety standard.
 - For OmniPaser-BLIP2, it may incorrectly infer the gender or other sensitive attribute (e.g., race, religion etc.) of individuals in icon images. Inference of sensitive attributes may rely upon stereotypes and generalizations rather than information about specific individuals and are more likely to be incorrect for marginalized people. Incorrect inferences may result in significant physical or psychological injury or restrict, infringe upon or undermine the ability to realize an individual’s human rights. We do not recommend use of OmniParser in any workplace-like use case scenario.
+# License
+Please note that icon_detect model is under AGPL license, and icon_caption_blip2 & icon_caption_florence is under MIT license. Please refer to the LICENSE file in the folder of each model.