---
license: apache-2.0
language:
- en
---

**VLE** (**V**isual-**L**anguage **E**ncoder) is an image-text multimodal understanding model built on pre-trained text and image encoders.

It can be used for multimodal discriminative tasks such as visual question answering and image-text retrieval.

VLE achieves significant improvements in particular on the visual commonsense reasoning (VCR) task, which requires high-level language understanding and reasoning skills.

For more details, see [https://github.com/iflytek/VLE](https://github.com/iflytek/VLE).

Online VLE demo for visual question answering: [https://huggingface.co/spaces/hfl/VQA_VLE_LLM](https://huggingface.co/spaces/hfl/VQA_VLE_LLM)