mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Abstract
Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory use and slower inference, particularly in multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension and balance token efficiency against question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
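As a rough illustration of the compression idea the abstract describes, the sketch below shows a cross-attention compressor in PyTorch where 324 low-resolution global tokens act as queries over the concatenated high-resolution crop features. This is a minimal sketch under stated assumptions, not the paper's implementation: the class name, layer count, hidden size, and number of crops are all illustrative.

```python
import torch
import torch.nn as nn

class DocCompressorSketch(nn.Module):
    """Hedged sketch of a DocCompressor-style module: compress many
    high-resolution visual tokens into a fixed 324 tokens by letting
    low-resolution global features act as cross-attention queries.
    Dimensions and layer count are illustrative assumptions, not the
    paper's exact configuration."""

    def __init__(self, dim: int = 1024, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feats: torch.Tensor, highres_feats: torch.Tensor) -> torch.Tensor:
        # global_feats:  (B, 324, dim) low-resolution global tokens (queries)
        # highres_feats: (B, N, dim)   concatenated high-res crop tokens (keys/values)
        query = global_feats
        for attn in self.layers:
            out, _ = attn(query, highres_feats, highres_feats)
            query = self.norm(query + out)  # residual connection + norm
        return query  # (B, 324, dim): fixed-size compressed page representation

# Illustrative shapes: one page, 324 global tokens, 9 crops of 324 tokens each.
compressor = DocCompressorSketch()
compressed = compressor(torch.randn(1, 324, 1024), torch.randn(1, 9 * 324, 1024))
assert compressed.shape == (1, 324, 1024)
```

Regardless of how many high-resolution crops a page produces, the output stays at 324 tokens per page, which is what keeps multi-page inputs within GPU memory and lowers first-token latency.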
Community
@xhyandwyy
Is it intentional that the links to the mPLUG-DocOwl2 model on the GitHub page are empty? If yes, when can we expect the release? Looks like great work!
Our models will be released around 14th Sep~
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models (2024)
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models (2024)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding (2024)
- HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments (2024)
- FlexAttention for Efficient High-Resolution Vision-Language Models (2024)