Abstract
Traditional OCR systems (OCR-1.0) are increasingly unable to meet users' needs given the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain text, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory, along with an excellent model named GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified, elegant, end-to-end model consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all of the above "characters" across various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in both slice and whole-page formats. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via a simple prompt. In addition, the model offers interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we adapt dynamic-resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to demonstrate the superiority of our model.
Community
The OCR-2.0 era is coming.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding (2024)
- Decoder Pre-Training with only Text for Scene Text Recognition (2024)
- Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models (2024)
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model (2024)
- Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (2024)
Wow congrats!
The output is in LaTeX, right? Are there alternative options?
Best.
Looks interesting
Which languages does it support?
You've linked the wrong account. The Lingyu Kong you're currently linked to, which is me, was not involved in this paper... But anyway, it is interesting work.
Models citing this paper: 13
Datasets citing this paper: 0