VT5 base fine-tuned on SP-DocVQA

This is VT5 base fine-tuned on the Single-Page DocVQA (SP-DocVQA) dataset using the MP-DocVQA framework. VT5 is a version of the Hi-VT5 model described in the MP-DocVQA paper, arranged in a non-hierarchical paradigm (using only one page for each question-answer pair). Before fine-tuning, we start from pre-trained t5-base for the language backbone and pre-trained DiT-base to embed visual features. The DiT-base weights are kept frozen during fine-tuning.
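The frozen-encoder setup described above can be illustrated with a minimal PyTorch sketch. This is not the VT5 or MP-DocVQA code: `ToyVisualEncoder` and `ToyLanguageBackbone` are hypothetical stand-ins for DiT-base and t5-base, and the dimensions, the way the features are combined, and the learning rate are arbitrary assumptions chosen only to keep the example small. It shows one common way to keep a visual module frozen while the language backbone is trained.

```python
import torch
from torch import nn


class ToyVisualEncoder(nn.Module):
    """Hypothetical stand-in for the pre-trained DiT-base visual encoder."""

    def __init__(self, patch_dim: int = 48, hidden: int = 32):
        super().__init__()
        self.proj = nn.Linear(patch_dim, hidden)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches)


class ToyLanguageBackbone(nn.Module):
    """Hypothetical stand-in for the pre-trained t5-base backbone."""

    def __init__(self, hidden: int = 32, vocab: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, token_ids: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # Join text and visual embeddings along the sequence axis (an assumption
        # made for this sketch, not a description of the real model).
        seq = torch.cat([self.embed(token_ids), visual_feats], dim=1)
        return self.head(seq.mean(dim=1))


def freeze(module: nn.Module) -> None:
    """Disable gradients for every parameter and switch to eval mode."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()


visual = ToyVisualEncoder()
language = ToyLanguageBackbone()
freeze(visual)  # the visual encoder stays fixed during fine-tuning

# Hand only the trainable (language) parameters to the optimizer.
all_params = list(visual.parameters()) + list(language.parameters())
trainable = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4)  # learning rate is arbitrary

# One dummy training step on random data.
token_ids = torch.randint(0, 100, (2, 5))  # batch of 2, 5 tokens each
patches = torch.randn(2, 4, 48)            # batch of 2, 4 visual patches each
logits = language(token_ids, visual(patches))
loss = logits.sum()
loss.backward()
optimizer.step()
```

After the backward pass, the parameters of `visual` have no gradients, so the optimizer step only updates `language`.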

Please note that VT5 is not integrated into Hugging Face, so you must use the MP-DocVQA framework (WIP) or the PFL-DocVQA competition framework to run it.

This method is the base architecture for the PFL-DocVQA Competition, which will take place from the 1st of July to the 1st of November, 2023. If you are interested in Federated Learning and Differential Privacy, we invite you to have a look at the PFL-DocVQA Challenge and Competition held on these topics.