(BEiT-3) Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks

Official PyTorch implementation and pretrained models of BEiT-3.

The code and pretrained models of BEiT can be found here.

The code and pretrained models of BEiT v2 can be found here.

March, 2023: release the code and pretrained models of BEiT-3
March, 2023: BEiT-3 was accepted by CVPR 2023.
Sept 2022: release the code and pretrained models of BEiT v2
Aug 2022: release preprint Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Aug 2022: release preprint BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
June 2022: release preprint VL-BEiT: Generative Vision-Language Pretraining
March, 2022: add linear probe examples
January, 2022: BEiT was accepted by ICLR 2022 as an oral presentation (54 out of 3391).
August 2021: BEiT is on HuggingFace
July 2021: BEiT-large achieves state-of-the-art results on ADE20K (a big jump to 57.0 mIoU) for semantic segmentation.
July 2021: BEiT-large achieves state-of-the-art ImageNet top-1 accuracy (88.6%) under the setting without extra data other than ImageNet-22k.
July 2021: release the code and pretrained models of BEiT
June 2021: release preprint BEiT: BERT Pre-Training of Image Transformers

Pretrained models

We provide BEiT-3 weights pretrained on monomodal and multimodal data. Our large-size model outperforms previous large-size models across various vision-language and vision downstream tasks. All models were pretrained at 224x224 resolution.

Tips

For vision-language tasks that require deep fusion, we recommend using BEiT3-base and BEiT3-large.
For image-text retrieval or vision tasks, using BEiT3-base-itc and BEiT3-large-itc usually achieves better performance.

Download Checkpoints

Models pretrained on ImageNet-21k images, 160 GB text documents, and web-scale image-text pairs (collected from LAION-400M, English LAION-2B, COYO-700M, and CC15M).
- BEiT3-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #parameters: 276M
- BEiT3-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #parameters: 746M
Models obtained by performing image-text contrastive intermediate tuning on BEiT3-base and BEiT3-large.
- BEiT3-base-itc: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #parameters: 222M
- BEiT3-large-itc: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #parameters: 674M
Models obtained by adding in-domain image-text pairs (COCO and VG) and continuing to train BEiT3-base and BEiT3-large with masked data modeling. The in-domain models achieve better performance on the VQAv2 and NLVR2 tasks.
- BEiT3-base-indomain: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #parameters: 276M
- BEiT3-large-indomain: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #parameters: 746M
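
Because every checkpoint uses 16x16 patches, the number of image tokens grows quadratically with the input resolution. The sketch below works through that arithmetic for the 224x224 pretraining resolution and the larger fine-tuning resolutions listed later in this README. The helper name `num_patches` is illustrative and not part of the BEiT-3 code; it assumes square inputs whose side is a multiple of the patch size and does not count any special tokens.

```python
def num_patches(resolution: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    per_side = resolution // patch_size
    return per_side * per_side


# Pretraining resolution followed by the fine-tuning resolutions in this README.
for res in (224, 384, 480, 768):
    print(f"{res}x{res}: {num_patches(res)} patches")
```

For example, a 224x224 input yields 14 patches per side, or 196 patches in total.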

Text Tokenizer

beit3.spm is the sentencepiece model used for tokenizing texts.

from transformers import XLMRobertaTokenizer
# Replace the placeholder path with the location of your downloaded beit3.spm file.
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")
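
Once the tokenizer is loaded, text inputs are typically converted to fixed-length ID sequences with a padding mask. The following is a minimal sketch of that step, not the repository's own preprocessing code: the helper `encode_text` and its `max_len` default are assumptions made for illustration. It relies only on the standard Hugging Face tokenizer methods `tokenize` and `convert_tokens_to_ids` and the attributes `bos_token_id`, `eos_token_id`, and `pad_token_id`.

```python
from typing import List, Tuple


def encode_text(tokenizer, text: str, max_len: int = 64) -> Tuple[List[int], List[int]]:
    """Return (token_ids, padding_mask), both of length max_len.

    padding_mask is 0 for real tokens and 1 for padding positions.
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for the two special tokens")
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # Reserve two positions for the begin- and end-of-sentence tokens.
    ids = ids[: max_len - 2]
    ids = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
    n_pad = max_len - len(ids)
    padding_mask = [0] * len(ids) + [1] * n_pad
    ids = ids + [tokenizer.pad_token_id] * n_pad
    return ids, padding_mask
```

With the tokenizer created above, `encode_text(tokenizer, "a photo of a cat", max_len=16)` returns two lists of length 16.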

Architecture

We use Magneto with decoupled Multiway Transformer as the backbone architecture. Magneto offers better training stability and better performance across modalities (such as vision and language). The implementation is based on the torchscale package.
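
To make the Multiway idea more concrete: as described in the BEiT-3 paper, self-attention is shared across modalities while each token is sent to a modality-specific feed-forward expert. The sketch below illustrates only that routing step in plain Python. It is not the torchscale implementation; the function `multiway_ffn`, the `split_position` argument, and the toy experts are assumptions for illustration, and it further assumes that image tokens precede text tokens in the sequence.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def multiway_ffn(tokens: Sequence[Vector], split_position: int,
                 vision_expert: Expert, language_expert: Expert) -> List[Vector]:
    """Apply the vision expert to tokens[:split_position] and the
    language expert to the remaining tokens, preserving token order."""
    if not 0 <= split_position <= len(tokens):
        raise ValueError("split_position out of range")
    vision_out = [vision_expert(t) for t in tokens[:split_position]]
    language_out = [language_expert(t) for t in tokens[split_position:]]
    return vision_out + language_out


# Toy experts standing in for the two feed-forward networks.
def double(v: Vector) -> Vector:
    return [2.0 * x for x in v]


def negate(v: Vector) -> Vector:
    return [-x for x in v]


# Two image tokens followed by one text token.
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(multiway_ffn(sequence, 2, double, negate))
```

Setting `split_position` to the sequence length sends every token through the vision expert, which corresponds to a vision-only input in this simplified picture.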

Setup

Start a Docker container from the PyTorch image:

alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel bash

Clone the repo and install required packages:

git clone https://github.com/microsoft/unilm.git
cd unilm/beit3
pip install -r requirements.txt

Fine-tuning on ImageNet-1k (Image Classification)

The detailed instructions can be found at get_started_for_image_classification.md. We only use vision-related parameters for image classification fine-tuning.

initialized checkpoint	resolution	acc@1	acc@5	#params	weight
beit3_base_patch16_224	224x224	85.4	97.6	87M	link
beit3_base_indomain_patch16_224	224x224	85.4	97.6	87M	link
beit3_large_patch16_224	224x224	87.6	98.3	305M	link
beit3_large_indomain_patch16_224	224x224	87.5	98.3	305M	link

Fine-tuning on VQAv2 (Visual Question Answering)

The detailed instructions can be found at get_started_for_vqav2.md.

initialized checkpoint	resolution	augmented data	test-dev	test-std	#params	weight
beit3_base_patch16_224	480x480	-	77.65	-	228M	link
beit3_base_indomain_patch16_224	480x480	-	78.46	-	228M	link
beit3_large_patch16_224	480x480	-	81.85	-	683M	link
beit3_large_indomain_patch16_224	480x480	-	82.53	-	683M	link
beit3_large_indomain_patch16_224	768x768	VGQA	82.97	83.03	684M	link

Fine-tuning on NLVR2 (Visual Reasoning)

The detailed instructions can be found at get_started_for_nlvr2.md.

initialized checkpoint	resolution	dev	test-P	#params	weight
beit3_base_patch16_224	224x224	83.6	84.4	226M	link
beit3_base_indomain_patch16_224	224x224	84.6	85.3	226M	link
beit3_large_patch16_224	224x224	88.5	89.4	681M	link
beit3_large_indomain_patch16_224	224x224	89.2	90.0	681M	link

Fine-tuning on COCO Captioning and NoCaps (Image Captioning)

The detailed instructions can be found at get_started_for_image_captioning.md.

COCO Captioning

initialized checkpoint	resolution	test CIDEr	#params	weight
beit3_base_patch16_224	480x480	133.6	271M	link
beit3_base_indomain_patch16_224	480x480	135.0	271M	link
beit3_large_patch16_224	480x480	143.2	739M	link

NoCaps

initialized checkpoint	resolution	val CIDEr	#params	weight
beit3_base_patch16_224	480x480	104.4	271M	link
beit3_base_indomain_patch16_224	480x480	105.6	271M	link
beit3_large_patch16_224	480x480	120.2	739M	link

Fine-tuning on COCO and Flickr30k Retrieval (Image-Text Retrieval)

The detailed instructions can be found at get_started_for_retrieval.md.

COCO Retrieval

initialized checkpoint	resolution	IR@1	TR@1	#params	weight
beit3_base_itc_patch16_224	384x384	61.4	79.1	222M	link
beit3_large_itc_patch16_224	384x384	63.4	82.1	675M	link

Flickr30k Retrieval

initialized checkpoint	resolution	IR@1	TR@1	#params	weight
beit3_base_itc_patch16_224	384x384	86.2	96.3	222M	link
beit3_large_itc_patch16_224	384x384	88.1	97.2	675M	link
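
IR@1 and TR@1 in the tables above are recall-at-1 scores for image retrieval (text query, image gallery) and text retrieval (image query, text gallery). The sketch below shows how such a score can be computed from a similarity matrix. It is a simplified illustration, not the repository's evaluation code: the function `recall_at_k` is hypothetical, and it assumes each query comes with a known set of correct gallery indices.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                positives: Sequence[Set[int]], k: int = 1) -> float:
    """Percentage of queries with at least one correct item in the top k.

    similarity[i][j] scores query i against gallery item j;
    positives[i] holds the gallery indices that are correct for query i.
    """
    hits = 0
    for scores, correct in zip(similarity, positives):
        # Rank gallery indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Three text queries scored against three images (image retrieval).
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.3, 0.8],
       [0.1, 0.2, 0.7]]
gold = [{0}, {1}, {2}]
print(recall_at_k(sim, gold, k=1))
```

Using a set of positives per query covers the case where one image has several matching captions, as in text retrieval.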

Citation

If you find this repository useful, please consider citing our work:

@inproceedings{beit3,
title={Image as a foreign language: {BEiT} pretraining for vision and vision-language tasks},
author={Wenhui Wang and Hangbo Bao and Li Dong and Johan Bjorck and Zhiliang Peng and Qiang Liu and Kriti Aggarwal and Owais Khan Mohammed and Saksham Singhal and Subhojit Som and Furu Wei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}

@article{beitv2,
title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
year={2022},
eprint={2208.06366},
archivePrefix={arXiv},
primaryClass={cs.CV}
}

@inproceedings{beit,
title={{BEiT}: {BERT} Pre-Training of Image Transformers},
author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=p-BhZSz59o4}
}

Acknowledgement

This repository is built using the BEiT, BEiTv2, CLIP, open_clip, Oscar, DeiT, and Dino repositories and the timm library.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using BEiT-3 models, please submit a GitHub issue.