BAAI/DIVA · Hugging Face

Diffusion Feedback Helps CLIP See Better

Wenxuan Wang¹,²,³,*, Quan Sun³,*, Fan Zhang³, Yepeng Tang⁴, Jing Liu¹,², Xinlong Wang³

¹CASIA, ²UCAS, ³BAAI, ⁴BJTU
* Equal contribution

In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., by 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.
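One way to read "generative feedback" is as a denoising objective in which the diffusion model is conditioned on CLIP's visual features rather than on text. The formula below is a minimal sketch under that assumption, not necessarily the authors' exact formulation: it uses the standard noise-prediction loss, and the symbols ($\phi$ for the CLIP image-encoder weights, $\theta$ for the frozen diffusion weights, $c_\phi$ for the visual condition, $\bar{\alpha}_t$ for the noise schedule) are notation introduced here for illustration.

```latex
% Assumed notation: x_0 is a clean image (or its latent), t a diffusion timestep,
% \epsilon Gaussian noise, c_\phi(x_0) the condition built from CLIP visual features.
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}(\phi) &= \mathbb{E}_{x_0,\, \epsilon,\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(x_t,\, t,\, c_\phi(x_0)\right) \right\rVert_2^2 \right],
  \qquad \theta \text{ frozen}.
\end{aligned}
```

Under this reading, only $\phi$ receives gradients, and no captions are needed because the condition is computed from the image itself.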

Model Zoo

Method	Image Size	Params (M)	Average Score (gain)
OpenAI ViT-L-14	224²	427.6	25.9 (+6.6)
OpenAI ViT-L-14	336²	427.9	25.2 (+5.2)
MetaCLIP ViT-L-14	224²	427.6	27.4 (+3.7)
MetaCLIP ViT-H-14	224²	986.1	31.9 (+6.7)
SigLIP ViT-SO-14	224²	877.4	40.7 (+2.9)
SigLIP ViT-SO-14	384²	878.0	38.5 (+1.5)
DFN ViT-H-14	224²	986.1	43.7 (+4.4)
DFN ViT-H-14	378²	986.7	37.8 (+3.0)
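Assuming the value in parentheses is the gain over the corresponding original checkpoint, the pre-DIVA score can be recovered by subtraction. The following is a small sketch of that arithmetic; the helper names `split_score` and `implied_baseline` are hypothetical, and the rows are copied from the table above.

```python
import re

# Rows copied from the Model Zoo table: (method, image size, "score (+gain)").
ROWS = [
    ("OpenAI ViT-L-14", 224, "25.9 (+6.6)"),
    ("OpenAI ViT-L-14", 336, "25.2 (+5.2)"),
    ("MetaCLIP ViT-L-14", 224, "27.4 (+3.7)"),
    ("MetaCLIP ViT-H-14", 224, "31.9 (+6.7)"),
    ("SigLIP ViT-SO-14", 224, "40.7 (+2.9)"),
    ("SigLIP ViT-SO-14", 384, "38.5 (+1.5)"),
    ("DFN ViT-H-14", 224, "43.7 (+4.4)"),
    ("DFN ViT-H-14", 378, "37.8 (+3.0)"),
]

# Matches cells such as "25.9 (+6.6)".
SCORE_RE = re.compile(r"^\s*([\d.]+)\s*\(([+-][\d.]+)\)\s*$")


def split_score(cell):
    """Return (score, gain) parsed from a cell like '25.9 (+6.6)'."""
    match = SCORE_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognized score cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def implied_baseline(cell):
    """Score before the gain was applied, assuming gain = after - before."""
    score, gain = split_score(cell)
    # Round to one decimal to match the table's precision and hide float noise.
    return round(score - gain, 1)


if __name__ == "__main__":
    for method, size, cell in ROWS:
        after, _ = split_score(cell)
        print(f"{method} ({size}px): {implied_baseline(cell)} -> {after}")
```

For the first row, for example, this gives an implied baseline of 25.9 − 6.6 = 19.3.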

📝 Citation

If you find DIVA helpful for your research, please consider citing our paper 📝 and giving us a GitHub star ⭐:

@article{wang2024diffusion,
      title={Diffusion Feedback Helps CLIP See Better},
      author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
      journal={arXiv preprint arXiv:2407.20171},
      year={2024}
}