Model Card for ImageBind
Multimodal joint embedding model for image/video, text, audio, depth, IMU, and thermal images. Given input from any of these six modalities, the model produces a fixed-size embedding in a shared space that can be used for cross-modal and multimodal tasks.
Model Details
Model Description
Multimodal joint embedding model for image/video, text, audio, depth, IMU, and thermal images
- Developed by: Meta AI
- Model type: Multimodal model
- Language(s) (NLP): en
- License: CC BY-NC-SA 4.0
- Resources for more information: the ImageBind research paper and GitHub repo
Uses
This model is intended only for research purposes. It provides a joint embedding space for different modalities -- image/video, text, audio, depth, IMU and thermal images. We hope that these joint embeddings can be used for a variety of cross-modal research tasks, e.g., cross-modal retrieval and combining embeddings from different modalities.
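The uses above hinge on one property: embeddings from any modality live in the same space, so a dot product between, say, a text embedding and an audio embedding is meaningful. A minimal NumPy sketch of cross-modal retrieval and embedding combination, using random placeholder vectors (in practice these would come from ImageBind's encoders; the embedding size here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 1024  # hypothetical embedding size, for illustration only

def normalize(x):
    """L2-normalize embeddings so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder embeddings standing in for ImageBind encoder outputs.
text_emb = normalize(rng.standard_normal((3, EMBED_DIM)))   # e.g. 3 captions
audio_emb = normalize(rng.standard_normal((5, EMBED_DIM)))  # e.g. 5 audio clips

# Cross-modal retrieval: rank audio clips by similarity to each caption.
similarity = text_emb @ audio_emb.T     # shape (3, 5)
best_match = similarity.argmax(axis=1)  # closest clip index per caption

# Combining embeddings from different modalities: a summed, re-normalized
# vector can itself serve as a multimodal query.
combined_query = normalize(text_emb[0] + audio_emb[0])
```

Because every modality maps into the same space, the same two operations (cosine similarity and vector addition) cover both use cases mentioned above.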
Out-of-Scope Use
This model is NOT intended to be used in any real-world application -- commercial or otherwise. It may produce harmful associations with different inputs. The model needs to be investigated, and likely re-trained on specific data, for any such application. The model is expected to work better on web-based visual data since it was trained on such data. The text encoder is likely to work only on English-language text because of the underlying training datasets.
Bias, Risks, and Limitations
Open-domain joint embedding models are prone to specific biases, as shown, e.g., in studies of CLIP. Since our model uses such models for initialization, it will exhibit these biases too. Moreover, to learn joint embeddings for other modalities such as audio, thermal, depth, and IMU, we leverage datasets that are relatively small. These joint embeddings are thus limited to the concepts present in those datasets. For example, the thermal datasets we used are limited to outdoor street scenes, while the depth datasets are limited to indoor scenes.
Training Details
Training Data
ImageBind uses image-paired data for training -- (image, X) where X is one of text, audio, depth, IMU or thermal data. In particular, we initialize and freeze the image and text encoders using an OpenCLIP ViT-H encoder. We train audio embeddings using AudioSet, depth embeddings using the SUN RGB-D dataset, IMU embeddings using the Ego4D dataset, and thermal embeddings using the LLVIP dataset. We provide the exact training data details in the paper.
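The paper aligns each (image, X) pair with an InfoNCE-style contrastive objective: matched pairs in a batch are positives, all other pairings are negatives. A minimal NumPy sketch with hypothetical batch and embedding sizes (not the authors' implementation; see the paper and repo for the real training code):

```python
import numpy as np

def info_nce(image_emb, other_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of (image, X) pairs.

    Row i of each matrix is a matched pair (positive); every other
    pairing in the batch serves as a negative.
    """
    # Normalize so logits are cosine similarities scaled by temperature.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    other_emb = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    logits = image_emb @ other_emb.T / temperature       # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the diagonal (matched pair) as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
batch, dim = 8, 64  # hypothetical sizes for illustration
img = rng.standard_normal((batch, dim))
aud = rng.standard_normal((batch, dim))
loss = info_nce(img, aud)
```

With the image and text encoders frozen, training each extra modality's encoder against images with this kind of loss is what binds all six modalities into one space.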
Training Procedure
Please refer to the research paper and GitHub repo for exact details on this.
Evaluation
Testing Data, Factors & Metrics
We evaluate the model on a variety of different classification benchmarks for each modality. The evaluation details are presented in the paper. The model's performance is measured using standard classification metrics such as accuracy and mAP.
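For reference, the two metrics named above can be sketched in a few lines of NumPy; the toy scores and labels below are hypothetical, not results from the paper:

```python
import numpy as np

def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class is the true label."""
    return float(np.mean(scores.argmax(axis=1) == labels))

def mean_average_precision(scores, multi_labels):
    """mAP over classes, the usual metric for multi-label benchmarks.

    scores: (num_samples, num_classes) prediction scores.
    multi_labels: binary (num_samples, num_classes) ground-truth matrix.
    """
    aps = []
    for c in range(scores.shape[1]):
        order = np.argsort(-scores[:, c])  # rank samples by score, descending
        relevant = multi_labels[order, c]
        if relevant.sum() == 0:
            continue  # skip classes with no positive samples
        precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))

# Hypothetical toy data for illustration
scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([0, 1, 1])
multi = np.array([[1, 0], [0, 1], [0, 1]])
acc = top1_accuracy(scores, labels)          # 2 of 3 predictions correct
m_ap = mean_average_precision(scores, multi)
```

Accuracy fits the single-label benchmarks, while mAP is the standard choice for multi-label audio benchmarks such as AudioSet.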
Citation
BibTeX:
@inproceedings{girdhar2023imagebind,
  title={ImageBind: One Embedding Space To Bind Them All},
  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
  booktitle={CVPR},
  year={2023}
}
Model Card Contact
Please reach out to the authors at: [email protected], [email protected], and [email protected]
How to Get Started with the Model
Our GitHub repo provides a simple example of extracting embeddings from images, audio, and the other modalities.