How to pass CLIP image embeddings to BLIP2 for captioning?

#19 opened by potsu-potsu

Hi, I want to pass CLIP image embeddings (1x768 or 257x768) to BLIP-2 to generate captions and I’m wondering if this can be done through diffusers or other means.

Any help would be greatly appreciated.

Hi,

Note that BLIP-2 models (like the one in this repository) assume a very specific CLIP model, namely an EVA-CLIP vision encoder with 39 layers, as seen here. If you pass embeddings from a different CLIP model, the output will be random. You could pass custom embeddings by replacing this line with them (this would require forking the library). Alternatively, we could add an image_embeds argument to the forward method of Blip2ForConditionalGeneration so that you can pass them directly. Could you open an issue on the Transformers library for that?
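
For reference, here is a minimal sketch (using the public transformers API, with Salesforce/blip2-opt-2.7b as an assumed checkpoint and a placeholder image path) that shows the shape of the embeddings the Q-Former actually expects; a 1x768 or 257x768 embedding from a standard CLIP model will not match it:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint; other BLIP-2 checkpoints use the same EVA-CLIP ViT-g vision tower.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")  # placeholder path
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # This is the call that the linked line performs internally during generation.
    image_embeds = model.vision_model(pixel_values=pixel_values).last_hidden_state

# Vision tower depth/width and the resulting embedding shape, e.g. 39 layers,
# hidden size 1408 and a (1, 257, 1408) tensor, which is why 768-dimensional
# embeddings from a different CLIP model cannot be plugged in directly.
print(model.config.vision_config.num_hidden_layers, model.config.vision_config.hidden_size)
print(image_embeds.shape)
```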

Hello!
Have you solved this problem? I also want to pass CLIP image embeddings to BLIP-2 for image captioning, so I'd appreciate hearing how you approached it and look forward to your reply.
Good luck to you!

Hi @shams123321 ,

I opted to use the LAVIS library instead of transformers and essentially replaced this line in the generate method of the Blip2OPT class with the embeddings that I pass to the method. I also used the pre-trained BLIP-2's associated preprocessor and image encoder to obtain my image embeddings. A rough sketch is below.
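
Here is roughly what that looks like, assuming the blip2_opt model from LAVIS and a fork in which Blip2OPT.generate accepts an extra image_embeds argument in place of the self.ln_vision(self.visual_encoder(image)) line; the image_embeds keyword is hypothetical and only exists in such a fork:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained BLIP-2 OPT model together with its image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

with torch.no_grad():
    # Compute embeddings with the model's own visual encoder; in the forked
    # generate() these are consumed directly instead of being recomputed from `image`.
    image_embeds = model.ln_vision(model.visual_encoder(image))

# Hypothetical signature: the stock generate() does not take image_embeds;
# this assumes the forked method described above.
caption = model.generate({"image": image}, image_embeds=image_embeds)
print(caption)
```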

I hope this helps.
