Google's SigLIP is an alternative to OpenAI's CLIP. It just got merged into 🤗 transformers, and it's super easy to use!
To celebrate this, I have created a repository of notebooks and a bunch of Spaces on various SigLIP-based projects 🥳
Search for art 👉 merve/draw_to_search_art
Compare SigLIP with CLIP 👉 merve/compare_clip_siglip
How does SigLIP work?
SigLIP is a vision-text pre-training technique based on contrastive learning. It jointly trains an image encoder and a text encoder such that the dot product of the embeddings is highest for matching text-image pairs.
The image below is taken from the CLIP paper, where this contrastive pre-training is done with a softmax; SigLIP simply replaces the softmax with a sigmoid. 👇
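To make that concrete, here's a minimal sketch of the pairwise sigmoid loss as I read it from the paper (not the official implementation), assuming L2-normalized embeddings and learnable scalar temperature `t` and bias `b`:

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t, b):
    # img_emb, txt_emb: (n, d) L2-normalized image/text embeddings
    # t, b: learnable scalar temperature and bias
    n = img_emb.size(0)
    logits = img_emb @ txt_emb.t() * t + b                # (n, n) pairwise scores
    labels = 2 * torch.eye(n, device=logits.device) - 1   # +1 for matching pairs, -1 otherwise
    # each image-text pair is an independent binary "match / no match" problem
    return -F.logsigmoid(labels * logits).sum() / n
```

Because every pair is scored independently, there is no softmax normalization over the whole batch, which is exactly what makes huge batch sizes cheap (see the highlights below).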
Highlights from the paper on why you should use it ✨
🖼️ Authors used a medium-sized B/16 ViT for the image encoder and a B-sized transformer for the text encoder
📈 More performant than CLIP on zero-shot classification
🗣️ Authors trained a multilingual model too!
⚡️ Super efficient: the sigmoid loss enables batch sizes of up to 1M items, but the authors chose 32k because performance saturates after that
It's super easy to use thanks to transformers 👇
```python
from transformers import pipeline
from PIL import Image
import requests

# load the zero-shot image classification pipeline with a SigLIP checkpoint
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-256-i18n")

# load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# inference: score the image against free-form candidate labels
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]
print(outputs)
```
For all the SigLIP notebooks on similarity search and indexing, you can check this [repository](https://github.com/merveenoyan/siglip) out. 🤗
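If you'd rather build your own index, here's a minimal sketch (my own example, not from the repository) that pulls the image and text embeddings straight out of the model with the same checkpoint:

```python
import torch
import requests
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-256-i18n"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# SigLIP was trained with padded text, so pad to max_length
inputs = processor(text=["2 cats", "a plane", "a remote"], images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# these are the vectors you'd store in a vector index (e.g. FAISS)
image_embeds = outputs.image_embeds  # shape (1, d)
text_embeds = outputs.text_embeds    # shape (3, d)

# per-pair match probabilities, thanks to the sigmoid
print(torch.sigmoid(outputs.logits_per_image))
```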