Model card for ViT-SO400M-14-SigLIP-384
A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.
This model was converted from the OpenCLIP checkpoint timm/ViT-SO400M-14-SigLIP-384 to a Hugging Face `CLIPVisionModel`.
```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the converted vision tower
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-SO400M-14-SigLIP-384-hf')
vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-SO400M-14-SigLIP-384-hf')

inputs = image_processor(images=image, return_tensors="pt")
outputs = vision_tower(**inputs)
image_features = outputs.pooler_output  # pooled image embedding (this is a vision-only model, so there is no image-text similarity score)
```
There is still a slight difference in pooling: Hugging Face's `CLIPVisionModel` uses the [CLS] token embedding as the pooled output, while SigLIP uses a global attention pooling (MAP) head over all tokens to produce the final latent feature.
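To make the distinction concrete, here is a minimal PyTorch sketch of the two pooling strategies. The dimensions and weights are toy values chosen for illustration, not the actual SO400M configuration, and `AttentionPool` is a simplified stand-in for SigLIP's MAP head:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Simplified global attention pooling (MAP-head-style): a learned query attends over all tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.probe = nn.Parameter(torch.randn(1, 1, dim))  # learned query token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> pooled: (batch, dim)
        probe = self.probe.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(probe, tokens, tokens)
        return pooled[:, 0]

batch, seq_len, dim = 2, 16, 64  # toy sizes, not the real model's
tokens = torch.randn(batch, seq_len, dim)

cls_pooled = tokens[:, 0]                # CLS-style pooling: take the first token only
map_pooled = AttentionPool(dim)(tokens)  # attention pooling: aggregate over all tokens
print(cls_pooled.shape, map_pooled.shape)
```

Both strategies map `(batch, seq_len, dim)` tokens to a single `(batch, dim)` feature, but the attention pooler mixes information from every token, which is why the converted model's `pooler_output` will not exactly match the original SigLIP feature.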