|
--- |
|
pipeline_tag: zero-shot-classification |
|
base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K |
|
inference: false |
|
tags: |
|
- deepsparse |
|
--- |
|
This is a quantized version of [laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K](https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K) that is ready to use with [DeepSparse](https://github.com/neuralmagic/deepsparse). It achieves 71.1% zero-shot accuracy on ImageNet.
|
|
|
## Usage |
|
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZvU9ZSHJKSeJyH5bgxo_A-GSVIUcSt2E?usp=sharing) |
|
First, install DeepSparse with extensions for CLIP: |
|
```
pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
```
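
To confirm the CLIP extra was picked up, a quick import check like the following should work (a minimal sanity check; the printed version string will vary):

```python
# Both imports should succeed if the [clip] extra installed correctly
import deepsparse
import deepsparse.clip  # raises ImportError if the CLIP extra is missing

print(deepsparse.__version__)
```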
|
|
|
Download some test images of a church, a dog, and elephants: |
|
```
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
```
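
If `wget` is not available, the same files can be fetched with Python's standard library, for example:

```python
# Fetch the same sample images without wget, using only the standard library
import urllib.request

image_urls = {
    "basilica.jpg": "https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg",
    "buddy.jpeg": "https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg",
    "thailand.jpg": "https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg",
}
for filename, url in image_urls.items():
    urllib.request.urlretrieve(url, filename)
```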
|
|
|
The text model for this export takes a second input containing the token sequence lengths, so apply the following input override before creating the pipeline:
|
```python
import numpy as np
from deepsparse.clip import CLIPTextPipeline

def custom_process_inputs(self, inputs):
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    if not isinstance(inputs.text[0], str):
        return inputs.text
    # Tokenize the labels and stack them into a single int32 batch
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    # Second input: one length value per sequence (index of the last token)
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]

# This overrides the process_inputs function globally for all CLIPTextPipeline classes
CLIPTextPipeline.process_inputs = custom_process_inputs
```
|
|
|
Then make and run a pipeline in Python: |
|
```python
import numpy as np
from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
    CLIPZeroShotInput,
)
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Load the model into DeepSparse
pipeline = Pipeline.create(
    task="clip_zeroshot",
    visual_model_path=model_folder + "/visual.onnx",
    text_model_path=model_folder + "/textual.onnx",
)

# Infer
output = pipeline(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
).text_scores

# Print the highest-scoring label for each image
for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
```
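
If you want probabilities rather than raw scores, you can apply a softmax per image. This is a small sketch that assumes `text_scores` holds one row of scores per input image, as the loop above does:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over one row of scores
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max())
    return e / e.sum()

for image, scores in zip(images, output):
    probs = softmax(scores)
    best = int(np.argmax(probs))
    print(f"{image}: {possible_classes[best]} ({probs[best]:.1%})")
```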