File size: 2,975 Bytes
6b438fd
 
 
 
 
 
dcca9b3
b198789
 
dcca9b3
b198789
 
 
 
 
 
 
 
 
 
 
 
43517d5
b198789
 
43517d5
b198789
43517d5
b198789
 
 
 
 
 
 
 
 
43517d5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b198789
 
 
 
43517d5
 
 
 
 
 
b198789
81bfb1c
43517d5
b198789
 
43517d5
b198789
 
 
 
 
 
 
 
 
 
6b438fd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
---
pipeline_tag: zero-shot-classification
base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
tags:
- deepsparse
---
This is a quantized version of https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K that is ready to use with [DeepSparse](https://github.com/neuralmagic/deepsparse). It achieves 71.1% one-shot accuracy on ImageNet.

## Usage
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZvU9ZSHJKSeJyH5bgxo_A-GSVIUcSt2E?usp=sharing)
First, install DeepSparse with extensions for CLIP:
```
pip install deepsparse-nightly[clip]>=1.7.0.20231210
```

Download some test images of a church, a dog, and elephants:
```
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
```

For this model there is a second input that is the length of tokens, so run this input override before making the pipeline:
```python
import numpy as np
from deepsparse.clip import CLIPTextPipeline

def custom_process_inputs(self, inputs):
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    if not isinstance(inputs.text[0], str):
        return inputs.text
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]

# This overrides the process_inputs function globally for all CLIPTextPipeline classes
CLIPTextPipeline.process_inputs = custom_process_inputs
```

Then make and run a pipeline in Python:
```python
from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
    CLIPZeroShotInput
)
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="mgoin/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Load the model into DeepSparse
pipeline = Pipeline.create(
    task="clip_zeroshot", 
    visual_model_path=model_folder + "/visual.onnx", 
    text_model_path=model_folder + "/textual.onnx"
)

# Infer
output = pipeline(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
).text_scores

for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
```