---
pipeline_tag: zero-shot-classification
base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
inference: false
tags:
- deepsparse
---
This is a [SparseML](https://github.com/neuralmagic/sparseml) quantized version of
https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K that is ready to use with
the [DeepSparse](https://github.com/neuralmagic/deepsparse) CPU inference engine. 
It achieves **71.1%** zero-shot top-1 accuracy on ImageNet and **95.6%** zero-shot top-1 accuracy on Imagenette. 
For comparison, the dense version (the original model) achieves **72.8%** on ImageNet and **95.7%** on Imagenette.

On a 64-core Intel CPU with AVX-512 and VNNI support, this model achieves a **2.35x** speedup for text inputs
and a **2.84x** speedup for image inputs compared with the full-precision model. With a batch size of 64,
the measured throughput was **1230 items/sec** for images and **2009 items/sec** for text.
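
As a rough reading of these figures, the sketch below converts the quoted throughput into an average time per batch and the quoted accuracies into a gap in percentage points. It only rearranges the numbers stated in this card; the helper names are illustrative, and the per-batch time assumes throughput is steady, which may not hold on other hardware.

```python
# Figures quoted in this model card
BATCH_SIZE = 64
THROUGHPUT = {"images": 1230, "text": 2009}  # items/sec
TOP1 = {"ImageNet": (72.8, 71.1), "Imagenette": (95.7, 95.6)}  # (dense, quantized)

def batch_time_ms(items_per_sec, batch_size=BATCH_SIZE):
    """Average time for one batch, assuming steady-state throughput."""
    return 1000.0 * batch_size / items_per_sec

def accuracy_gap(dense, quantized):
    """Top-1 accuracy given up by quantization, in percentage points."""
    return round(dense - quantized, 1)

for kind, rate in THROUGHPUT.items():
    print(f"{kind}: ~{batch_time_ms(rate):.1f} ms per batch of {BATCH_SIZE}")
for dataset, (dense, quantized) in TOP1.items():
    print(f"{dataset}: {accuracy_gap(dense, quantized)} points below dense")
```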

Notebook for basic usage: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZvU9ZSHJKSeJyH5bgxo_A-GSVIUcSt2E?usp=sharing)
Notebook for Imagenette evaluation: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-Duq0YNtjzOnmuXCYo-5DDiOzeCItXpN?usp=sharing)

## The team

This model and the example pipeline were created by Eugenia Iofinova, Michael Goin, Chris Wendler, and Dan Alistarh.
Special thanks to Abhinav Agarwalla and Alexandre Marques for technical support with parts of the project.

## Setup for usage
First, install DeepSparse with extensions for CLIP:
```
pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
```

Download some test images of a church, a dog, and elephants:
```
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
```

The text model in this repository takes a second input holding the token lengths, so run the following override before creating a text pipeline:
```python
import numpy as np
from deepsparse.clip import CLIPTextPipeline

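def custom_process_inputs(self, inputs):
    # Accept a single string as well as a list of strings
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    # Anything that is not raw text is passed through unchanged
    if not isinstance(inputs.text[0], str):
        return inputs.text
    # Tokenize and stack into one 2-D int32 array, one row per prompt
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    # Second model input: one entry per prompt, each the last index of a row
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]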

# This overrides the process_inputs function globally for all CLIPTextPipeline classes
CLIPTextPipeline.process_inputs = custom_process_inputs
```
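
The `tokens_lengths` line in the override relies on Python list repetition: an integer times a one-element list yields one entry per prompt. The sketch below isolates that expression with plain values so its result is easy to inspect. The `token_lengths` helper is hypothetical, and the width of 77 is an assumption about the tokenizer's padded context length; check `tokens.shape` in your own run.

```python
def token_lengths(batch_size, context_length):
    # int * list repeats the list, giving one entry per prompt;
    # each entry is the index of the last position in a padded row
    return batch_size * [context_length - 1]

batch_size = 3       # number of prompts (illustrative)
context_length = 77  # assumed padded width of each tokenized prompt

lengths = token_lengths(batch_size, context_length)
print(lengths)  # [76, 76, 76]
```

Every prompt receives the same value because the tokenizer pads all rows to a common width, so the override does not need to inspect the individual prompts.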

## Text embedding pipeline

Here is an example of how to create and use a [DeepSparse pipeline for text embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/text_pipeline.py).
```python
from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

text_embed_pipeline = Pipeline.create(task="clip_text", model_path=model_folder + "/textual.onnx")

text = ["ice cream", "an elephant", "a dog", "a building", "a church"]

embeddings = text_embed_pipeline(text=text).text_embeddings
for i in range(len(embeddings)):
    print(embeddings[i].shape)
    print(embeddings[i])
```

## Image embedding pipeline

Here is an example of how to create and use a [DeepSparse pipeline for image embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/visual_pipeline.py).
```python
from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

image_embed_pipeline = Pipeline.create(task="clip_visual", model_path=model_folder + "/visual.onnx")

images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

embeddings = image_embed_pipeline(images=images).image_embeddings
for i in range(len(embeddings)):
    print(embeddings[i].shape)
    print(embeddings[i])
```

## Zero-shot image classification pipeline

Because CLIP trains its text and image encoders jointly, embeddings from the two models live in the same space and can be compared directly without retraining. Here is an example of how to create and use a [DeepSparse pipeline for zero-shot image classification](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/zeroshot_pipeline.py).
```python
import numpy as np
from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
    CLIPZeroShotInput
)
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Load the model into DeepSparse
pipeline = Pipeline.create(
    task="clip_zeroshot",
    visual_model_path=model_folder + "/visual.onnx",
    text_model_path=model_folder + "/textual.onnx"
)

# Infer
output = pipeline(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
).text_scores

for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
```
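
To make the link between embeddings and class scores concrete, here is a minimal sketch of the usual CLIP-style scoring recipe: normalize both embeddings, take cosine similarities, scale them, and apply a softmax over the candidate classes. This is not taken from the DeepSparse pipeline source, which may compute `text_scores` differently. The function names, the toy 3-dimensional vectors, and the `scale=100.0` value are all assumptions for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(image_embedding, text_embeddings, scale=100.0):
    """Cosine similarity of one image against each class prompt, then softmax."""
    image = l2_normalize(image_embedding)
    sims = [
        sum(a * b for a, b in zip(image, l2_normalize(text)))
        for text in text_embeddings
    ]
    return softmax([scale * s for s in sims])

# Toy stand-ins for real embeddings (real ones have many more dimensions)
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = zero_shot_scores(image_embedding, text_embeddings)
best = max(range(len(scores)), key=scores.__getitem__)
print(f"Best class index: {best}")
```

With real embeddings, the same function could be applied per image to the outputs of the text and image embedding pipelines shown earlier, with `best` indexing into the list of class prompts.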