---
pipeline_tag: zero-shot-classification
base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
inference: false
tags:
- deepsparse
---

This is a [SparseML](https://github.com/neuralmagic/sparseml) quantized version of
[laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K](https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K)
that is ready to use with the [DeepSparse](https://github.com/neuralmagic/deepsparse) CPU inference engine.

It achieves **71.1%** zero-shot top-1 accuracy on ImageNet and **95.6%** zero-shot top-1 accuracy on Imagenette.
For comparison, the dense (original) model achieves **72.8%** on ImageNet and **95.7%** on Imagenette.

On a 64-core Intel CPU with AVX-512 and VNNI support, this model achieves a **2.35x** speedup on textual inputs
and a **2.84x** speedup on visual inputs compared to the full-precision model. With a batch size of 64,
throughput was measured at **1230 items/sec** for images and **2009 items/sec** for text.
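
These numbers will vary with your hardware. As a rough sketch of how one might reproduce such measurements with the `deepsparse.benchmark` CLI (assuming `visual.onnx` and `textual.onnx` have been downloaded locally as shown in the sections below, and that the default benchmark settings are acceptable):

```
deepsparse.benchmark visual.onnx --batch_size 64
deepsparse.benchmark textual.onnx --batch_size 64
```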

Notebook for basic usage: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZvU9ZSHJKSeJyH5bgxo_A-GSVIUcSt2E?usp=sharing)

Notebook for Imagenette evaluation: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-Duq0YNtjzOnmuXCYo-5DDiOzeCItXpN?usp=sharing)

## The team

This model and the example pipeline were created by Eugenia Iofinova, Michael Goin, Chris Wendler, and Dan Alistarh.
Special thanks to Abhinav Agarwalla and Alexandre Marques for technical support with parts of the project.

## Setup for usage

First, install DeepSparse with the CLIP extensions (the requirement is quoted so the shell does not interpret the brackets or `>=`):

```
pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
```

Download some test images of a church, a dog, and elephants:

```
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
```

The exported text model takes a second input holding the token sequence lengths, so run this override before creating any text pipeline:

```python
import numpy as np

from deepsparse.clip import CLIPTextPipeline


def custom_process_inputs(self, inputs):
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    if not isinstance(inputs.text[0], str):
        return inputs.text
    # Tokenize the prompts and stack them into a single (batch, seq_len) array
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    # Second model input: one entry per sequence, set to seq_len - 1
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]


# This overrides the process_inputs function globally for all CLIPTextPipeline classes
CLIPTextPipeline.process_inputs = custom_process_inputs
```

## Text embedding pipeline

Here is an example of how to create and use a [DeepSparse pipeline for text embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/text_pipeline.py).

```python
from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

text_embed_pipeline = Pipeline.create(task="clip_text", model_path=model_folder + "/textual.onnx")

text = ["ice cream", "an elephant", "a dog", "a building", "a church"]

embeddings = text_embed_pipeline(text=text).text_embeddings
for i in range(len(embeddings)):
    print(embeddings[i].shape)
    print(embeddings[i])
```
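
The returned vectors can be compared directly with cosine similarity. A minimal sketch, assuming each entry of `embeddings` is a 1-D numpy vector:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. similarity of "ice cream" vs. "an elephant"
print(cosine(embeddings[0], embeddings[1]))
```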

## Image embedding pipeline

Here is an example of how to create and use a [DeepSparse pipeline for image embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/visual_pipeline.py).

```python
from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

image_embed_pipeline = Pipeline.create(task="clip_visual", model_path=model_folder + "/visual.onnx")

images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

embeddings = image_embed_pipeline(images=images).image_embeddings
for i in range(len(embeddings)):
    print(embeddings[i].shape)
    print(embeddings[i])
```

## Zero-shot image classification pipeline

Because CLIP's text and image encoders were trained in tandem, we can generate embeddings for both modalities and relate them to each other without any retraining. Here is an example of how to create and use a [DeepSparse pipeline for zero-shot image classification](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/zeroshot_pipeline.py).

```python
import numpy as np

from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
)
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Load the model into DeepSparse
pipeline = Pipeline.create(
    task="clip_zeroshot",
    visual_model_path=model_folder + "/visual.onnx",
    text_model_path=model_folder + "/textual.onnx",
)

# Infer
output = pipeline(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
).text_scores

for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
```
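
Conceptually, zero-shot classification just relates the two embedding spaces. As a rough sketch of the idea (not the pipeline's exact implementation, which may apply additional scaling to the scores), scoring stacked image and text embeddings by cosine similarity looks like this:

```python
import numpy as np

def cosine_scores(image_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    # Normalize each embedding to unit length, then take dot products:
    # result[i, j] is the similarity of image i to class prompt j.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return image_embs @ text_embs.T

# Hypothetical usage, assuming the pipeline outputs stack into 2-D arrays:
# scores = cosine_scores(np.vstack(image_embeddings), np.vstack(text_embeddings))
# predictions = [possible_classes[j] for j in scores.argmax(axis=1)]
```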