---
pipeline_tag: zero-shot-classification
base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
inference: false
tags:
- deepsparse
---
This is a [SparseML](https://github.com/neuralmagic/sparseml) quantized version of https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K that is ready to use with [DeepSparse](https://github.com/neuralmagic/deepsparse).
It achieves 71.1% zero-shot accuracy on ImageNet and 95.6% zero-shot accuracy on Imagenette.

Notebook for basic usage: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZvU9ZSHJKSeJyH5bgxo_A-GSVIUcSt2E?usp=sharing)

Notebook for Imagenette evaluation: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-Duq0YNtjzOnmuXCYo-5DDiOzeCItXpN?usp=sharing)

## Setup for usage

First, install DeepSparse with extensions for CLIP (the quotes keep the shell from treating `>` as a redirect):
```
pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
```

Download some test images of a church, a dog, and elephants:
```
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
```

The text model takes a second input, the length of the tokens, so run this input override code before creating a text pipeline:
```python
import numpy as np
from deepsparse.clip import CLIPTextPipeline

def custom_process_inputs(self, inputs):
    # Accept a single string as well as a list of strings
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    # Inputs that are not strings are passed through unchanged
    if not isinstance(inputs.text[0], str):
        return inputs.text
    # Tokenize each string and stack the results into one batch
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    # Second model input: one length value per text in the batch
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]

# This overrides the process_inputs function globally for all CLIPTextPipeline classes
CLIPTextPipeline.process_inputs = custom_process_inputs
```

## Text embedding pipeline

Here is an example of how to create and use a [DeepSparse pipeline for text embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/text_pipeline.py).
```python
from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

text_embed_pipeline = Pipeline.create(task="clip_text", model_path=model_folder + "/textual.onnx")

text = ["ice cream", "an elephant", "a dog", "a building", "a church"]

embeddings = text_embed_pipeline(text=text).text_embeddings
# Print the shape and values of each output embedding
for embedding in embeddings:
    print(embedding.shape)
    print(embedding)
```

## Image embedding pipeline

Here is an example of how to create and use a [DeepSparse pipeline for image embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/visual_pipeline.py).
```python
from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

image_embed_pipeline = Pipeline.create(task="clip_visual", model_path=model_folder + "/visual.onnx")

images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

embeddings = image_embed_pipeline(images=images).image_embeddings
# Print the shape and values of each output embedding
for embedding in embeddings:
    print(embedding.shape)
    print(embedding)
```

## Zero-shot image classification pipeline

Because CLIP trains the text and image embedding models in tandem, we can generate embeddings with both and relate them to each other without retraining.
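
To make the idea of relating the two kinds of embeddings concrete, here is a minimal NumPy sketch. It assumes the common CLIP recipe of L2-normalizing both sets of embeddings, taking their dot products, and applying a softmax over the text candidates. The function name `similarity_scores`, the `scale` value of 100, and the toy 3-dimensional vectors are illustrative assumptions, not part of the DeepSparse pipeline.

```python
import numpy as np

def similarity_scores(image_embeddings, text_embeddings, scale=100.0):
    """Return a (num_images, num_texts) matrix of softmax scores (hypothetical helper)."""
    img = np.asarray(image_embeddings, dtype=np.float64)
    txt = np.asarray(text_embeddings, dtype=np.float64)
    # L2-normalize so that the dot product equals cosine similarity
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    # Scaled cosine similarities; `scale` is an assumed temperature
    logits = scale * (img @ txt.T)
    # Numerically stable softmax over the text candidates
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy vectors standing in for real embeddings
toy_images = np.array([[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]])
toy_texts = np.array([[0.9, 0.0, 0.1], [0.1, 0.0, 0.9]])
scores = similarity_scores(toy_images, toy_texts)
print(scores.argmax(axis=-1))  # index of the best-matching text for each image
```

Each row of the result sums to 1, so it can be read as a distribution over the candidate texts for one image.
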
Here is an example of how to create and use a [DeepSparse pipeline for zero-shot image classification](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/zeroshot_pipeline.py).
```python
import numpy as np
from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
    CLIPZeroShotInput
)
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Load the model into DeepSparse
pipeline = Pipeline.create(
    task="clip_zeroshot",
    visual_model_path=model_folder + "/visual.onnx",
    text_model_path=model_folder + "/textual.onnx"
)

# Infer
output = pipeline(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
).text_scores

# Pick the highest-scoring class for each image
for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
```
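
Beyond the single best label, it can be useful to see the runner-up classes for each image. The following is a small sketch, assuming the scores are arranged as one row per image and one column per class, as the `np.argmax(output[i])` indexing above suggests. The helper `top_k_labels` and the score values are hypothetical, made up for illustration rather than produced by the model.

```python
import numpy as np

def top_k_labels(scores, labels, k=2):
    """Return, per image, the k highest-scoring (label, score) pairs (hypothetical helper)."""
    scores = np.asarray(scores, dtype=np.float64)
    # Sort class indices by descending score and keep the first k per row
    order = np.argsort(-scores, axis=-1)[:, :k]
    return [
        [(labels[j], float(row[j])) for j in idx]
        for row, idx in zip(scores, order)
    ]

labels = ["ice cream", "an elephant", "a dog", "a building", "a church"]
# Made-up scores for two images; real values would come from the pipeline
made_up_scores = [
    [0.01, 0.02, 0.02, 0.25, 0.70],
    [0.03, 0.05, 0.85, 0.04, 0.03],
]
ranked = top_k_labels(made_up_scores, labels)
for row in ranked:
    print(row)
```
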