This is a quantized version of [laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K](https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K) that is ready to use with [DeepSparse](https://github.com/neuralmagic/deepsparse). It achieves 71.1% zero-shot accuracy on ImageNet.

## Usage

First, install DeepSparse with the extensions for CLIP (quoting the requirement so the shell does not treat `>` as a redirect):

```
pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
```

Download some test images of a church, a dog, and elephants:

```
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
```

Then create and run a pipeline in Python:

```python
import numpy as np

from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPTextPipeline,
    CLIPVisualInput,
    CLIPZeroShotInput,
)


def new_process_inputs(self, inputs: CLIPTextInput):
    # Tokenize the class names and pass along the index of the last token
    # in each sequence, as expected by the exported textual ONNX model.
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    if not isinstance(inputs.text[0], str):
        return inputs.text
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]


# This overrides process_inputs globally for all CLIPTextPipeline instances,
# so the zero-shot pipeline created below picks up the patched behavior.
CLIPTextPipeline.process_inputs = new_process_inputs

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

pipeline = Pipeline.create(
    task="clip_zeroshot",
    visual_model_path="visual.onnx",
    text_model_path="textual.onnx",
)

pipeline_input = CLIPZeroShotInput(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
)

output = pipeline(pipeline_input).text_scores
for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
```
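The pipeline above expects `visual.onnx` and `textual.onnx` in the working directory. If you haven't fetched them yet, here is a minimal sketch using `huggingface_hub`; the repo id shown is a placeholder, so substitute this model repository's actual id:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id: replace with the id of this model repository.
# local_dir="." writes the files next to the script, so the relative
# paths passed to Pipeline.create above resolve.
for filename in ("visual.onnx", "textual.onnx"):
    hf_hub_download(
        repo_id="<this-model-repo-id>",
        filename=filename,
        local_dir=".",
    )
```

By default `hf_hub_download` stores files in the Hugging Face cache and returns the cached path; passing `local_dir="."` is just one way to make the bare `visual.onnx`/`textual.onnx` paths in the example work unchanged.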