neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds

Zero-Shot Classification


	---
	pipeline_tag: zero-shot-classification
	base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
	tags:
	- deepsparse
	---
	This is a quantized version of [laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K](https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K) that is ready to use with [DeepSparse](https://github.com/neuralmagic/deepsparse). It achieves 71.1% zero-shot accuracy on ImageNet.

	## Usage
	[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZvU9ZSHJKSeJyH5bgxo_A-GSVIUcSt2E?usp=sharing)
	First, install DeepSparse with extensions for CLIP:
	```
	pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
	```

	Download some test images of a church, a dog, and elephants:
	```
	wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
	wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
	wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
	```
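If `wget` is not available (for example on Windows), the same three files can be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `download_images` and the `IMAGE_URLS` mapping are illustrative and not part of DeepSparse, while the URLs are the ones from the `wget` commands above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# File names and URLs copied from the wget commands above.
BASE = "https://raw.githubusercontent.com/neuralmagic/deepsparse/main"
IMAGE_URLS = {
    "basilica.jpg": f"{BASE}/src/deepsparse/yolo/sample_images/basilica.jpg",
    "buddy.jpeg": f"{BASE}/tests/deepsparse/pipelines/sample_images/buddy.jpeg",
    "thailand.jpg": f"{BASE}/src/deepsparse/yolact/sample_images/thailand.jpg",
}


def download_images(urls=IMAGE_URLS, dest_dir="."):
    """Hypothetical helper: fetch each image unless the file already exists."""
    saved = []
    for name, url in urls.items():
        target = Path(dest_dir) / name
        if not target.exists():
            urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

Calling `download_images()` once in the working directory should leave `basilica.jpg`, `buddy.jpeg`, and `thailand.jpg` next to your script, which is where the pipeline example expects them.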

	The text model in this export takes a second input that holds the token lengths, so apply the following input override before creating the pipeline:
	```python
	import numpy as np
	from deepsparse.clip import CLIPTextPipeline

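	def custom_process_inputs(self, inputs):
	    # Accept a single string as well as a list of strings
	    if not isinstance(inputs.text, list):
	        inputs.text = [inputs.text]
	    # Anything that is not raw text is passed through unchanged
	    if not isinstance(inputs.text[0], str):
	        return inputs.text
	    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
	    tokens = np.stack(tokens, axis=0)
	    # Second model input: one length per prompt (padded sequence length minus one)
	    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
	    return [tokens, tokens_lengths]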

	# This overrides the process_inputs function globally for all CLIPTextPipeline classes
	CLIPTextPipeline.process_inputs = custom_process_inputs
	```
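To see what the extra input looks like, the length expression from the override can be run on its own. This sketch uses a dummy token array instead of the real tokenizer; the context length of 77 is an assumption for illustration, and the tokenizer bundled with the pipeline may pad to a different length.

```python
import numpy as np

# Hypothetical stand-in for the tokenizer output: 3 prompts, each padded to
# an assumed context length of 77 token IDs (the real tokenizer may differ).
batch_size, context_length = 3, 77
tokens = np.zeros((batch_size, context_length), dtype=np.int32)

# Same expression as in custom_process_inputs: one entry per prompt,
# each equal to the padded sequence length minus one.
tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])

print(tokens.shape)     # (3, 77)
print(tokens_lengths)   # [76 76 76]
```

Note that every entry is identical: the value depends only on the padded shape, not on how many tokens each prompt actually contains.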

	Then make and run a pipeline in Python:
	```python
	import numpy as np
	from deepsparse import Pipeline
	from deepsparse.clip import (
	    CLIPTextInput,
	    CLIPVisualInput,
	    CLIPZeroShotInput,
	)
	from huggingface_hub import snapshot_download

	# Download the model from HF
	model_folder = snapshot_download(repo_id="mgoin/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

	possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
	images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

	# Load the model into DeepSparse
	pipeline = Pipeline.create(
	    task="clip_zeroshot",
	    visual_model_path=model_folder + "/visual.onnx",
	    text_model_path=model_folder + "/textual.onnx",
	)

	# Infer
	output = pipeline(
	    image=CLIPVisualInput(images=images),
	    text=CLIPTextInput(text=possible_classes),
	).text_scores

	for i in range(len(output)):
	    prediction = possible_classes[np.argmax(output[i])]
	    print(f"Image {images[i]} is a picture of {prediction}")

	"""
	Image basilica.jpg is a picture of a church
	Image buddy.jpeg is a picture of a dog
	Image thailand.jpg is a picture of an elephant
	"""
	```
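The loop above keeps only the best label per image. To inspect the runners-up as well, the scores can be ranked per image. The sketch below runs on made-up numbers so that it works without the model; it assumes `text_scores` can be treated as a 2D array with one row per image and one column per class, which is what the indexing in the loop above implies. The helper `top_k_labels` is illustrative and not part of DeepSparse.

```python
import numpy as np

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Made-up scores for illustration only, one row per image and one column
# per class. Real values would come from `pipeline(...).text_scores`.
scores = np.array([
    [0.01, 0.02, 0.03, 0.24, 0.70],
    [0.04, 0.01, 0.90, 0.03, 0.02],
    [0.01, 0.92, 0.04, 0.02, 0.01],
])


def top_k_labels(row, labels, k=2):
    """Hypothetical helper: return the k (label, score) pairs with the highest scores."""
    order = np.argsort(row)[::-1][:k]
    return [(labels[i], float(row[i])) for i in order]


for name, row in zip(images, scores):
    ranked = ", ".join(
        f"{label} ({score:.2f})" for label, score in top_k_labels(row, possible_classes)
    )
    print(f"{name}: {ranked}")
```

With real pipeline output, replacing `scores` with `np.array(output)` should be enough, provided the assumption about its shape holds.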