Xenova (HF staff) committed
Commit 4a45991
1 Parent(s): cc93df4

Update README.md

Files changed (1):
  1. README.md +53 -0
README.md CHANGED
@@ -5,6 +5,7 @@ tags:
- fashion
- multimodal retrieval
- siglip
+ - transformers.js
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
@@ -25,6 +26,9 @@ The model was fine-tuned from ViT-B-16-SigLIP (webli).


## Usage
+
+ ### OpenCLIP
+
The model can be seamlessly used with [OpenCLIP](https://github.com/mlfoundations/open_clip) by

```python
@@ -49,6 +53,55 @@ with torch.no_grad(), torch.cuda.amp.autocast():
print("Label probs:", text_probs)
```

+ ### Transformers.js
+
+ You can also run the model in JavaScript with the [Transformers.js](https://huggingface.co/docs/transformers.js) library.
+
+ First, install it from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
+
+ ```bash
+ npm i @huggingface/transformers
+ ```
+
+ Then, compute embeddings as follows:
+ ```js
+ import { SiglipTextModel, SiglipVisionModel, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
+
+ const model_id = 'Marqo/marqo-fashionSigLIP';
+
+ // Load tokenizer and text model
+ const tokenizer = await AutoTokenizer.from_pretrained(model_id);
+ const text_model = await SiglipTextModel.from_pretrained(model_id);
+
+ // Load processor and vision model
+ const processor = await AutoProcessor.from_pretrained(model_id);
+ const vision_model = await SiglipVisionModel.from_pretrained(model_id);
+
+ // Run tokenization
+ const texts = ['a hat', 'a t-shirt', 'shoes'];
+ const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
+
+ // Compute text embeddings
+ const { text_embeds } = await text_model(text_inputs);
+
+ // Read image and run processor
+ const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
+ const image_inputs = await processor(image);
+
+ // Compute vision embeddings
+ const { image_embeds } = await vision_model(image_inputs);
+
+ // Compute similarity scores
+ const normalized_text_embeds = text_embeds.normalize().tolist();
+ const normalized_image_embeds = image_embeds.normalize().tolist()[0];
+
+ const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
+     100.0 * dot(normalized_image_embeds, text_embed)
+ ));
+ console.log(text_probs);
+ // [0.9860219105287394, 0.00777916527489097, 0.006198924196369721]
+ ```
+
## Benchmark Results
Average evaluation results on 6 public multimodal fashion datasets ([Atlas](https://huggingface.co/datasets/Marqo/atlas), [DeepFashion (In-shop)](https://huggingface.co/datasets/Marqo/deepfashion-inshop), [DeepFashion (Multimodal)](https://huggingface.co/datasets/Marqo/deepfashion-multimodal), [Fashion200k](https://huggingface.co/datasets/Marqo/fashion200k), [KAGL](https://huggingface.co/datasets/Marqo/KAGL), and [Polyvore](https://huggingface.co/datasets/Marqo/polyvore)) are reported below:
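
Aside: the diff above only shows the edges of the OpenCLIP example (the `with torch.no_grad(), torch.cuda.amp.autocast():` context line and the final `print`). Below is a minimal sketch of what the full snippet plausibly looks like, assuming the standard `open_clip` hf-hub loading API; the image URL and labels are borrowed from the Transformers.js example above, and everything except the two visible lines is an assumption rather than the repository's exact code.

```python
import torch
import open_clip
import requests
from PIL import Image

# Load the fine-tuned checkpoint and its preprocessing transforms from the Hub
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')
model.eval()

# Example inputs (borrowed from the Transformers.js example above)
url = 'https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png'
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)
text = tokenizer(['a hat', 'a t-shirt', 'shoes'])

# Only the `with` line and the final print are visible in the diff; the body
# in between follows the usual open_clip zero-shot classification pattern.
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```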