anahita-b committed
Commit e54c068
1 Parent(s): 9bdaa4b

Update README.md

Files changed (1): README.md (+34 -1)
README.md CHANGED
@@ -31,7 +31,39 @@ Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years.

### How to use

- Here is how to use this model to perform image and text matching:
+ Here is how to use this model to perform contrastive learning between image and text pairs:
+
+ ```python
+ from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
+ import requests
+ from PIL import Image
+ import torch
+
+ device = torch.device('cuda')
+ image_urls = [
+     "https://farm4.staticflickr.com/3395/3428278415_81c3e27f15_z.jpg",
+     "http://images.cocodataset.org/val2017/000000039769.jpg"]
+ texts = [
+     "two dogs in a car",
+     "two cats sleeping on a couch"]
+ images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
+
+ processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
+ model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc").to(device)
+
+ inputs = processor(images, texts, padding=True, return_tensors="pt").to(device)
+ outputs = model(**inputs, return_loss=True)
+
+ inputs = processor(images, texts[::-1], padding=True, return_tensors="pt").to(device)
+ outputs_swapped = model(**inputs, return_loss=True)
+
+ print('Loss', outputs.loss.item())
+ print('Loss with swapped images', outputs_swapped.loss.item())
+ # Loss 0.0027269450947642326
+ # Loss with swapped images 2.987490177154541
+ ```
+
+ Here is how to use this model to perform image and text matching:

```python
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
@@ -54,6 +86,7 @@ for text in texts:
    scores[text] = outputs.logits[0,1].item()
```

+
Here is how to use this model to perform masked language modeling:

```python
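
The contrastive snippet in the diff only exercises the loss. The same forward pass also returns projected embeddings that can score image-text pairs directly. Below is a minimal sketch of retrieval-style ranking, reusing `model`, `processor`, `images`, `texts`, and `device` from the snippet above, and assuming the output fields are named `text_embeds` and `image_embeds` (verify against your installed transformers version):

```python
import torch

# Reuse model, processor, images, texts, device from the contrastive snippet above.
with torch.no_grad():
    inputs = processor(images, texts, padding=True, return_tensors="pt").to(device)
    outputs = model(**inputs)

# Assumed field names; normalizing explicitly in case the checkpoint does not.
text_z = torch.nn.functional.normalize(outputs.text_embeds, dim=-1)
image_z = torch.nn.functional.normalize(outputs.image_embeds, dim=-1)

# Cosine similarity matrix: rows are images, columns are texts.
similarity = image_z @ text_z.T
print(similarity.argmax(dim=1))  # expect tensor([0, 1]): each image matches its own caption
```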
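
The two hunks show only the first and last lines of the image-and-text-matching example; the unchanged middle is elided by the diff. For reference, here is a sketch of the full pattern as it appears in the standard BridgeTower docs; the image URL and candidate captions are illustrative assumptions, not lines from this commit:

```python
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
import requests
from PIL import Image

# Illustrative inputs (not part of the diff)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

scores = dict()
for text in texts:
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    # logits[0, 1] is the "match" logit for this image-text pair
    scores[text] = outputs.logits[0, 1].item()
print(scores)
```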
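
The diff also cuts off at the opening fence of the masked-language-modeling example. Here is a minimal sketch of that usage in the same docs pattern; the image URL, the masked sentence, and the decode step are assumptions, not content of this commit:

```python
from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
from PIL import Image
import requests

# Illustrative inputs (not part of the diff); <mask> is the RoBERTa-style mask token
url = "http://images.cocodataset.org/val2017/000000360943.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "a <mask> looking out of the window"

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
# Greedy decode of the predicted token ids back to text
results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
print(results)
```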