anahita-b committed
Commit e54c068
1 Parent(s): 9bdaa4b

Update README.md

Files changed (1): README.md (+34 -1)
README.md CHANGED
@@ -31,7 +31,39 @@ Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years.

### How to use

- Here is how to use this model to perform image and text matching:
+ Here is how to use this model to perform contrastive learning between image and text pairs:
+
+ ```python
+ from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
+ import requests
+ from PIL import Image
+ import torch
+
+ device = torch.device('cuda')
+ image_urls = [
+     "https://farm4.staticflickr.com/3395/3428278415_81c3e27f15_z.jpg",
+     "http://images.cocodataset.org/val2017/000000039769.jpg"]
+ texts = [
+     "two dogs in a car",
+     "two cats sleeping on a couch"]
+ images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
+
+ processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
+ model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc").to(device)
+
+ inputs = processor(images, texts, padding=True, return_tensors="pt").to(device)
+ outputs = model(**inputs, return_loss=True)
+
+ inputs = processor(images, texts[::-1], padding=True, return_tensors="pt").to(device)
+ outputs_swapped = model(**inputs, return_loss=True)
+
+ print('Loss', outputs.loss.item())
+ print('Loss with swapped images', outputs_swapped.loss.item())
+ # Loss 0.0027269450947642326
+ # Loss with swapped images 2.987490177154541
+ ```
+
+ Here is how to use this model to perform image and text matching:

```python
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
@@ -54,6 +86,7 @@ for text in texts:
    scores[text] = outputs.logits[0,1].item()
```

+
Here is how to use this model to perform masked language modeling:

```python
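
The contrastive snippet in the diff only exercises the loss. The same forward pass also returns projected embeddings that can score image-text pairs directly. Below is a minimal sketch of retrieval-style ranking, reusing `model`, `processor`, `images`, `texts`, and `device` from the snippet above, and assuming the output fields are named `text_embeds` and `image_embeds` (verify against your installed transformers version):

```python
import torch

# Reuse model, processor, images, texts, device from the contrastive snippet above.
with torch.no_grad():
    inputs = processor(images, texts, padding=True, return_tensors="pt").to(device)
    outputs = model(**inputs)

# Assumed field names; normalizing explicitly in case the checkpoint does not.
text_z = torch.nn.functional.normalize(outputs.text_embeds, dim=-1)
image_z = torch.nn.functional.normalize(outputs.image_embeds, dim=-1)

# Cosine similarity matrix: rows are images, columns are texts.
similarity = image_z @ text_z.T
print(similarity.argmax(dim=1))  # expect tensor([0, 1]): each image matches its own caption
```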
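
The two hunks show only the first and last lines of the image-and-text-matching example; the unchanged middle is elided by the diff. For reference, here is a sketch of the full pattern as it appears in the standard BridgeTower docs; the image URL and candidate captions are illustrative assumptions, not lines from this commit:

```python
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
import requests
from PIL import Image

# Illustrative inputs (not part of the diff)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

scores = dict()
for text in texts:
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    # logits[0, 1] is the "match" logit for this image-text pair
    scores[text] = outputs.logits[0, 1].item()
print(scores)
```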
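
The diff also cuts off at the opening fence of the masked-language-modeling example. Here is a minimal sketch of that usage in the same docs pattern; the image URL, the masked sentence, and the decode step are assumptions, not content of this commit:

```python
from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
from PIL import Image
import requests

# Illustrative inputs (not part of the diff); <mask> is the RoBERTa-style mask token
url = "http://images.cocodataset.org/val2017/000000360943.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "a <mask> looking out of the window"

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
# Greedy decode of the predicted token ids back to text
results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
print(results)
```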