valhalla committed on
Commit 0993c71
Parent: 2cea2ab

Update README.md

Files changed (1)
  1. README.md +2 -7
README.md CHANGED
@@ -17,15 +17,10 @@ January 2021
 
 ### Model Type
 
- The base model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.
+ The base model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
 
- ### Model Version
+ The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
 
- Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
-
- *This port does not include the ResNet model.*
-
- Please see the paper linked below for further details about their specification.
 
 ### Documents
 
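To make the updated description concrete, here is a minimal zero-shot classification sketch, assuming the ported checkpoint loads through the transformers CLIPModel/CLIPProcessor API. The model id, image URL, and candidate labels below are illustrative assumptions, not something specified by this commit.

```python
# Minimal zero-shot sketch (illustrative; model id and inputs are assumptions).
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"  # assumed ViT-L/14 checkpoint name
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

# Both encoders project into a shared embedding space; logits_per_image holds
# the temperature-scaled image-text similarities the contrastive loss trained.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

The softmax over `logits_per_image` mirrors the training objective: the contrastive loss treats the matching caption as the positive among the candidate texts.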