VictorSanh committed
Commit: cb68b97
Parent(s): 1eaac25
oops
README.md CHANGED
@@ -116,7 +116,7 @@ The model is trained on the following data mixture of openly accessible English
 
 For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
 
-Following
+Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we apply a layer normalization on the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. We use the [RMSNorm](https://huggingface.co/papers/1910.07467) implementation for trainable Layer Norms.
 
 The training objective is the standard next token prediction.
 
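The two changed-context paragraphs describe a Perceiver-style pooling of vision hidden states, cross-attention fusion into the text stream, and RMSNorm applied to the projected queries and keys of both blocks. The following is a minimal PyTorch sketch of that idea, not the repository's actual code: the class and argument names (`PerceiverPooler`, `CrossAttentionFusion`, `num_latents`, single-head attention) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square norm with a trainable scale (no bias, no mean centering)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


class QKNormCrossAttention(nn.Module):
    """Single-head cross-attention with RMSNorm on the projected queries and keys
    (the QK layer normalization of Dehghani et al., 2023)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        # Trainable layer norms applied after the query/key projections.
        self.q_norm = RMSNorm(dim)
        self.k_norm = RMSNorm(dim)
        self.scale = dim**-0.5

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        q = self.q_norm(self.q_proj(queries))
        k = self.k_norm(self.k_proj(context))
        v = self.v_proj(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.o_proj(attn @ v)


class PerceiverPooler(nn.Module):
    """Pools a variable number of vision hidden states into a fixed set of latents."""

    def __init__(self, dim: int, num_latents: int = 64):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = QKNormCrossAttention(dim)

    def forward(self, vision_states: torch.Tensor) -> torch.Tensor:
        # vision_states: (batch, num_image_tokens, dim) -> (batch, num_latents, dim)
        latents = self.latents.expand(vision_states.shape[0], -1, -1)
        return self.attn(latents, vision_states)


class CrossAttentionFusion(nn.Module):
    """Text hidden states attend to the pooled vision latents and are updated residually."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = QKNormCrossAttention(dim)

    def forward(self, text_states: torch.Tensor, pooled_vision: torch.Tensor) -> torch.Tensor:
        return text_states + self.attn(text_states, pooled_vision)
```

In this sketch, an arbitrary number of image tokens is compressed into `num_latents` vectors before being fused into the text sequence, and the RMS-normalized queries and keys keep the attention logits at a controlled scale, which is the training-stability effect the added README sentence refers to.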