HugoLaurencon committed
Commit: d3f1708
Parent(s): a93d027
Update README.md
README.md CHANGED
@@ -117,14 +117,14 @@ The model is trained on the following data mixture of openly accessible English
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
 | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
 | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | 3.192B | TODO | 3 | 6.15% |
-| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B |
+| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B | 1.120B | 1 | 17.18% |
 | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | 1.6B | 70M | 3 | 2.82% |
 
 **OBELISC** is an open, massive, and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens, and 353M images. An interactive visualization of the dataset content is available [here](TODO).
 
 **Wikipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.
 
-**LAION** is a collection of image-text pairs collected from web pages of Common Crawl, with the texts obtained from each image's alternative (alt) text. We deduplicated it
+**LAION** is a collection of image-text pairs collected from web pages of Common Crawl, with the texts obtained from each image's alternative (alt) text. We deduplicated it (following [this paper](https://arxiv.org/abs/2303.12733)), slightly filtered it, and removed the opted-out images.
 
 **PMD** is a collection of publicly available image-text pair datasets. It contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.
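As a quick sanity check on the mixture, here is a minimal sketch that recomputes each dataset's share under the assumption that a dataset's weight is its effective token count, i.e. tokens × epochs, normalized over the mixture. All numbers come from the table above; the formula itself is an assumption, and the shares it yields land close to, but not exactly on, the listed percentages, so the card's column likely reflects the actual sampling configuration.

```python
# Back-of-the-envelope check of the mixture proportions.
# Assumption (not stated in the card): share ~ tokens * epochs, normalized.
datasets = {
    "OBELISC":   {"tokens_b": 114.9, "epochs": 1},
    "Wikipedia": {"tokens_b": 3.192, "epochs": 3},
    "LAION":     {"tokens_b": 29.9,  "epochs": 1},
    "PMD":       {"tokens_b": 1.6,   "epochs": 3},
}

# Effective tokens seen during training = raw tokens x number of epochs.
effective = {name: d["tokens_b"] * d["epochs"] for name, d in datasets.items()}
total = sum(effective.values())

for name, tok in effective.items():
    print(f"{name:10s} {tok:8.3f}B effective tokens -> {100 * tok / total:5.2f}%")
# Prints roughly 72.2% / 6.0% / 18.8% / 3.0%, close to (but not exactly)
# the card's 73.85% / 6.15% / 17.18% / 2.82%.
```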
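For illustration only, below is a minimal sketch of embedding-based near-duplicate removal over image-text pairs, the general family of technique behind the LAION deduplication mentioned above. It assumes precomputed image embeddings (e.g. CLIP vectors) and a hand-picked similarity threshold; it is a generic greedy approach, not the exact procedure of the cited paper.

```python
import numpy as np

def dedup_by_embedding(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal: keep an item only if its cosine
    similarity to every already-kept item is below `threshold`.

    `embeddings` is an (n, d) array of image embeddings; the threshold
    value here is illustrative, not taken from the paper.
    """
    # L2-normalize so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Toy usage with random vectors standing in for real image embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
emb[1] = emb[0] + 0.01 * rng.normal(size=64)  # plant a near-duplicate
kept = dedup_by_embedding(emb)
assert 0 in kept and 1 not in kept
print(f"kept {len(kept)} of {len(emb)} items")
```

At LAION scale, this quadratic greedy pass would be impractical; an approximate nearest-neighbor index would be needed to find candidate duplicates, but the acceptance rule stays the same.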