HugoLaurencon committed
Commit: d3f1708
Parent(s): a93d027
Update README.md
README.md CHANGED
@@ -117,14 +117,14 @@ The model is trained on the following data mixture of openly accessible English
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
 | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
 | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | 3.192B | TODO | 3 | 6.15% |
-| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B |
+| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B | 1.120B | 1 | 17.18% |
 | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | 1.6B | 70M | 3 | 2.82% |
 
 **OBELISC** is an open, massive, and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens, and 353M images. An interactive visualization of the dataset content is available [here](TODO).
 
 **Wikipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.
 
-**LAION** is a collection of image-text pairs collected from web pages of Common Crawl, with the texts obtained from each image's alternative (alt) text. We deduplicated it
+**LAION** is a collection of image-text pairs collected from web pages of Common Crawl, with the texts obtained from each image's alternative (alt) text. We deduplicated it (following [this paper](https://arxiv.org/abs/2303.12733)), slightly filtered it, and removed the opted-out images.
 
 **PMD** is a collection of publicly available image-text pair datasets. It contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.
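As a quick sanity check on the mixture, here is a minimal sketch that recomputes each dataset's share under the assumption that a dataset's weight is its effective token count, i.e. tokens × epochs, normalized over the mixture. All numbers come from the table above; the formula itself is an assumption, and the shares it yields land close to, but not exactly on, the listed percentages, so the card's column likely reflects the actual sampling configuration.

```python
# Back-of-the-envelope check of the mixture proportions.
# Assumption (not stated in the card): share ~ tokens * epochs, normalized.
datasets = {
    "OBELISC":   {"tokens_b": 114.9, "epochs": 1},
    "Wikipedia": {"tokens_b": 3.192, "epochs": 3},
    "LAION":     {"tokens_b": 29.9,  "epochs": 1},
    "PMD":       {"tokens_b": 1.6,   "epochs": 3},
}

# Effective tokens seen during training = raw tokens x number of epochs.
effective = {name: d["tokens_b"] * d["epochs"] for name, d in datasets.items()}
total = sum(effective.values())

for name, tok in effective.items():
    print(f"{name:10s} {tok:8.3f}B effective tokens -> {100 * tok / total:5.2f}%")
# Prints roughly 72.2% / 6.0% / 18.8% / 3.0%, close to (but not exactly)
# the card's 73.85% / 6.15% / 17.18% / 2.82%.
```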
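For illustration only, below is a minimal sketch of embedding-based near-duplicate removal over image-text pairs, the general family of technique behind the LAION deduplication mentioned above. It assumes precomputed image embeddings (e.g. CLIP vectors) and a hand-picked similarity threshold; it is a generic greedy approach, not the exact procedure of the cited paper.

```python
import numpy as np

def dedup_by_embedding(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-duplicate removal: keep an item only if its cosine
    similarity to every already-kept item is below `threshold`.

    `embeddings` is an (n, d) array of image embeddings; the threshold
    value here is illustrative, not taken from the paper.
    """
    # L2-normalize so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Toy usage with random vectors standing in for real image embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
emb[1] = emb[0] + 0.01 * rng.normal(size=64)  # plant a near-duplicate
kept = dedup_by_embedding(emb)
assert 0 in kept and 1 not in kept
print(f"kept {len(kept)} of {len(emb)} items")
```

At LAION scale, this quadratic greedy pass would be impractical; an approximate nearest-neighbor index would be needed to find candidate duplicates, but the acceptance rule stays the same.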