yhavinga committed
Commit
e0dfc71
1 Parent(s): 311feca

Update README.md

Files changed (1)
  1. README.md +27 -16
README.md CHANGED
@@ -14,31 +14,42 @@ datasets:
  ---
  # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

- Training is not finished!

- Dataset:

- * [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
- * dataset config: full (33B tokens)

- Tokenizer:

- * Tokenizer trained on mC4 with scripts from the Huggingface
- Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)

- Training details:

- * Trained for 320K of 520K steps (31 dec 2021)
  * Block size: 512
  * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
  * Warmup steps: 5000
  * Weight decay: 0.01

- Work in progress. Dec 2021-Jan2022

- * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
- * Thanks to @gsarti for creating the [t5-flax-gcp
- repository](https://github.com/gsarti/t5-flax-gcp).
- * Also thanks to the creators of [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) and
- [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-persian)
- for sharing their training scripts!

  ---
  # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

+ A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with a perplexity of 15.2 on cleaned Dutch mC4.

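A minimal generation sketch with 🤗 Transformers (not part of this commit). The repository id below is a placeholder, since the model's Hub id is not stated in the README itself.

```python
# Minimal text-generation sketch. MODEL_ID is a placeholder for this model's
# repository id on the Hub, which is not stated in the README.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "yhavinga/gpt2-medium-dutch"  # placeholder, substitute the actual repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# If only Flax weights are published, add from_flax=True here.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tokenizer("Het weer in Nederland is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```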
+ ## Tokenizer

+ * The tokenizer was trained from scratch for Dutch on mC4 NL cleaned with scripts from the Huggingface
+ Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling); a training sketch is shown below.

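The tokenizer training script itself is not included in this diff. Below is a rough sketch in the style of the linked Flax examples; the vocabulary size, special tokens, and use of streaming are assumptions, not settings taken from this model.

```python
# Rough sketch of GPT2-style BPE tokenizer training, in the style of the
# HF Flax language-modeling examples. Vocab size and special tokens are
# assumptions (standard GPT2 values), not taken from this README.
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# "full" is the dataset config named in the README; streaming avoids
# downloading the whole 33B-token corpus up front.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50_257,                  # assumption: standard GPT2 vocab size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],   # assumption: GPT2's end-of-text token
)
tokenizer.save("tokenizer.json")
```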
+ ## Dataset

+ This model was trained on the `full` configuration (33B tokens) of
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+ which is the original mC4, except (see the filter sketch after this list):

+ * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+ * Sentences with fewer than 3 words are removed
+ * Sentences containing a word of more than 1000 characters are removed
+ * Documents with fewer than 5 sentences are removed
+ * Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+ "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed.
+
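The cleaning itself is implemented in the mc4_nl_cleaned dataset repository and is not part of this diff; the sketch below only illustrates the rules listed above, with a stand-in word list and a naive sentence split.

```python
import re

# Stand-ins: the real filter uses the Dutch and English LDNOOBW lists and the
# dataset's own sentence segmentation; this only illustrates the listed rules.
BAD_WORDS: set = set()   # load the Dutch + English LDNOOBW word lists here
BAD_PHRASES = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy",
    "uses cookies", "use of cookies", "use cookies", "elementen ontbreken",
    "deze printversie",
]

def keep_document(text: str) -> bool:
    """Return True if a document passes the filters described above."""
    lowered = text.lower()
    # Documents containing any of the listed boilerplate phrases are removed.
    if any(phrase in lowered for phrase in BAD_PHRASES):
        return False
    # Documents containing a word from the bad-word lists are removed.
    if any(word in BAD_WORDS for word in re.findall(r"\w+", lowered)):
        return False
    # Sentences with fewer than 3 words, or containing a word longer than
    # 1000 characters, are dropped (naive punctuation-based sentence split).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kept = [
        s for s in sentences
        if len(s.split()) >= 3 and max(len(w) for w in s.split()) <= 1000
    ]
    # Documents left with fewer than 5 sentences are removed.
    return len(kept) >= 5
```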
+ ## Training details

+ * Trained for 320K of 520K steps (61%, 20B tokens)
  * Block size: 512
  * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
  * Warmup steps: 5000
  * Weight decay: 0.01

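The training script is not part of this diff. Here is a sketch of how the hyper-parameters above might be wired up with optax, in the style of the HF Flax examples; the linear decay to zero after warmup and the epsilon value are assumptions.

```python
# Sketch of the optimizer implied by the hyper-parameters above, in the style
# of run_clm_flax.py from the HF Flax examples. Only lr, betas, warmup steps
# and weight decay come from the README; the decay shape and eps are assumptions.
import optax

total_steps = 520_000
warmup_steps = 5_000
peak_lr = 8e-4

warmup_fn = optax.linear_schedule(
    init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps
)
decay_fn = optax.linear_schedule(
    init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps
)
lr_schedule = optax.join_schedules(schedules=[warmup_fn, decay_fn], boundaries=[warmup_steps])

# "adam" with weight decay 0.01 corresponds to optax.adamw.
optimizer = optax.adamw(
    learning_rate=lr_schedule, b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01
)
```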
+ ## Acknowledgements

+ This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
+ instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
+ and getting an idea of sensible hyper-parameters for training GPT2 from scratch:
+
+ * [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)
+ * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
+ * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
+ * [language model training examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)