yhavinga committed
Commit
e0dfc71
1 Parent(s): 311feca

Update README.md

Files changed (1)
  1. README.md +27 -16
README.md CHANGED
@@ -14,31 +14,42 @@ datasets:
  ---
  # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

- Training is not finished!

- Dataset:

- * [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
- * dataset config: full (33B tokens)

- Tokenizer:

- * Tokenizer trained on mC4 with scripts from the Huggingface
- Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)

- Training details:

- * Trained for 320K of 520K steps (31 dec 2021)
  * Block size: 512
  * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
  * Warmup steps: 5000
  * Weight decay: 0.01

- Work in progress. Dec 2021-Jan2022

- * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
- * Thanks to @gsarti for creating the [t5-flax-gcp
- repository](https://github.com/gsarti/t5-flax-gcp).
- * Also thanks to the creators of [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) and
- [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-persian)
- for sharing their training scripts!

  ---
  # GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

+ A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with a perplexity of 15.2 on cleaned Dutch mC4.

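A minimal generation sketch with 🤗 Transformers (not part of this commit). The repository id below is a placeholder, since the model's Hub id is not stated in the README itself.

```python
# Minimal text-generation sketch. MODEL_ID is a placeholder for this model's
# repository id on the Hub, which is not stated in the README.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "yhavinga/gpt2-medium-dutch"  # placeholder, substitute the actual repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# If only Flax weights are published, add from_flax=True here.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tokenizer("Het weer in Nederland is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```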
+ ## Tokenizer

+ * The tokenizer was trained from scratch for Dutch on mC4 NL cleaned with scripts from the Huggingface
+ Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling); a training sketch is shown below.

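The tokenizer training script itself is not included in this diff. Below is a rough sketch in the style of the linked Flax examples; the vocabulary size, special tokens, and use of streaming are assumptions, not settings taken from this model.

```python
# Rough sketch of GPT2-style BPE tokenizer training, in the style of the
# HF Flax language-modeling examples. Vocab size and special tokens are
# assumptions (standard GPT2 values), not taken from this README.
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# "full" is the dataset config named in the README; streaming avoids
# downloading the whole 33B-token corpus up front.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50_257,                  # assumption: standard GPT2 vocab size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],   # assumption: GPT2's end-of-text token
)
tokenizer.save("tokenizer.json")
```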
+ ## Dataset

+ This model was trained on the `full` configuration (33B tokens) of
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+ which is the original mC4, except (see the filter sketch after this list):

+ * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+ * Sentences with fewer than 3 words are removed
+ * Sentences containing a word of more than 1000 characters are removed
+ * Documents with fewer than 5 sentences are removed
+ * Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+ "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed.
+
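The cleaning itself is implemented in the mc4_nl_cleaned dataset repository and is not part of this diff; the sketch below only illustrates the rules listed above, with a stand-in word list and a naive sentence split.

```python
import re

# Stand-ins: the real filter uses the Dutch and English LDNOOBW lists and the
# dataset's own sentence segmentation; this only illustrates the listed rules.
BAD_WORDS: set = set()   # load the Dutch + English LDNOOBW word lists here
BAD_PHRASES = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy",
    "uses cookies", "use of cookies", "use cookies", "elementen ontbreken",
    "deze printversie",
]

def keep_document(text: str) -> bool:
    """Return True if a document passes the filters described above."""
    lowered = text.lower()
    # Documents containing any of the listed boilerplate phrases are removed.
    if any(phrase in lowered for phrase in BAD_PHRASES):
        return False
    # Documents containing a word from the bad-word lists are removed.
    if any(word in BAD_WORDS for word in re.findall(r"\w+", lowered)):
        return False
    # Sentences with fewer than 3 words, or containing a word longer than
    # 1000 characters, are dropped (naive punctuation-based sentence split).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kept = [
        s for s in sentences
        if len(s.split()) >= 3 and max(len(w) for w in s.split()) <= 1000
    ]
    # Documents left with fewer than 5 sentences are removed.
    return len(kept) >= 5
```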
+ ## Training details

+ * Trained for 320K of 520K steps (61%, 20B tokens)
  * Block size: 512
  * Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98
  * Warmup steps: 5000
  * Weight decay: 0.01

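The training script is not part of this diff. Here is a sketch of how the hyper-parameters above might be wired up with optax, in the style of the HF Flax examples; the linear decay to zero after warmup and the epsilon value are assumptions.

```python
# Sketch of the optimizer implied by the hyper-parameters above, in the style
# of run_clm_flax.py from the HF Flax examples. Only lr, betas, warmup steps
# and weight decay come from the README; the decay shape and eps are assumptions.
import optax

total_steps = 520_000
warmup_steps = 5_000
peak_lr = 8e-4

warmup_fn = optax.linear_schedule(
    init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps
)
decay_fn = optax.linear_schedule(
    init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps
)
lr_schedule = optax.join_schedules(schedules=[warmup_fn, decay_fn], boundaries=[warmup_steps])

# "adam" with weight decay 0.01 corresponds to optax.adamw.
optimizer = optax.adamw(
    learning_rate=lr_schedule, b1=0.9, b2=0.98, eps=1e-8, weight_decay=0.01
)
```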
+ ## Acknowledgements

+ This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
+ instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
+ and getting an idea of sensible hyper-parameters for training GPT2 from scratch:
+
+ * [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)
+ * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
+ * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
+ * [language model training examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)