---
language: nl
widget:
- text: "In het jaar 2030 zullen we"
- text: "Toen ik gisteren volledig in de ban was van"
- text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul"
- text: "In Israël was een strenge lockdown"
tags:
- gpt2-medium
- gpt2
pipeline_tag: text-generation
datasets:
- yhavinga/mc4_nl_cleaned
---
# GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, reaching a perplexity of 15.2 on cleaned Dutch mC4.
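
A quick way to try the model is through the `transformers` text-generation pipeline. The snippet below is a minimal sketch; the repository id `yhavinga/gpt2-medium-dutch` is an assumption and may need to be replaced with the actual id of this model card.

```python
from transformers import pipeline

# Minimal generation sketch; the model id below is an assumption and may need
# to be replaced with the actual repository id of this model card.
generator = pipeline("text-generation", model="yhavinga/gpt2-medium-dutch")

output = generator(
    "In het jaar 2030 zullen we",  # one of the widget prompts above
    max_length=50,
    do_sample=True,
    top_p=0.95,
)
print(output[0]["generated_text"])
```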

## Tokenizer

* Tokenizer trained from scratch for Dutch on cleaned Dutch mC4 with scripts from the Hugging Face
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
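
The sketch below shows how such a byte-level BPE tokenizer could be trained with the `tokenizers` library. It is an illustration, not the original training script; the vocabulary size and the `<|endoftext|>` special token are assumed to follow the standard GPT2 setup.

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Stream the cleaned Dutch mC4 training split so it does not have to be fully downloaded.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    """Yield batches of raw document text for tokenizer training."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Byte-level BPE; vocabulary size and special token follow the standard GPT2 setup (assumed).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save("tokenizer.json")
```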

## Dataset

This model was trained on the `full` configuration (33B tokens) of
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4, except that:

  * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
  * Sentences with fewer than 3 words are removed
  * Sentences containing a word of more than 1000 characters are removed
  * Documents with fewer than 5 sentences are removed
  * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
    "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
 
## Training details

* Trained for 320K of 520K steps (61%, 20B tokens)
* Block size: 512
* Optimizer: Adam, learning rate 8e-4, beta1 0.9, beta2 0.98 (see the sketch after this list)
* Warmup steps: 5000
* Weight decay: 0.01
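
The values above could be expressed with `optax` roughly as follows. This is an illustrative sketch, not the exact training configuration; the linear decay shape and the use of AdamW-style decoupled weight decay are assumptions.

```python
import optax

# Hyperparameters taken from the list above: total/warmup steps and peak learning rate.
total_steps = 520_000
warmup_steps = 5_000
peak_lr = 8e-4

# Linear warmup to the peak learning rate, then linear decay (the decay shape is an assumption).
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0,
                              transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# Adam with decoupled weight decay (AdamW) is assumed because a weight decay value is listed.
optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.98,
    weight_decay=0.01,
)
```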

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU VM
and in getting an idea of sensible hyperparameters for training GPT2 from scratch.

* [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)
* [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
* [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
* [language model training examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)