metadata

language: nl
widget:
  - text: In het jaar 2030 zullen we
  - text: Toen ik gisteren volledig in de ban was van
  - text: >-
      Studenten en leraren van de Bogazici Universiteit in de Turkse stad
      Istanbul
  - text: In Israël was een strenge lockdown
tags:
  - gpt2-medium
  - gpt2
pipeline_tag: text-generation
datasets:
  - yhavinga/mc4_nl_cleaned

GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, reaching a perplexity of 15.2 on cleaned Dutch mC4.
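For readers unfamiliar with the metric, the sketch below shows how a perplexity figure relates to the per-token loss of a language model: perplexity is the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the example losses are illustrative assumptions, not part of the training code for this model.

```python
import math


def perplexity(token_nlls):
    """Return exp of the mean per-token negative log-likelihood (in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token losses, for illustration only.
example_nlls = [2.9, 2.5, 2.8, 2.7]
print(perplexity(example_nlls))

# Going the other way: a perplexity of 15.2 corresponds to a mean loss
# of ln(15.2), roughly 2.72 nats per token.
print(math.log(15.2))
```

Lower is better: a perplexity of 15.2 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 15 tokens at each position.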

Tokenizer

The tokenizer was trained from scratch for Dutch on mC4 nl cleaned, using scripts from the Hugging Face Transformers Flax examples.

Dataset

This model was trained on part of the full configuration (33B tokens) of cleaned Dutch mC4, which is the original mC4 with the following filtering applied:

Documents that contain words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed.
Sentences with fewer than 3 words are removed.
Sentences with a word of more than 1000 characters are removed.
Documents with fewer than 5 sentences are removed.
Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
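The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the actual cleaning script: the sentence splitter, the case-insensitive matching, the order of the checks, and counting sentences after sentence-level filtering are all assumptions, and `BAD_WORDS` is a placeholder for the real word list selection.

```python
import re

# Placeholder (assumption): the real filter uses a selection from the
# Dutch and English bad-word lists, which is not reproduced here.
BAD_WORDS = {"badword"}

# Phrases quoted verbatim from the list of rules above.
BANNED_PHRASES = [
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
]

MIN_WORDS_PER_SENTENCE = 3
MAX_WORD_LENGTH = 1000
MIN_SENTENCES_PER_DOCUMENT = 5


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean_document(text):
    """Return the cleaned text, or None if the whole document is dropped."""
    lowered = text.lower()
    # Document-level checks: banned phrases and bad words.
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    if any(word in BAD_WORDS for word in re.findall(r"\w+", lowered)):
        return None

    # Sentence-level checks: too few words, or an overly long word.
    kept = []
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(words) < MIN_WORDS_PER_SENTENCE:
            continue
        if any(len(word) > MAX_WORD_LENGTH for word in words):
            continue
        kept.append(sentence)

    # Document-level check on the number of remaining sentences.
    if len(kept) < MIN_SENTENCES_PER_DOCUMENT:
        return None
    return " ".join(kept)
```

For example, `clean_document("Dit is een zin. " * 5)` keeps all five sentences, while the same text with only four sentences, or with "privacy policy" anywhere in it, returns `None`.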

Training details

Trained for 320K of 520K steps (61%, 20B tokens)
Block size: 512
Optimizer: Adam, lr 8e-4, beta1 0.9, beta2 0.98
Warmup steps: 5000
Weight decay: 0.01
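As a quick consistency check on the numbers above, the arithmetic can be worked through as below. The step counts, block size, and 33B token total come from this card; the per-step figures are derived estimates that assume the full 520K-step schedule covers the 33B tokens exactly once, which the card does not state.

```python
TOTAL_STEPS = 520_000
COMPLETED_STEPS = 320_000
FULL_DATASET_TOKENS = 33e9  # size of the full configuration, per the Dataset section
BLOCK_SIZE = 512

# Share of the planned schedule that was completed, and the tokens that implies.
fraction_done = COMPLETED_STEPS / TOTAL_STEPS
tokens_seen = fraction_done * FULL_DATASET_TOKENS

# Derived estimates (assumption: 520K steps correspond to one pass over 33B tokens).
tokens_per_step = FULL_DATASET_TOKENS / TOTAL_STEPS
sequences_per_step = tokens_per_step / BLOCK_SIZE

print(f"fraction of schedule completed: {fraction_done:.1%}")
print(f"tokens seen: {tokens_seen / 1e9:.1f}B")
print(f"implied sequences per step: {sequences_per_step:.0f}")
```

Under that assumption the completed fraction works out to a little over 61% and about 20B tokens, in line with the figures reported above, and each step would process on the order of 124 sequences of 512 tokens.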

Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was also instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM and in getting an idea of what sensible hyper-parameters are for training gpt2 from scratch.