---
language: nl
widget:
- text: "In het jaar 2030 zullen we"
- text: "Toen ik gisteren volledig in de ban was van"
- text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul"
- text: "In Israël was een strenge lockdown"
tags:
- gpt2-medium
- gpt2
pipeline_tag: text-generation
datasets:
- yhavinga/mc4_nl_cleaned
---
# GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with a perplexity of 15.1 on cleaned Dutch mC4.

## How To Use

You can use this GPT2 model directly with a pipeline for text generation.

```python
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel

MODEL_DIR = 'yhavinga/gpt2-medium-dutch'

# Load the Dutch tokenizer and the pre-trained model weights
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)

# Generation settings such as max_length are passed at call time
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

generated_text = generator('Even later landden wij op het vliegveld van',
                           max_length=100, do_sample=True, top_k=40,
                           top_p=0.95, repetition_penalty=2.0)
print(generated_text[0]['generated_text'])
```

*"Even later landden wij op het vliegveld van" - " Calvi. Wij kregen de gelegenheid om ons van wapens te voorzien, wat ons te pas kwam bij onze pogingen een plaats te veroveren in dat soort wereld waarin alleen mannen zich kunnen bewegen – en vooral als zij alleen maar met elkaar willen praten, omdat er altijd genoeg mensen zijn die de moeite niet nemen om hun te vragen of zij met hen over politiek spreken – en voor we onze tenten opzochten liepen zij nog even binnen langs mijn kantoortje, maar ze hadden er geen"*

## Tokenizer

* BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
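
For readers unfamiliar with BPE, the merge-learning idea can be sketched in a few lines of plain Python. This is a toy, character-level illustration only: the function name `learn_bpe_merges` and the tiny corpus are hypothetical, and it is not the code used by the Flax example scripts, which are assumed to rely on a production tokenizer library.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words (toy version)."""
    # Each distinct word becomes a tuple of symbols, counted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # The most frequent pair becomes the next merge rule.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical mini-corpus, for illustration only
corpus = "de kat en de hond en de vogel".split()
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

Frequent character sequences such as common Dutch function words end up as single tokens early in the merge list, which is why a tokenizer trained on Dutch text tends to encode Dutch more compactly than one trained on English.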

## Dataset

This model was trained on the `full` configuration (33B tokens) of [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned), which is the original mC4, except that:

* Documents that contain words from a selection of the Dutch and English [List of Dirty, Naughty, Obscene, and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences containing a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie" are removed.

## Models

TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.

* `yhavinga/gpt-neo-125M-dutch` is trained on a fraction of C4 containing only Wikipedia and news sites.
* Models with `a`/`b` in the steps column have been trained to step `a` out of a total of `b` steps.
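
As a reading aid for the table below, the following sketch parses the `a`/`b` steps notation into a completed fraction and compares the reported loss with the reported perplexity. It rests on two assumptions not stated in this card: that a bare step count means the run finished, and that perplexity is the exponential of a natural-log cross-entropy loss. The helper name `fraction_completed` is hypothetical; the numbers are copied from the table.

```python
import math


def fraction_completed(steps: str) -> float:
    """Parse a steps entry; 'a/b' gives a/b, a bare number is assumed complete."""
    if "/" not in steps:
        return 1.0
    done, total = (int(part) for part in steps.split("/"))
    return done / total


# (steps, loss, reported ppl) copied from the model table
rows = {
    "gpt-neo-125M-dutch": ("558608", 2.99, 19.9),
    "gpt2-medium-dutch": ("320000/520502", 2.71, 15.1),
    "gpt2-large-dutch": ("1100000/2082009", 2.72, 15.1),
    "gpt-neo-1.3B-dutch": ("960000/3049896", 2.77, 16.0),
}

for name, (steps, loss, ppl) in rows.items():
    print(f"{name}: {fraction_completed(steps):.1%} of planned steps, "
          f"exp(loss) = {math.exp(loss):.1f}, reported ppl = {ppl}")
```

Because the table rounds the loss to two decimals, `exp(loss)` can differ from the reported perplexity in the last digit.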

|                                                                                   | model   | params | train seq len | ppl  | loss | batch size | epochs | steps           | optim     | lr     | duration | config    |
|-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
| [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M   | 512           | 19.9 | 2.99 | 128        | 8      | 558608          | adamw     | 2.4e-3 | 1d 12h   | news+wiki |
| [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch)   | gpt2    | 345M   | 512           | 15.1 | 2.71 | 128        | 4      | 320000/520502   | adafactor | 8e-4   | 7d 2h    | full      |
| [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch)     | gpt2    | 762M   | 512           | 15.1 | 2.72 | 32         | 1      | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h   | large     |
| [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B   | 512           | 16.0 | 2.77 | 16         | 1      | 960000/3049896  | adafactor | 5e-4   | 7d 11h   | full      |

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM and training the models:

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
* [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
* [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-persian)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)