---
language:
- nl
- en
- multilingual
license: apache-2.0
tags:
- dutch
- english
- t5
- t5x
- ul2
- seq2seq
- translation
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/nedd_wiki_news
pipeline_tag: translation
widget:
- text: >-
    Redistricting and West Virginia’s shrinking population forced the state’s
    Republican Legislature to pit Mr. McKinley, a six-term Republican with a
    pragmatic bent, against Mr. Mooney, who has served four terms marked more
    by conservative rhetoric than legislative achievements.
- text: >-
    It is a painful and tragic spectacle that rises before me: I have drawn
    back the curtain from the rottenness of man. This word, in my mouth, is at
    least free from one suspicion: that it involves a moral accusation against
    humanity.
- text: >-
    Young Wehling was hunched in his chair, his head in his hand. He was so
    rumpled, so still and colorless as to be virtually invisible. His
    camouflage was perfect, since the waiting room had a disorderly and
    demoralized air, too. Chairs and ashtrays had been moved away from the
    walls. The floor was paved with spattered dropcloths.
---

# ul2-large-en-nl for English to Dutch translation

Fine-tuned T5 model for English to Dutch translation, pretrained on Dutch using a UL2 (Mixture-of-Denoisers) objective.
The T5 model was introduced in
[this paper](https://arxiv.org/abs/1910.10683)
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
The UL2 objective was introduced in
[this paper](https://arxiv.org/abs/2205.05131)
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).

## Model description

T5 is an encoder-decoder model that treats all NLP problems in a text-to-text format.

`ul2-large-en-nl-v3` is a transformers model fine-tuned on parallel sentence and paragraph pairs
sampled from books.

This model used the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements compared to the original T5 model during the pretraining (the sketch after the list shows how to check these settings):
- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
- Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
- Pre-trained on the self-supervised objective only, without mixing in the downstream tasks
- No parameter sharing between the embedding and classifier layer
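
These design choices are visible in the checkpoint's configuration. A minimal sketch, assuming the config follows the T5 v1.1 conventions described above:

```python
from transformers import AutoConfig

# Load the configuration of this checkpoint from the Hugging Face Hub.
config = AutoConfig.from_pretrained("yhavinga/ul2-large-en-nl-v3")

print(config.feed_forward_proj)    # expected "gated-gelu": GEGLU feed-forward layer
print(config.tie_word_embeddings)  # expected False: no embedding/classifier sharing
```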

### UL2 pretraining objective

This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).

During pre-training, denoising tasks are sampled based on user-specified ratios.
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
denoising task. During pre-training, a paradigm token is inserted into the input
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) to indicate the denoising task at hand.
Then, during fine-tuning, the same paradigm token should be inserted to get the best performance on the corresponding
downstream task, as the sketch below illustrates.
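
As an illustration, a minimal sketch of inserting a paradigm token before tokenization. Whether this particular fine-tuned checkpoint expects such a prefix is not documented here, so treat the choice of `[S2S]` as an assumption:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/ul2-large-en-nl-v3", use_fast=False)

# Prepend the S-denoising (sequential PrefixLM) paradigm token to the input text.
# NOTE: illustrative assumption; the translation example below works without a prefix.
text = "[S2S] Young Wehling was hunched in his chair, his head in his hand."
input_ids = tokenizer(text, return_tensors="pt").input_ids
```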

## Intended uses & limitations

This model was fine-tuned on parallel sentence and paragraph pairs and can be used
for machine translation.

### How to use

Here is how to use this model in PyTorch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_name = "yhavinga/ul2-large-en-nl-v3"
device_num = 0 if torch.cuda.is_available() else -1
device = "cpu" if device_num < 0 else f"cuda:{device_num}"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Beam search; max_length matches the 370-token sequence length used during fine-tuning.
params = {"max_length": 370, "num_beams": 4, "early_stopping": True}
translator = pipeline("translation", tokenizer=tokenizer, model=model, device=device_num)
print(translator(
    "Young Wehling was hunched in his chair, his head in his hand. "
    "He was so rumpled, so still and colorless as to be virtually invisible.",
    **params,
)[0]["translation_text"])
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The `ul2-large-en-nl` T5 model was pre-trained simultaneously on a combination of several datasets,
including the `full` config of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), and a subset of "mc4_nl_cleaned"
containing only texts from Dutch newspapers.

After pre-training, the model was
fine-tuned on a translation dataset containing 13 million sentence and paragraph pairs
sampled from books.

## Training procedure

### Preprocessing

The ul2-large-en-nl T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
The tokenizer includes the special tokens `<pad>`, `</s>` and `<unk>`, known from the original T5 paper,
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
between `dutch` and `Dutch`.
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.
A quick check of these properties is sketched below.
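
A minimal sketch for inspecting these tokenizer properties, assuming the tokenizer shipped with this checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/ul2-large-en-nl-v3", use_fast=False)

print(len(tokenizer))                            # expected 32128, per the total above
print(tokenizer.convert_tokens_to_ids("[S2S]"))  # the S-denoising paradigm token's id
print(tokenizer.tokenize("dutch") == tokenizer.tokenize("Dutch"))  # expected False: case-sensitive
```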

### Fine-tuning

This model was fine-tuned on a dataset containing 13M sentence and paragraph translation pairs sampled
from books for three epochs.

Wandb run: https://wandb.ai/yepster/ul2-large-de-neddx2-en-nl/runs/30arxggk?workspace=user-yepster

* Pre-trained model used as starting point: yhavinga/ul2-large-dutch-english (3150k checkpoint)

For the concluding ~half epoch, a Hugging Face Flax-based trainer was used with the following settings (a sketch of the corresponding optimizer setup follows the list):

- **Batch size**: Total effective batch size of 512, achieved via per-device settings and gradient accumulation.
- **Learning rate**: Set at 0.0009, with a linear schedule and a 500-step warmup.
- **Optimizer**: AdamW with beta1=0.9, beta2=0.997, epsilon=1e-8.
- **Weight decay**: Configured to 0.001 for regularization.
- **Additional parameters**: Dropout rate of 0.01, label smoothing factor of 0.11, and a sequence length of 370 tokens. Model datatype is bfloat16, z_loss at 0.0001.
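
In Optax terms, these settings roughly correspond to the configuration below. This is a sketch, not the run's actual code; `total_steps` is a hypothetical placeholder:

```python
import optax

peak_lr = 0.0009
warmup_steps = 500
total_steps = 100_000  # hypothetical placeholder; the real step count is in the wandb run

# Linear warmup to the peak learning rate, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(
    learning_rate=schedule, b1=0.9, b2=0.997, eps=1e-8, weight_decay=0.001
)
```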

## Evaluation results

TBD

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/).
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)