Data
JW300: English-Zulu
Model Architecture
Text Preprocessing
- Removed blank/empty rows: 9037 samples (0.85%)
- Removed duplicates from source text: 82999 samples (7.88%)
- Removed duplicates from target text: 5045 samples (0.52%)
- Removed numeric-only text: 182 samples (0.02%)
- Removed rows where the source text is 8 characters or fewer: 6272 samples (0.65%)
- Removed rows where the target text is 8 characters or fewer: 713 samples (0.07%)
- Removed rows whose text appears in the test set: 1068 samples (0.11%)
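A minimal pandas sketch of the cleaning steps above. The file paths and column names (`source_sentence` / `target_sentence`) are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical file and column names; the real notebook may differ.
df = pd.read_csv("jw300_en_zu.csv")    # columns: source_sentence, target_sentence
test = pd.read_csv("test.csv")

# Drop blank/empty rows.
df = df.dropna(subset=["source_sentence", "target_sentence"])
df = df[df["source_sentence"].str.strip().astype(bool)]
df = df[df["target_sentence"].str.strip().astype(bool)]

# Drop duplicates on the source side, then on the target side.
df = df.drop_duplicates(subset="source_sentence")
df = df.drop_duplicates(subset="target_sentence")

# Drop numeric-only rows (the original may also check the target side).
df = df[~df["source_sentence"].str.fullmatch(r"\d+")]

# Drop rows where either side is 8 characters or fewer.
df = df[df["source_sentence"].str.len() > 8]
df = df[df["target_sentence"].str.len() > 8]

# Drop rows whose source text also appears in the test set.
df = df[~df["source_sentence"].isin(test["source_sentence"])]
```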
BPE Tokenization
- vocab size: 4000 (better results than a 10× larger vocabulary)
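A hedged sketch of learning a 4000-symbol BPE vocabulary. It uses SentencePiece as a stand-in; the actual notebook may use a different BPE tool, and the file names are assumptions.

```python
import sentencepiece as spm

# Train a joint BPE model on the cleaned training text.
# vocab_size=4000 matches the value above; file names are assumptions.
spm.SentencePieceTrainer.train(
    input="train.en,train.zu",   # combined source + target training files
    model_prefix="enzu_bpe",
    vocab_size=4000,
    model_type="bpe",
)

# Encode a sentence with the trained model to inspect the subword pieces.
sp = spm.SentencePieceProcessor(model_file="enzu_bpe.model")
print(sp.encode("Hello, this is a test sentence.", out_type=str))
```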
Model Config
- Details are in the supplied config file, but the model uses fewer transformer layers than the default notebook, with more attention heads and a smaller embedding size (see the config sketch after this list)
- Trained for 235000 steps
- Training took a few hours at a time on a single P100 GPU on Google Colab, spread over three days (training was stopped, the best model saved, then that model reloaded the next day)
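A sketch of the kind of encoder/decoder settings described above, written as a JoeyNMT-style YAML fragment embedded in Python. The numeric values are illustrative assumptions only; the supplied config file remains the authoritative source.

```python
import yaml

# Illustrative values, not the actual supplied config.
# Field names follow JoeyNMT's transformer config layout.
config_text = """
model:
    encoder:
        type: "transformer"
        num_layers: 4          # fewer layers than the default notebook (assumed value)
        num_heads: 8           # more attention heads (assumed value)
        embeddings:
            embedding_dim: 128 # smaller embedding size (assumed value)
        hidden_size: 128
        ff_size: 512
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 4
        num_heads: 8
        embeddings:
            embedding_dim: 128
        hidden_size: 128
        ff_size: 512
        dropout: 0.3
"""
config = yaml.safe_load(config_text)
print(config["model"]["encoder"]["num_layers"])
```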
Results
Curious analysis of the tokenization
There are 66255 English tokens in the test set, of which 2072 are unique
There are 67851 Zulu tokens in the test set, of which 2336 are unique
These counts were produced in the same notebook used for training. (Could something similar help inform BPE vocab size choices?)
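A small sketch of how such total/unique token counts can be computed from tokenized test files; the file names here are assumptions.

```python
from collections import Counter

def token_stats(path):
    """Count total and unique whitespace-separated tokens in a tokenized text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return sum(counts.values()), len(counts)

# Hypothetical test-set file names.
for lang, path in [("English", "test.bpe.en"), ("Zulu", "test.bpe.zu")]:
    total, unique = token_stats(path)
    print(f"There are {total} {lang} tokens in the test set, {unique} are unique")
```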
Translation results
2019-11-13 07:43:32,728 Hello! This is Joey-NMT.
2019-11-13 07:44:03,502 dev bleu: 13.64 [Beam search decoding with beam size = 5 and alpha = 1.0]
2019-11-13 07:44:24,289 test bleu: 4.87 [Beam search decoding with beam size = 5 and alpha = 1.0]
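The BLEU numbers above come from JoeyNMT's own evaluation log. As an independent sanity check, here is a hedged sketch of rescoring the test translations with sacrebleu; the file names are assumptions, and the score may differ slightly from JoeyNMT's internally reported value.

```python
import sacrebleu

# Hypothetical file names: model outputs and references, one sentence per line.
with open("test_hypotheses.zu", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("test.zu", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Corpus-level BLEU over the whole test set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"test bleu: {bleu.score:.2f}")
```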
Download the model weights from: here