Data
JW300: English-Zulu
Model Architecture
Text Preprocessing
- Removed blank/empty rows: 9037 samples (0.85%)
- Removed duplicates from source text: 82999 samples (7.88%)
- Removed duplicates from target text: 5045 samples (0.52%)
- Removed numeric-only text: 182 samples (0.02%)
- Removed rows where the source text is 8 characters or fewer: 6272 samples (0.65%)
- Removed rows where the target text is 8 characters or fewer: 713 samples (0.07%)
- Removed rows whose text appears in the test set: 1068 samples (0.11%)
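A minimal pandas sketch of the cleaning steps above. The file paths and column names (`source_sentence` / `target_sentence`) are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical file and column names; the real notebook may differ.
df = pd.read_csv("jw300_en_zu.csv")    # columns: source_sentence, target_sentence
test = pd.read_csv("test.csv")

# Drop blank/empty rows.
df = df.dropna(subset=["source_sentence", "target_sentence"])
df = df[df["source_sentence"].str.strip().astype(bool)]
df = df[df["target_sentence"].str.strip().astype(bool)]

# Drop duplicates on the source side, then on the target side.
df = df.drop_duplicates(subset="source_sentence")
df = df.drop_duplicates(subset="target_sentence")

# Drop numeric-only rows (the original may also check the target side).
df = df[~df["source_sentence"].str.fullmatch(r"\d+")]

# Drop rows where either side is 8 characters or fewer.
df = df[df["source_sentence"].str.len() > 8]
df = df[df["target_sentence"].str.len() > 8]

# Drop rows whose source text also appears in the test set.
df = df[~df["source_sentence"].isin(test["source_sentence"])]
```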
BPE Tokenization
- vocab size: 4000 (better results than a 10× larger vocabulary)
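A hedged sketch of learning a 4000-symbol BPE vocabulary. It uses SentencePiece as a stand-in; the actual notebook may use a different BPE tool, and the file names are assumptions.

```python
import sentencepiece as spm

# Train a joint BPE model on the cleaned training text.
# vocab_size=4000 matches the value above; file names are assumptions.
spm.SentencePieceTrainer.train(
    input="train.en,train.zu",   # combined source + target training files
    model_prefix="enzu_bpe",
    vocab_size=4000,
    model_type="bpe",
)

# Encode a sentence with the trained model to inspect the subword pieces.
sp = spm.SentencePieceProcessor(model_file="enzu_bpe.model")
print(sp.encode("Hello, this is a test sentence.", out_type=str))
```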
Model Config
- Details are in the supplied config file, but the model uses fewer transformer layers than the default notebook, with more attention heads and a smaller embedding size (see the config sketch after this list)
- Trained for 235000 steps
- Training took a few hours at a time on a single P100 GPU on Google Colab, spread over three days (training was stopped, the best model saved, then that model reloaded the next day)
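A sketch of the kind of encoder/decoder settings described above, written as a JoeyNMT-style YAML fragment embedded in Python. The numeric values are illustrative assumptions only; the supplied config file remains the authoritative source.

```python
import yaml

# Illustrative values, not the actual supplied config.
# Field names follow JoeyNMT's transformer config layout.
config_text = """
model:
    encoder:
        type: "transformer"
        num_layers: 4          # fewer layers than the default notebook (assumed value)
        num_heads: 8           # more attention heads (assumed value)
        embeddings:
            embedding_dim: 128 # smaller embedding size (assumed value)
        hidden_size: 128
        ff_size: 512
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 4
        num_heads: 8
        embeddings:
            embedding_dim: 128
        hidden_size: 128
        ff_size: 512
        dropout: 0.3
"""
config = yaml.safe_load(config_text)
print(config["model"]["encoder"]["num_layers"])
```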
Results
Curious analysis of the tokenization
There are 66255 English tokens in the test set, of which 2072 are unique
There are 67851 Zulu tokens in the test set, of which 2336 are unique
These counts were produced in the same notebook used for training. (Could something similar help inform BPE vocab size choices?)
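A small sketch of how such total/unique token counts can be computed from tokenized test files; the file names here are assumptions.

```python
from collections import Counter

def token_stats(path):
    """Count total and unique whitespace-separated tokens in a tokenized text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return sum(counts.values()), len(counts)

# Hypothetical test-set file names.
for lang, path in [("English", "test.bpe.en"), ("Zulu", "test.bpe.zu")]:
    total, unique = token_stats(path)
    print(f"There are {total} {lang} tokens in the test set, {unique} are unique")
```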
Translation results
2019-11-13 07:43:32,728 Hello! This is Joey-NMT.
2019-11-13 07:44:03,502 dev bleu: 13.64 [Beam search decoding with beam size = 5 and alpha = 1.0]
2019-11-13 07:44:24,289 test bleu: 4.87 [Beam search decoding with beam size = 5 and alpha = 1.0]
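The BLEU numbers above come from JoeyNMT's own evaluation log. As an independent sanity check, here is a hedged sketch of rescoring the test translations with sacrebleu; the file names are assumptions, and the score may differ slightly from JoeyNMT's internally reported value.

```python
import sacrebleu

# Hypothetical file names: model outputs and references, one sentence per line.
with open("test_hypotheses.zu", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("test.zu", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Corpus-level BLEU over the whole test set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"test bleu: {bleu.score:.2f}")
```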
Download the model weights from: here