# Masakhane - Machine Translation for African Languages (Using JoeyNMT)

## Note before beginning:
### - The idea is that you should be able to make minimal changes to this in order to get SOME result for your own translation corpus. 

### - The tl;dr: Go to the **"TODO"** comments which will tell you what to update to get up and running

### - If you actually want to have a clue what you're doing, read the text and peek at the links

### - With 100 epochs, it should take around 7 hours to run in Google Colab

### - Once you've gotten a result for your language, please attach and email your notebook that generated it to masakhanetranslation@gmail.com

### - If you care enough and get a chance, doing a brief background on your language would be amazing. See examples in  [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)

## Retrieve your data & make a parallel corpus

If you want to use the JW300 data referenced on the Masakhane website or in our GitHub repo, you can use `opus-tools` to convert the data into a convenient format. `opus_read` from that package reads the native aligned XML files and converts them to TMX format. The tool can also fetch relevant files from OPUS on the fly and filter the data as necessary. [Read the documentation](https://pypi.org/project/opustools-pkg/) for more details.

Once you have your corpus files in TMX format (an XML structure that holds the sentences in your source language and your target language in a single file), we recommend reading them into a pandas dataframe. Thankfully, Jade wrote a silly `tmx2dataframe` package which converts your TMX file to a pandas dataframe.
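To make the TMX structure concrete, here is a minimal sketch that pulls sentence pairs out of a TMX document using only the standard library. It is not the `tmx2dataframe` implementation; the function name `tmx_to_pairs` and the tiny inline sample are assumptions for illustration, and a real TMX file would also carry a header and many more translation units.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) sentence pairs found in a TMX document string.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one <tu> per aligned translation unit
        segments = {}
        for tuv in tu.findall("tuv"):  # one <tuv> per language
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Toy document with a single translation unit.
sample_tmx = """<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Compassion in a Cruel World</seg></tuv>
      <tuv xml:lang="luo"><seg>Kecho Ji e Piny ma Ji Ok Dew Jowetegi</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(sample_tmx, "en", "luo")
print(pairs)
```

The resulting list of tuples can be handed to `pd.DataFrame(pairs, columns=['source_sentence', 'target_sentence'])` if you prefer to work in pandas.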

In [24]:
# from google.colab import drive
# drive.mount('/content/drive')

In [25]:
# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:
# These will also become the suffixes of all vocab and corpus files used throughout
import os
source_language = "en"
target_language = "luo" 
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# This will save it to a folder in our gdrive instead!
# !mkdir -p "/content/drive/My Drive/masakhane/$src-$tgt-$tag"
# os.environ["gdrive_path"] = "/content/drive/My Drive/masakhane/%s-%s-%s" % (source_language, target_language, tag)

In [26]:
# !echo $gdrive_path

In [27]:
# Install opus-tools
#! pip install opustools-pkg

Uncomment cell below if notebook is being run for the first time and you need to download the data

In [5]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q

# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-luo.xml.gz not found. The following files are available for downloading:

   1 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-luo.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip
  14 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/luo.zip

 279 MB Total size
./JW300_latest_xml_en-luo.xml.gz ... 100% of 1 MB
./JW300_latest_xml_en.zip ... 100% of 263 MB
./JW300_latest_xml_luo.zip ... 100% of 14 MB


In [6]:
# Download the global test set.
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
  
# And the specific test set for this language pair.
os.environ["trg"] = target_language 
os.environ["src"] = source_language 

! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$trg.en 
! mv test.en-$trg.en test.en
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$trg.$trg 
! mv test.en-$trg.$trg test.$trg

--2020-02-19 06:24:03--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.192.133, 151.101.128.133, 151.101.64.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277791 (271K) [text/plain]
Saving to: ‘test.en-any.en’


2020-02-19 06:24:04 (6.13 MB/s) - ‘test.en-any.en’ saved [277791/277791]

--2020-02-19 06:24:04--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-luo.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.192.133, 151.101.128.133, 151.101.64.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205584 (201K) [text/plain]
Saving to: ‘test.en-luo.en’


2020-02-19

In [28]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = "test.en-any.en"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3571 global test sentences to filter from the training/dev data.


In [29]:
import pandas as pd

# TMX file to dataframe
source_file = 'jw300.' + source_language
target_file = 'jw300.' + target_language

source = []
target = []
skip_lines = set()  # Collect the line numbers of the source portion to skip the same lines for the target portion (a set keeps the lookup below fast).
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            skip_lines.add(i)
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df.head(3)

Loaded data and skipped 4654/154941 lines since contained in test set.


| | source_sentence | target_sentence |
|---|---|---|
| 0 | Compassion in a Cruel World | Kecho Ji e Piny ma Ji Ok Dew Jowetegi |
| 1 | A MAN in Burundi falls seriously ill with mala... | MALARIA mager ogoyo dichwo moro e piny Burundi . |
| 2 | He urgently needs to be transferred to a hospi... | Dwarore otere e osiptal mapiyo ahinya . |
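The filtering done in the cells above (drop every training pair whose English side also appears in the global test set) can be sketched as one pass over both sides at once. This is an illustrative alternative, not the notebook's code: `filter_held_out` is a hypothetical helper and the sentences are made up.

```python
def filter_held_out(src_lines, tgt_lines, held_out):
    """Drop aligned pairs whose source sentence appears in `held_out`.

    Returns the kept (source, target) pairs and the number of skipped pairs.
    Hypothetical helper for illustration only.
    """
    kept = []
    skipped = 0
    for src, tgt in zip(src_lines, tgt_lines):
        if src.strip() in held_out:
            skipped += 1
        else:
            kept.append((src.strip(), tgt.strip()))
    return kept, skipped


# Made-up toy data: the second source sentence is part of the held-out test set.
toy_src = ["first train sentence\n", "a test sentence\n", "second train sentence\n"]
toy_tgt = ["target one\n", "target two\n", "target three\n"]
toy_test = {"a test sentence"}

kept, skipped = filter_held_out(toy_src, toy_tgt, toy_test)
print(kept, skipped)
```

Iterating with `zip` keeps source and target aligned without tracking line numbers separately, which is the property the two-loop version relies on `skip_lines` for.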


## Pre-processing and export

It is generally a good idea to remove duplicate translations and conflicting translations from the corpus. In practice, these public corpora include some number of these that need to be cleaned.

In addition we will split our data into dev/test/train and export to the filesystem.
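To show what "duplicate" and "conflicting" translations mean here, the following plain-Python sketch applies the same three filters to a made-up list of pairs: drop exact duplicates and empty lines, then keep only the first pair seen for each source sentence, then only the first pair seen for each target sentence. The helper name `drop_conflicts` is an assumption for illustration; the notebook itself does this with pandas.

```python
def drop_conflicts(pairs):
    """Remove duplicate, empty and conflicting (source, target) pairs.

    Hypothetical helper for illustration only.
    """
    # Exact duplicates: dict keys preserve first-seen order.
    unique = list(dict.fromkeys(pairs))
    # Empty lines on either side.
    unique = [(s, t) for s, t in unique if s and t]

    # Conflicts on the source side: keep the first translation of each source.
    seen_src, by_src = set(), []
    for s, t in unique:
        if s not in seen_src:
            seen_src.add(s)
            by_src.append((s, t))

    # Conflicts on the target side: keep the first source for each target.
    seen_tgt, cleaned = set(), []
    for s, t in by_src:
        if t not in seen_tgt:
            seen_tgt.add(t)
            cleaned.append((s, t))
    return cleaned


# Made-up toy corpus.
toy_pairs = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x"), ("", "w"), ("c", "z")]
cleaned = drop_conflicts(toy_pairs)
print(cleaned)
```

Note that the order of the two conflict passes matters: `("b", "x")` is dropped only because `("a", "x")` survived the source-side pass first.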

In [30]:
import numpy as np
# drop duplicate translations
df_pp = df.drop_duplicates()

# drop empty lines (alp)
# Reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning.
df_pp = df_pp.replace('', np.nan)
df_pp = df_pp.dropna(subset=['source_sentence', 'target_sentence'])

# drop conflicting translations
df_pp = df_pp.drop_duplicates(subset='source_sentence')
df_pp = df_pp.drop_duplicates(subset='target_sentence')

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=seed).reset_index(drop=True)


In [31]:
# This section splits the parallel corpus into train/dev and saves each split to its own file.
# We hold out 1000 sentence pairs for dev and use the provided test set.
import csv

# Do the split between dev/train and create parallel corpora
num_dev_patterns = 1000

# Optional: lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:  # Julia: making lowercasing optional
    df_pp["source_sentence"] = df_pp["source_sentence"].str.lower()
    df_pp["target_sentence"] = df_pp["target_sentence"].str.lower()

# Julia: test sets are already generated
dev = df_pp.tail(num_dev_patterns) # Herman: Error in original
stripped = df_pp.drop(df_pp.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
  for index, row in stripped.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")

#stripped[["source_sentence"]].to_csv("train."+source_language, header=False, index=False)  # Herman: Added `header=False` everywhere
#stripped[["target_sentence"]].to_csv("train."+target_language, header=False, index=False)  # Julia: Problematic handling of quotation marks.

#dev[["source_sentence"]].to_csv("dev."+source_language, header=False, index=False)
#dev[["target_sentence"]].to_csv("dev."+target_language, header=False, index=False)


# Doublecheck the format below. There should be no extra quotation marks or weird characters.
! head train.*
! head dev.*

==> train.bpe.en <==
Paul wrote : “ Let the su@@ n not set with you in a pro@@ v@@ ok@@ ed st@@ ate . ”
Some of these en@@ emi@@ es as@@ sa@@ ul@@ t Jehovah’s Witnesses physi@@ c@@ ally .
S@@ T@@ U@@ D@@ Y A@@ R@@ TI@@ C@@ L@@ E@@ S 3 , 4 PA@@ G@@ E@@ S 19 - 28
From inf@@ an@@ cy , children can be@@ gin lear@@ ning about Jehovah’s Word .
In these ways , youn@@ ger ones and those ne@@ w@@ ly associ@@ ated learn to per@@ form ac@@ ts of kindness for others .
O@@ f@@ fered S@@ el@@ ves in W@@ est A@@ fri@@ c@@ a , 1 / 15
Those resurrec@@ ted to heaven will even@@ tually number 14@@ 4@@ ,000 .
A@@ ga@@ inst the bac@@ k@@ d@@ ro@@ p of Satan’s dis@@ as@@ tr@@ ous rule , Jehovah’s perfect qualities are even more ob@@ vi@@ ous than they o@@ ther@@ wise might have been .
The first Kingdom H@@ all in B@@ a@@ uru ​ — a r@@ ented place with a sig@@ n I pa@@ inted , 195@@ 5
In ad@@ dition , his s@@ mok@@ ing end@@ ang@@ ers the health of those around him .

==> train.bpe.luo <==
Paulo nondiko kama



---


## Installation of JoeyNMT

JoeyNMT is a simple, minimalist NMT package which is useful for learning and teaching. Check out the documentation for JoeyNMT [here](https://joeynmt.readthedocs.io)  

In [32]:
# Install JoeyNMT
#! git clone https://github.com/joeynmt/joeynmt.git
#! cd joeynmt; pip3 install .

# Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for agglutinative languages (a feature of most Bantu languages) is using BPE tokenization [(Sennrich, 2015)](https://arxiv.org/abs/1508.07909).

- It was also shown that optimizing the number of BPE codes significantly improves results for low-resourced languages [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021) [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)

- Below we have the scripts for doing BPE tokenization of our data. We use 4000 tokens as recommended by [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021). You do not need to change anything; running the cells below as they are is enough.
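For intuition about what "learning BPE codes" means, here is a toy sketch of the merge-learning loop: count adjacent symbol pairs across the vocabulary, merge the most frequent pair into one symbol, and repeat. It is a simplified illustration, not the `subword-nmt` implementation; the function names and the word counts are assumptions made up for the example.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that `pair` becomes a single symbol."""
    # Lookarounds make sure we only match whole, space-delimited symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the learned merge operations and the rewritten vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Made-up word counts; words are pre-split into characters, "</w>" marks word end.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, merged_vocab = learn_bpe(toy_vocab, 3)
print(merges)
print(merged_vocab)
```

In the notebook, the `-s 4000` argument plays the role of `num_merges`: more merges produce longer subword units and a larger vocabulary.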

In [33]:
# One of the huge boosts in NMT performance was to use a different method of tokenizing. 
# Usually, NMT would tokenize by words. However, using a method called BPE gave amazing boosts to performance

# Do subword NMT
from os import path
os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language

# Learn BPEs on the training data.
os.environ["data_path"] = path.join("../../joeynmt", "data", source_language + target_language) # Herman! 
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

In [34]:
# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt

In [35]:
! sudo chmod 777 ../../joeynmt/data/

In [36]:
# Create directory, copy everything we care about to the correct location
! mkdir -p $data_path
! cp train.* $data_path
! cp test.* $data_path
! cp dev.* $data_path
! cp bpe.codes.4000 $data_path
! ls $data_path

bpe.codes.4000	dev.en	     test.bpe.luo    test.luo	    train.en
dev.bpe.en	dev.luo      test.en	     train.bpe.en   train.luo
dev.bpe.luo	test.bpe.en  test.en-any.en  train.bpe.luo  vocab.txt


In [37]:
# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
# ! cp train.* "$gdrive_path"
# ! cp test.* "$gdrive_path"
# ! cp dev.* "$gdrive_path"
# ! cp bpe.codes.4000 "$gdrive_path"
# ! ls "$gdrive_path"

In [38]:
# Create that vocab using build_vocab
! sudo chmod 777 ../../joeynmt/scripts/build_vocab.py
! ../../joeynmt/scripts/build_vocab.py ../../joeynmt/data/$src$tgt/train.bpe.$src ../../joeynmt/data/$src$tgt/train.bpe.$tgt --output_path ../../joeynmt/data/$src$tgt/vocab.txt

In [39]:
# Some output
! echo "BPE Luo Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 ../../joeynmt/data/$src$tgt/vocab.txt  # Herman

BPE Luo Sentences
O@@ k@@ um@@ ba malach mar yie ( Ne paragraf mar 12 - 14 )
O@@ g@@ ud@@ u mar war@@ ruok ( Ne paragraf mar 15 - 18 )
A@@ se@@ fwenyo ni ji chiko it@@ gi sama gineno ni i@@ hero weche manie Muma , kendo itimo duto ma inyalo mondo i@@ kony@@ gi . ”
L@@ ig@@ ang@@ la mar roho maler ( Ne paragraf mar 19 - 20 )
Kata kamano , kokalo kuom teko mar Jehova wanyalo k@@ wede !
Combined BPE Vocab
portunity
aburi
Q
.E
Ç@@
Ł@@
ʺ
rans@@
oura@@
_
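The `@@` markers in the BPE output above show where a word was split into subword units. A minimal sketch of undoing the segmentation, assuming the default `@@ ` separator: the regular expression mirrors the `sed` one-liner suggested in the `subword-nmt` README, and `undo_bpe` is a hypothetical helper name.

```python
import re


def undo_bpe(line):
    """Join subword units back into words by removing '@@ ' separators.

    Hypothetical helper for illustration only.
    """
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


segmented = "O@@ k@@ um@@ ba malach mar yie ( Ne paragraf mar 12 - 14 )"
restored = undo_bpe(segmented)
print(restored)
```

This is handy when you want to read model output or score it against un-segmented references.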


In [40]:
# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
#! cp train.* "$gdrive_path"
#! cp test.* "$gdrive_path"
#! cp dev.* "$gdrive_path"
#! cp bpe.codes.4000 "$gdrive_path"
#! ls "$gdrive_path"

# Creating the JoeyNMT Config

JoeyNMT requires a yaml config. We provide a template below. We've also set a number of defaults with it, that you may play with!

- We use the Transformer architecture
- We set dropout reasonably high: 0.3 (recommended in [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021))

Things worth playing with:
- The batch size (also recommended to change for low-resourced languages)
- The number of epochs (we've set it to 30 for testing purposes; in the run logged in this notebook each epoch takes about 9.5 minutes, so budget several hours rather than one)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus chrF)
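Another knob is the learning-rate schedule: the config's `scheduling` entry carries a TODO about trying Noam scheduling instead of plateau. The sketch below shows the schedule as described in the original Transformer paper (linear warmup, then inverse square-root decay), plugged with the `learning_rate_factor`, `learning_rate_warmup` and `hidden_size` values from the template. That JoeyNMT's Noam option follows this formula term for term is an assumption here; check its documentation before relying on exact values. `noam_lr` is a hypothetical helper name.

```python
def noam_lr(step, hidden_size=256, factor=0.5, warmup=1000):
    """Learning rate at `step` under the warmup / inverse-sqrt schedule.

    Hypothetical helper; defaults are taken from the config template.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises during warmup, peaks at step == warmup, then decays.
for step in (100, 500, 1000, 4000, 16000):
    print(step, noam_lr(step))
```

With these values the peak rate is a little under 0.001, noticeably higher than the fixed `learning_rate: 0.0003` used with plateau scheduling, which is one reason the two schedules are not drop-in replacements for each other.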

In [41]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
# gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "models/{name}_transformer/1000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # TODO: Decrease while playing around and checking that things work. Around 30 is sufficient to check whether it's working at all
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="n/a", source_language=source_language, target_language=target_language)
with open("../../joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

# Train the Model

This single line of joeynmt runs the training using the config we made above.
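Training prints a `total training loss` line at the end of every epoch, so the timestamps on those lines give a quick estimate of how long a full run will take. The sketch below is an illustration with a hypothetical helper, `seconds_per_epoch`; the three sample lines are copied from this notebook's training output.

```python
import re
from datetime import datetime

# Matches end-of-epoch lines such as:
# "2020-02-20 12:12:07,028 Epoch   1: total training loss 6731.63"
EPOCH_END = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) Epoch\s+(\d+): total training loss"
)


def seconds_per_epoch(log_lines):
    """Average wall-clock seconds per epoch between the first and last epoch-end lines."""
    stamps = []
    for line in log_lines:
        match = EPOCH_END.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            stamps.append((int(match.group(2)), when))
    if len(stamps) < 2:
        raise ValueError("need at least two end-of-epoch lines")
    (first_epoch, first_time), (last_epoch, last_time) = stamps[0], stamps[-1]
    return (last_time - first_time).total_seconds() / (last_epoch - first_epoch)


sample_log = [
    "2020-02-20 12:12:07,028 Epoch   1: total training loss 6731.63",
    "2020-02-20 12:31:07,842 Epoch   3: total training loss 4516.30",
    "2020-02-20 12:50:10,224 Epoch   5: total training loss 3924.77",
]

per_epoch = seconds_per_epoch(sample_log)
print("about {:.1f} minutes per epoch".format(per_epoch / 60))
```

Multiplying the result by the `epochs` value in the config gives a rough upper bound on run time, ignoring validation passes and early stopping.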

In [42]:
! ls -la ../../joeynmt

total 136
drwxr-xr-x 11 root root  4096 Oct 25 12:22 .
drwxrwxrwx 11 root root  4096 Feb 19 06:15 ..
drwxr-xr-x  8 root root  4096 Oct 24 15:14 .git
-rw-r--r--  1 root root    49 Oct 24 15:14 .gitattributes
drwxr-xr-x  3 root root  4096 Oct 24 15:14 .github
-rw-r--r--  1 root root    71 Oct 24 15:14 .gitignore
-rw-r--r--  1 root root 13514 Oct 24 15:14 .pylintrc
-rw-r--r--  1 root root   159 Oct 24 15:14 .readthedocs.yml
-rw-r--r--  1 root root   542 Oct 24 15:14 .travis.yml
-rwxrw-rwx  1 root root  3354 Oct 24 15:14 CODE_OF_CONDUCT.md
-rwxrw-rwx  1 root root  1071 Oct 24 15:14 LICENSE
-rwxrw-rwx  1 root root 13286 Oct 24 15:14 README.md
-rwxrw-rwx  1 root root  8229 Oct 24 15:14 benchmarks.md
drwxrw-rwx  3 root root  4096 Feb 19 06:30 configs
drwxrwxrwx  8 root root  4096 Feb 19 06:35 data
drwxrw-rwx  4 root root  4096 Oct 24 15:14 docs
-rwxrw-rwx  1 root root 14373 Oct 24 15:14 joey-small.png
drwxrw-rwx  3 root root  4096 Oct 24 16:35 joeynmt
drwxrwxrwx  7 root roo

In [43]:
! sudo chmod 777 ../../joeynmt/models

In [44]:
# Train the model
# You can press Ctrl-C to stop. And then run the next cell to save your checkpoints! 
! cd ../../joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2020-02-20 12:03:17,437 Hello! This is Joey-NMT.
2020-02-20 12:03:17,444 Total params: 12144896
2020-02-20 12:03:17,445 Trainable parameters: ['decoder.layer_norm.bias', 'decoder.layer_norm.weight', 'decoder.layers.0.dec_layer_norm.bias', 'decoder.layers.0.dec_layer_norm.weight', 'decoder.layers.0.feed_forward.layer_norm.bias', 'decoder.layers.0.feed_forward.layer_norm.weight', 'decoder.layers.0.feed_forward.pwff_layer.0.bias', 'decoder.layers.0.feed_forward.pwff_layer.0.weight', 'decoder.layers.0.feed_forward.pwff_layer.3.bias', 'decoder.layers.0.feed_forward.pwff_layer.3.weight', 'decoder.layers.0.src_trg_att.k_layer.bias', 'decoder.layers.0.src_trg_att.k_layer.weight', 'decoder.l

2020-02-20 12:03:20,118 cfg.name                           : enluo_transformer
2020-02-20 12:03:20,118 cfg.data.src                       : en
2020-02-20 12:03:20,118 cfg.data.trg                       : luo
2020-02-20 12:03:20,118 cfg.data.train                     : data/enluo/train.bpe
2020-02-20 12:03:20,118 cfg.data.dev                       : data/enluo/dev.bpe
2020-02-20 12:03:20,118 cfg.data.test                      : data/enluo/test.bpe
2020-02-20 12:03:20,118 cfg.data.level                     : bpe
2020-02-20 12:03:20,118 cfg.data.lowercase                 : False
2020-02-20 12:03:20,118 cfg.data.max_sent_length           : 100
2020-02-20 12:03:20,118 cfg.data.src_vocab                 : data/enluo/vocab.txt
2020-02-20 12:03:20,118 cfg.data.trg_vocab                 : data/enluo/vocab.txt
2020-02-20 12:03:20,118 cfg.testing.beam_size              : 5
2020-02-20 12:03:20,118 cfg.testing.alpha                  : 1.0
2020-02-20 12:03:20,119 cfg.training.random_seed           :

2020-02-20 12:10:07,191 Epoch   1 Step:     1100 Batch Loss:     3.762488 Tokens per Sec:     7596, Lr: 0.000300
2020-02-20 12:10:36,062 Epoch   1 Step:     1200 Batch Loss:     4.396629 Tokens per Sec:     7597, Lr: 0.000300
2020-02-20 12:11:04,709 Epoch   1 Step:     1300 Batch Loss:     4.256168 Tokens per Sec:     7514, Lr: 0.000300
2020-02-20 12:11:33,621 Epoch   1 Step:     1400 Batch Loss:     3.945779 Tokens per Sec:     7441, Lr: 0.000300
2020-02-20 12:12:02,358 Epoch   1 Step:     1500 Batch Loss:     3.935235 Tokens per Sec:     7623, Lr: 0.000300
2020-02-20 12:12:07,028 Epoch   1: total training loss 6731.63
2020-02-20 12:12:07,029 EPOCH 2
2020-02-20 12:12:31,139 Epoch   2 Step:     1600 Batch Loss:     3.035681 Tokens per Sec:     7441, Lr: 0.000300
2020-02-20 12:12:59,846 Epoch   2 Step:     1700 Batch Loss:     3.587291 Tokens per Sec:     7593, Lr: 0.000300
2020-02-20 12:13:28,871 Epoch   2 Step:     1800 Batch Loss:     3.648604 Tokens per Sec:     7519, Lr: 0.000300

2020-02-20 12:28:56,806 Epoch   3 Step:     4100 Batch Loss:     2.952160 Tokens per Sec:     7502, Lr: 0.000300
2020-02-20 12:29:25,371 Epoch   3 Step:     4200 Batch Loss:     3.406082 Tokens per Sec:     7499, Lr: 0.000300
2020-02-20 12:29:54,297 Epoch   3 Step:     4300 Batch Loss:     3.000547 Tokens per Sec:     7585, Lr: 0.000300
2020-02-20 12:30:22,684 Epoch   3 Step:     4400 Batch Loss:     3.101769 Tokens per Sec:     7585, Lr: 0.000300
2020-02-20 12:30:51,661 Epoch   3 Step:     4500 Batch Loss:     2.918369 Tokens per Sec:     7768, Lr: 0.000300
2020-02-20 12:31:07,842 Epoch   3: total training loss 4516.30
2020-02-20 12:31:07,843 EPOCH 4
2020-02-20 12:31:20,580 Epoch   4 Step:     4600 Batch Loss:     2.786486 Tokens per Sec:     7292, Lr: 0.000300
2020-02-20 12:31:49,217 Epoch   4 Step:     4700 Batch Loss:     2.668926 Tokens per Sec:     7503, Lr: 0.000300
2020-02-20 12:32:18,079 Epoch   4 Step:     4800 Batch Loss:     2.563730 Tokens per Sec:     7651, Lr: 0.000300

2020-02-20 12:47:46,809 Epoch   5 Step:     7100 Batch Loss:     2.377353 Tokens per Sec:     7536, Lr: 0.000300
2020-02-20 12:48:15,570 Epoch   5 Step:     7200 Batch Loss:     1.793351 Tokens per Sec:     7642, Lr: 0.000300
2020-02-20 12:48:44,278 Epoch   5 Step:     7300 Batch Loss:     2.368781 Tokens per Sec:     7535, Lr: 0.000300
2020-02-20 12:49:12,767 Epoch   5 Step:     7400 Batch Loss:     3.025920 Tokens per Sec:     7526, Lr: 0.000300
2020-02-20 12:49:41,451 Epoch   5 Step:     7500 Batch Loss:     2.123299 Tokens per Sec:     7625, Lr: 0.000300
2020-02-20 12:50:10,079 Epoch   5 Step:     7600 Batch Loss:     2.233503 Tokens per Sec:     7551, Lr: 0.000300
2020-02-20 12:50:10,224 Epoch   5: total training loss 3924.77
2020-02-20 12:50:10,224 EPOCH 6
2020-02-20 12:50:38,666 Epoch   6 Step:     7700 Batch Loss:     2.196222 Tokens per Sec:     7480, Lr: 0.000300
2020-02-20 12:51:07,374 Epoch   6 Step:     7800 Batch Loss:     2.038606 Tokens per Sec:     7539, Lr: 0.000300

2020-02-20 13:06:36,315 Epoch   7 Step:    10100 Batch Loss:     2.194897 Tokens per Sec:     7569, Lr: 0.000300
2020-02-20 13:07:04,902 Epoch   7 Step:    10200 Batch Loss:     2.210200 Tokens per Sec:     7497, Lr: 0.000300
2020-02-20 13:07:33,669 Epoch   7 Step:    10300 Batch Loss:     1.960394 Tokens per Sec:     7572, Lr: 0.000300
2020-02-20 13:08:02,569 Epoch   7 Step:    10400 Batch Loss:     2.159416 Tokens per Sec:     7490, Lr: 0.000300
2020-02-20 13:08:31,622 Epoch   7 Step:    10500 Batch Loss:     2.274217 Tokens per Sec:     7653, Lr: 0.000300
2020-02-20 13:09:00,132 Epoch   7 Step:    10600 Batch Loss:     2.411942 Tokens per Sec:     7477, Lr: 0.000300
2020-02-20 13:09:13,505 Epoch   7: total training loss 3569.71
2020-02-20 13:09:13,506 EPOCH 8
2020-02-20 13:09:28,869 Epoch   8 Step:    10700 Batch Loss:     2.155770 Tokens per Sec:     7371, Lr: 0.000300
2020-02-20 13:09:57,690 Epoch   8 Step:    10800 Batch Loss:     1.774834 Tokens per Sec:     7595, Lr: 0.000300

2020-02-20 13:25:26,199 Epoch   9 Step:    13100 Batch Loss:     1.966934 Tokens per Sec:     7477, Lr: 0.000300
2020-02-20 13:25:54,608 Epoch   9 Step:    13200 Batch Loss:     2.030701 Tokens per Sec:     7424, Lr: 0.000300
2020-02-20 13:26:23,230 Epoch   9 Step:    13300 Batch Loss:     2.384551 Tokens per Sec:     7609, Lr: 0.000300
2020-02-20 13:26:52,018 Epoch   9 Step:    13400 Batch Loss:     2.075567 Tokens per Sec:     7556, Lr: 0.000300
2020-02-20 13:27:20,908 Epoch   9 Step:    13500 Batch Loss:     2.342129 Tokens per Sec:     7744, Lr: 0.000300
2020-02-20 13:27:50,110 Epoch   9 Step:    13600 Batch Loss:     2.523449 Tokens per Sec:     7784, Lr: 0.000300
2020-02-20 13:28:13,814 Epoch   9: total training loss 3318.59
2020-02-20 13:28:13,814 EPOCH 10
2020-02-20 13:28:19,276 Epoch  10 Step:    13700 Batch Loss:     2.016206 Tokens per Sec:     7363, Lr: 0.000300
2020-02-20 13:28:48,178 Epoch  10 Step:    13800 Batch Loss:     2.267726 Tokens per Sec:     7665, Lr: 0.000300


2020-02-20 13:44:15,485 Epoch  11 Step:    16100 Batch Loss:     1.779898 Tokens per Sec:     7723, Lr: 0.000300
2020-02-20 13:44:44,008 Epoch  11 Step:    16200 Batch Loss:     1.693069 Tokens per Sec:     7414, Lr: 0.000300
2020-02-20 13:45:12,976 Epoch  11 Step:    16300 Batch Loss:     2.256828 Tokens per Sec:     7699, Lr: 0.000300
2020-02-20 13:45:41,342 Epoch  11 Step:    16400 Batch Loss:     2.473014 Tokens per Sec:     7479, Lr: 0.000300
2020-02-20 13:46:10,144 Epoch  11 Step:    16500 Batch Loss:     1.979323 Tokens per Sec:     7733, Lr: 0.000300
2020-02-20 13:46:38,915 Epoch  11 Step:    16600 Batch Loss:     2.123430 Tokens per Sec:     7579, Lr: 0.000300
2020-02-20 13:47:07,672 Epoch  11 Step:    16700 Batch Loss:     1.495558 Tokens per Sec:     7603, Lr: 0.000300
2020-02-20 13:47:13,399 Epoch  11: total training loss 3163.59
2020-02-20 13:47:13,400 EPOCH 12
2020-02-20 13:47:36,731 Epoch  12 Step:    16800 Batch Loss:     1.909336 Tokens per Sec:     7616, Lr: 0.000300


2020-02-20 14:03:03,626 Epoch  13 Step:    19100 Batch Loss:     2.026862 Tokens per Sec:     7638, Lr: 0.000300
2020-02-20 14:03:32,593 Epoch  13 Step:    19200 Batch Loss:     2.185729 Tokens per Sec:     7646, Lr: 0.000300
2020-02-20 14:04:01,428 Epoch  13 Step:    19300 Batch Loss:     1.903270 Tokens per Sec:     7637, Lr: 0.000300
2020-02-20 14:04:29,793 Epoch  13 Step:    19400 Batch Loss:     2.150892 Tokens per Sec:     7482, Lr: 0.000300
2020-02-20 14:04:58,275 Epoch  13 Step:    19500 Batch Loss:     1.806008 Tokens per Sec:     7577, Lr: 0.000300
2020-02-20 14:05:27,178 Epoch  13 Step:    19600 Batch Loss:     1.681479 Tokens per Sec:     7557, Lr: 0.000300
2020-02-20 14:05:55,988 Epoch  13 Step:    19700 Batch Loss:     1.494733 Tokens per Sec:     7634, Lr: 0.000300
2020-02-20 14:06:12,684 Epoch  13: total training loss 3037.68
2020-02-20 14:06:12,684 EPOCH 14
2020-02-20 14:06:24,718 Epoch  14 Step:    19800 Batch Loss:     1.871077 Tokens per Sec:     7269, Lr: 0.000300


2020-02-20 14:21:52,230 Epoch  15 Step:    22100 Batch Loss:     1.829914 Tokens per Sec:     7648, Lr: 0.000300
2020-02-20 14:22:21,127 Epoch  15 Step:    22200 Batch Loss:     2.466875 Tokens per Sec:     7643, Lr: 0.000300
2020-02-20 14:22:50,008 Epoch  15 Step:    22300 Batch Loss:     1.900896 Tokens per Sec:     7610, Lr: 0.000300
2020-02-20 14:23:18,907 Epoch  15 Step:    22400 Batch Loss:     1.941167 Tokens per Sec:     7638, Lr: 0.000300
2020-02-20 14:23:47,500 Epoch  15 Step:    22500 Batch Loss:     2.090586 Tokens per Sec:     7550, Lr: 0.000300
2020-02-20 14:24:16,126 Epoch  15 Step:    22600 Batch Loss:     2.041277 Tokens per Sec:     7500, Lr: 0.000300
2020-02-20 14:24:44,754 Epoch  15 Step:    22700 Batch Loss:     2.085501 Tokens per Sec:     7495, Lr: 0.000300
2020-02-20 14:25:13,195 Epoch  15 Step:    22800 Batch Loss:     2.269855 Tokens per Sec:     7553, Lr: 0.000300
2020-02-20 14:25:13,805 Epoch  15: total training loss 2935.06
2020-02-20 14:25:13,806 EPOCH 16


2020-02-20 14:40:41,844 Epoch  17 Step:    25100 Batch Loss:     1.930991 Tokens per Sec:     7626, Lr: 0.000300
2020-02-20 14:41:10,728 Epoch  17 Step:    25200 Batch Loss:     1.790509 Tokens per Sec:     7552, Lr: 0.000300
2020-02-20 14:41:39,542 Epoch  17 Step:    25300 Batch Loss:     1.659317 Tokens per Sec:     7537, Lr: 0.000300
2020-02-20 14:42:08,450 Epoch  17 Step:    25400 Batch Loss:     1.727902 Tokens per Sec:     7537, Lr: 0.000300
2020-02-20 14:42:37,179 Epoch  17 Step:    25500 Batch Loss:     2.255066 Tokens per Sec:     7499, Lr: 0.000300
2020-02-20 14:43:06,080 Epoch  17 Step:    25600 Batch Loss:     1.129977 Tokens per Sec:     7543, Lr: 0.000300
2020-02-20 14:43:35,073 Epoch  17 Step:    25700 Batch Loss:     1.938497 Tokens per Sec:     7640, Lr: 0.000300
2020-02-20 14:44:04,141 Epoch  17 Step:    25800 Batch Loss:     1.842105 Tokens per Sec:     7640, Lr: 0.000300
2020-02-20 14:44:15,739 Epoch  17: total training loss 2840.30
2020-02-20 14:44:15,740 EPOCH 18


2020-02-20 14:59:31,948 Epoch  19 Step:    28100 Batch Loss:     1.910940 Tokens per Sec:     7711, Lr: 0.000300
2020-02-20 15:00:00,620 Epoch  19 Step:    28200 Batch Loss:     1.857561 Tokens per Sec:     7624, Lr: 0.000300
2020-02-20 15:00:29,051 Epoch  19 Step:    28300 Batch Loss:     1.970976 Tokens per Sec:     7407, Lr: 0.000300
2020-02-20 15:00:58,128 Epoch  19 Step:    28400 Batch Loss:     1.691468 Tokens per Sec:     7570, Lr: 0.000300
2020-02-20 15:01:26,796 Epoch  19 Step:    28500 Batch Loss:     1.886670 Tokens per Sec:     7515, Lr: 0.000300
2020-02-20 15:01:55,426 Epoch  19 Step:    28600 Batch Loss:     2.225640 Tokens per Sec:     7532, Lr: 0.000300
2020-02-20 15:02:24,377 Epoch  19 Step:    28700 Batch Loss:     1.406204 Tokens per Sec:     7593, Lr: 0.000300
2020-02-20 15:02:53,014 Epoch  19 Step:    28800 Batch Loss:     1.695791 Tokens per Sec:     7448, Lr: 0.000300
2020-02-20 15:03:18,084 Epoch  19: total training loss 2790.81
2020-02-20 15:03:18,085 EPOCH 20


2020-02-20 15:18:19,622 Epoch  21 Step:    31100 Batch Loss:     1.819182 Tokens per Sec:     7524, Lr: 0.000300
2020-02-20 15:18:48,484 Epoch  21 Step:    31200 Batch Loss:     1.671183 Tokens per Sec:     7697, Lr: 0.000300
2020-02-20 15:19:17,193 Epoch  21 Step:    31300 Batch Loss:     1.073250 Tokens per Sec:     7488, Lr: 0.000300
2020-02-20 15:19:45,994 Epoch  21 Step:    31400 Batch Loss:     1.645528 Tokens per Sec:     7604, Lr: 0.000300
2020-02-20 15:20:14,749 Epoch  21 Step:    31500 Batch Loss:     1.817031 Tokens per Sec:     7573, Lr: 0.000300
2020-02-20 15:20:43,542 Epoch  21 Step:    31600 Batch Loss:     1.894629 Tokens per Sec:     7575, Lr: 0.000300
2020-02-20 15:21:12,418 Epoch  21 Step:    31700 Batch Loss:     1.885237 Tokens per Sec:     7576, Lr: 0.000300
2020-02-20 15:21:41,376 Epoch  21 Step:    31800 Batch Loss:     1.353623 Tokens per Sec:     7636, Lr: 0.000300
2020-02-20 15:22:10,286 Epoch  21 Step:    31900 Batch Loss:     1.591935 Tokens per Sec:     75

2020-02-20 15:37:08,695 Epoch  23 Step:    34100 Batch Loss:     1.914464 Tokens per Sec:     7584, Lr: 0.000300
2020-02-20 15:37:37,404 Epoch  23 Step:    34200 Batch Loss:     1.760642 Tokens per Sec:     7584, Lr: 0.000300
2020-02-20 15:38:06,019 Epoch  23 Step:    34300 Batch Loss:     1.812148 Tokens per Sec:     7586, Lr: 0.000300
2020-02-20 15:38:34,760 Epoch  23 Step:    34400 Batch Loss:     1.241874 Tokens per Sec:     7601, Lr: 0.000300
2020-02-20 15:39:03,539 Epoch  23 Step:    34500 Batch Loss:     1.844906 Tokens per Sec:     7624, Lr: 0.000300
2020-02-20 15:39:32,884 Epoch  23 Step:    34600 Batch Loss:     1.787987 Tokens per Sec:     7863, Lr: 0.000300
2020-02-20 15:40:01,826 Epoch  23 Step:    34700 Batch Loss:     1.775381 Tokens per Sec:     7588, Lr: 0.000300
2020-02-20 15:40:30,810 Epoch  23 Step:    34800 Batch Loss:     1.828186 Tokens per Sec:     7622, Lr: 0.000300
2020-02-20 15:40:59,724 Epoch  23 Step:    34900 Batch Loss:     1.916007 Tokens per Sec:     75

2020-02-20 15:55:30,863 	Reference:  Noyudo Dunn osendiko wach “ loso gik moko duto odok makare , kaka Nyasaye nowacho gi dho jonabi duto maler nyaka a chakruok piny . ”
2020-02-20 15:55:30,863 	Hypothesis: Noyudo osendiko e wi “ gik moko duto ma Nyasaye nowuoyo kuom dho jonabi machon . ”
2020-02-20 15:55:30,863 Validation result at epoch  25, step    37000: bleu:  21.67, loss: 42305.6289, ppl:   5.5644, duration: 88.3971s
2020-02-20 15:55:59,189 Epoch  25 Step:    37100 Batch Loss:     1.815181 Tokens per Sec:     7414, Lr: 0.000300
2020-02-20 15:56:28,284 Epoch  25 Step:    37200 Batch Loss:     1.838013 Tokens per Sec:     7600, Lr: 0.000300
2020-02-20 15:56:56,996 Epoch  25 Step:    37300 Batch Loss:     1.826828 Tokens per Sec:     7536, Lr: 0.000300
2020-02-20 15:57:25,629 Epoch  25 Step:    37400 Batch Loss:     1.523698 Tokens per Sec:     7541, Lr: 0.000300
2020-02-20 15:57:54,631 Epoch  25 Step:    37500 Batch Loss:     1.658947 Tokens per Sec:     7609, Lr: 0.000300
2020-02-

2020-02-20 16:14:49,104 Epoch  27 Step:    40100 Batch Loss:     1.829319 Tokens per Sec:     7649, Lr: 0.000300
2020-02-20 16:15:17,967 Epoch  27 Step:    40200 Batch Loss:     1.796776 Tokens per Sec:     7594, Lr: 0.000300
2020-02-20 16:15:47,029 Epoch  27 Step:    40300 Batch Loss:     1.575827 Tokens per Sec:     7586, Lr: 0.000300
2020-02-20 16:16:15,906 Epoch  27 Step:    40400 Batch Loss:     1.830190 Tokens per Sec:     7541, Lr: 0.000300
2020-02-20 16:16:44,727 Epoch  27 Step:    40500 Batch Loss:     1.579970 Tokens per Sec:     7617, Lr: 0.000300
2020-02-20 16:17:13,546 Epoch  27 Step:    40600 Batch Loss:     1.676199 Tokens per Sec:     7521, Lr: 0.000300
2020-02-20 16:17:42,689 Epoch  27 Step:    40700 Batch Loss:     1.726659 Tokens per Sec:     7699, Lr: 0.000300
2020-02-20 16:18:11,558 Epoch  27 Step:    40800 Batch Loss:     1.675419 Tokens per Sec:     7484, Lr: 0.000300
2020-02-20 16:18:40,394 Epoch  27 Step:    40900 Batch Loss:     1.705130 Tokens per Sec:     74

2020-02-20 16:33:39,037 Epoch  29 Step:    43100 Batch Loss:     1.562337 Tokens per Sec:     7495, Lr: 0.000300
2020-02-20 16:34:07,539 Epoch  29 Step:    43200 Batch Loss:     1.512897 Tokens per Sec:     7540, Lr: 0.000300
2020-02-20 16:34:36,579 Epoch  29 Step:    43300 Batch Loss:     1.912865 Tokens per Sec:     7742, Lr: 0.000300
2020-02-20 16:35:05,179 Epoch  29 Step:    43400 Batch Loss:     1.795569 Tokens per Sec:     7543, Lr: 0.000300
2020-02-20 16:35:34,011 Epoch  29 Step:    43500 Batch Loss:     1.239226 Tokens per Sec:     7626, Lr: 0.000300
2020-02-20 16:36:02,734 Epoch  29 Step:    43600 Batch Loss:     1.947909 Tokens per Sec:     7432, Lr: 0.000300
2020-02-20 16:36:31,516 Epoch  29 Step:    43700 Batch Loss:     1.863791 Tokens per Sec:     7587, Lr: 0.000300
2020-02-20 16:37:00,585 Epoch  29 Step:    43800 Batch Loss:     1.787845 Tokens per Sec:     7634, Lr: 0.000300
2020-02-20 16:37:29,387 Epoch  29 Step:    43900 Batch Loss:     1.920231 Tokens per Sec:     76
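
A log this long is easier to read once it is summarised. The following is a minimal sketch, not part of JoeyNMT, that pulls the epoch, step and batch loss out of lines shaped like the ones above and averages the loss per epoch. The regular expression and the helper names `parse_batch_losses` and `mean_loss_per_epoch` are assumptions made for illustration; they only rely on the line layout visible in this output.

```python
import re

# Pattern for training lines shaped like the ones in the log above.
# Epoch summary lines ("EPOCH 20", "Epoch 19: total training loss ...") do not match.
LOG_LINE = re.compile(
    r"Epoch\s+(?P<epoch>\d+)\s+Step:\s+(?P<step>\d+)\s+"
    r"Batch Loss:\s+(?P<loss>\d+\.\d+)"
)

def parse_batch_losses(lines):
    """Return (epoch, step, batch_loss) tuples for matching lines; skip the rest."""
    records = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            records.append(
                (int(match["epoch"]), int(match["step"]), float(match["loss"]))
            )
    return records

def mean_loss_per_epoch(records):
    """Average the logged batch losses within each epoch."""
    totals = {}
    for epoch, _step, loss in records:
        total, count = totals.get(epoch, (0.0, 0))
        totals[epoch] = (total + loss, count + 1)
    return {epoch: total / count for epoch, (total, count) in totals.items()}

# Three lines copied from the log above; the last one is an epoch marker.
sample = [
    "2020-02-20 16:33:39,037 Epoch  29 Step:    43100 Batch Loss:     1.562337 Tokens per Sec:     7495, Lr: 0.000300",
    "2020-02-20 16:34:07,539 Epoch  29 Step:    43200 Batch Loss:     1.512897 Tokens per Sec:     7540, Lr: 0.000300",
    "2020-02-20 15:03:18,085 EPOCH 20",
]
records = parse_batch_losses(sample)
print(records)
print(mean_loss_per_epoch(records))
```

Keep in mind that the logged value is the loss of a single batch every 100 steps, so a per-epoch mean built this way is only a rough trend, not the total training loss JoeyNMT reports at the end of each epoch.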

In [0]:
# Copy the created models from the notebook storage to Google Drive for persistent storage
!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"
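
If you would rather do the backup from Python than from the shell, here is a minimal sketch of the same idea. The helper `copy_model_dir` and the directory names in the demo are hypothetical; in the notebook you would pass your own model directory and Google Drive path. It assumes Python 3.8 or newer, because it relies on the `dirs_exist_ok` argument of `shutil.copytree`.

```python
import shutil
import tempfile
from pathlib import Path

def copy_model_dir(src_dir, dst_dir):
    """Copy everything under src_dir into dst_dir, creating dst_dir if needed."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    if not src_dir.is_dir():
        raise FileNotFoundError(f"model directory not found: {src_dir}")
    # dirs_exist_ok (Python 3.8+) lets a later run overwrite an earlier backup.
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
    return sorted(p.name for p in dst_dir.iterdir())

# Demo on throwaway directories so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models" / "demo_transformer"  # hypothetical name
    model_dir.mkdir(parents=True)
    (model_dir / "config.yaml").write_text("name: demo\n")
    backup_dir = Path(tmp) / "gdrive" / "models" / "demo_transformer"
    print(copy_model_dir(model_dir, backup_dir))
```

Unlike the `cp` command, this version creates the destination folder when it is missing and fails loudly when the source folder does not exist.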

In [45]:
# Output our validation scores (loss, perplexity and BLEU) recorded at each validation step
! cat "../../joeynmt/models/${src}${tgt}_transformer/validations.txt"

Steps: 1000	Loss: 98175.97656	PPL: 53.68433	bleu: 1.00511	LR: 0.00030000	*
Steps: 2000	Loss: 83177.78906	PPL: 29.21336	bleu: 2.74320	LR: 0.00030000	*
Steps: 3000	Loss: 75435.21094	PPL: 21.33823	bleu: 4.85840	LR: 0.00030000	*
Steps: 4000	Loss: 69503.42969	PPL: 16.77417	bleu: 7.70259	LR: 0.00030000	*
Steps: 5000	Loss: 65853.02344	PPL: 14.46510	bleu: 9.55595	LR: 0.00030000	*
Steps: 6000	Loss: 63031.21875	PPL: 12.90035	bleu: 10.99594	LR: 0.00030000	*
Steps: 7000	Loss: 60268.74609	PPL: 11.53260	bleu: 11.95757	LR: 0.00030000	*
Steps: 8000	Loss: 58475.81250	PPL: 10.72349	bleu: 12.97758	LR: 0.00030000	*
Steps: 9000	Loss: 57032.02734	PPL: 10.11339	bleu: 13.53204	LR: 0.00030000	*
Steps: 10000	Loss: 54667.55078	PPL: 9.18829	bleu: 15.21627	LR: 0.00030000	*
Steps: 11000	Loss: 53643.98438	PPL: 8.81454	bleu: 15.05572	LR: 0.00030000	*
Steps: 12000	Loss: 52487.89453	PPL: 8.41065	bleu: 15.88412	LR: 0.00030000	*
Steps: 13000	Loss: 51491.73047	PPL: 8.07751	bleu: 16.26519	LR: 0.00030000	*
Step
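
To pick out the best checkpoint without scanning the file by eye, the lines above can be parsed. This is a minimal sketch that assumes the tab-separated `Key: value` layout shown in this output; `parse_validation_line` and `implied_token_count` are hypothetical helpers, not JoeyNMT functions. The second helper rests on a further assumption: that the logged loss is summed over all validation tokens and that PPL is the exponential of the per-token loss.

```python
import math

def parse_validation_line(line):
    """Parse one line such as 'Steps: 1000<TAB>Loss: ...<TAB>PPL: ...<TAB>bleu: ...<TAB>LR: ...<TAB>*'."""
    record = {}
    for field in line.strip().split("\t"):
        if ":" not in field:
            continue  # skip the trailing '*' marker
        key, value = field.split(":", 1)
        record[key.strip().lower()] = float(value)
    record["steps"] = int(record["steps"])
    return record

def implied_token_count(record):
    """Token count implied by loss and PPL, ASSUMING ppl = exp(loss / tokens)."""
    return record["loss"] / math.log(record["ppl"])

# Lines copied from the validations.txt output above.
sample = [
    "Steps: 10000\tLoss: 54667.55078\tPPL: 9.18829\tbleu: 15.21627\tLR: 0.00030000\t*",
    "Steps: 11000\tLoss: 53643.98438\tPPL: 8.81454\tbleu: 15.05572\tLR: 0.00030000\t*",
    "Steps: 13000\tLoss: 51491.73047\tPPL: 8.07751\tbleu: 16.26519\tLR: 0.00030000\t*",
]
records = [parse_validation_line(line) for line in sample]
best = max(records, key=lambda r: r["bleu"])
print("best BLEU:", best["bleu"], "at step", best["steps"])
print("implied tokens:", [round(implied_token_count(r)) for r in records])
```

Note that BLEU dips slightly between steps 10000 and 11000 while the loss keeps falling, so the checkpoint you call "best" depends on which metric you rank by. The implied token counts come out nearly identical across lines, which is consistent with the perplexity assumption but does not prove it.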

In [46]:
# Test our model
! cd ../../joeynmt; python3 -m joeynmt test "models/${src}${tgt}_transformer/config.yaml"

2020-02-20 16:51:25,414 -  dev bleu:  23.22 [Beam search decoding with beam size = 5 and alpha = 1.0]
2020-02-20 16:52:45,303 - test bleu:  32.64 [Beam search decoding with beam size = 5 and alpha = 1.0]
