{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Igc5itf-xMGj" }, "source": [ "# Masakhane - Machine Translation for African Languages (Using JoeyNMT)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "x4fXCKCf36IK" }, "source": [ "## English - Tigrinya\n", "\n", "### Data\n", "A mix of corpora is used: \n", "- JW300\n", "- Tatoeba\n", "- [Parallel Corpora for Ethiopian Languages](https://github.com/AAUThematic4LT/Parallel-Corpora-for-Ethiopian-Languages)\n", "\n", "The test set is taken from the Masakhane global JW300 test set. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "l929HimrxS0a" }, "source": [ "## Retrieve your data & make a parallel corpus\n", "\n", "If you want to use the JW300 data referenced on the Masakhane website or in our GitHub repo, you can use `opus-tools` to convert the data into a convenient format. `opus_read` from that package reads the native aligned XML files and converts them to TMX format. The tool can also be used to fetch relevant files from OPUS on the fly and to filter the data as necessary. [Read the documentation](https://pypi.org/project/opustools-pkg/) for more details.\n", "\n", "Once you have your corpus files in TMX format (an XML structure that holds the sentences in your source and target languages in a single file), we recommend reading them into a pandas dataframe. Thankfully, Jade wrote a silly `tmx2dataframe` package which converts your TMX file to a pandas dataframe. 
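\n",
"\n",
"For a rough idea of what that conversion does, here is a minimal hand-rolled sketch using only `xml.etree` and `pandas` (an illustration of the same idea, not the package's actual code; the file name `sample.tmx` is hypothetical):\n",
"\n",
"```python\n",
"import xml.etree.ElementTree as ET\n",
"import pandas as pd\n",
"\n",
"def tmx_to_df(path, src_lang='en', trg_lang='ti'):\n",
"    # TMX stores each sentence pair as a <tu> with per-language <tuv><seg> children.\n",
"    xml_ns = '{http://www.w3.org/XML/1998/namespace}'\n",
"    rows = []\n",
"    for tu in ET.parse(path).getroot().iter('tu'):\n",
"        pair = {}\n",
"        for tuv in tu.iter('tuv'):\n",
"            seg = tuv.find('seg')\n",
"            if seg is not None:\n",
"                # TMX 1.4 uses xml:lang; some older files use a plain lang attribute.\n",
"                lang = tuv.get(xml_ns + 'lang') or tuv.get('lang')\n",
"                pair[lang] = (seg.text or '').strip()\n",
"        if src_lang in pair and trg_lang in pair:\n",
"            rows.append({'source_sentence': pair[src_lang],\n",
"                         'target_sentence': pair[trg_lang]})\n",
"    return pd.DataFrame(rows)\n",
"```\n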
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 124 }, "colab_type": "code", "id": "oGRmDELn7Az0", "outputId": "4cf88906-25db-48d8-db25-6395cc4dc5d3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n", "\n", "Enter your authorization code:\n", "··········\n", "Mounted at /content/drive\n" ] } ], "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "Cn3tgQLzUxwn" }, "outputs": [], "source": [ "# TODO: Set your source and target languages. 
Keep in mind, these traditionally use language codes as found here:\n", "# These will also become the suffixes of all vocab and corpus files used throughout\n", "import os\n", "source_language = \"en\"\n", "target_language = \"ti\" \n", "lc = True # If True, lowercase the data.\n", "seed = 42 # Random seed for shuffling.\n", "tag = \"tigmix\" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted\n", "\n", "os.environ[\"src\"] = source_language # Sets them in bash as well, since we often use bash scripts\n", "os.environ[\"tgt\"] = target_language\n", "os.environ[\"tag\"] = tag\n", "\n", "# This will save it to a folder in our gdrive instead!\n", "!mkdir -p \"/content/drive/My Drive/masakhane/$src-$tgt-$tag\"\n", "os.environ[\"gdrive_path\"] = \"/content/drive/My Drive/masakhane/%s-%s-%s\" % (source_language, target_language, tag)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "kBSgJHEw7Nvx", "outputId": "171a2695-9c9e-4bf6-a2bf-1ebd440bce6d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/content/drive/My Drive/masakhane/en-ti-tigmix\n" ] } ], "source": [ "!echo $gdrive_path" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "colab_type": "code", "id": "gA75Fs9ys8Y9", "outputId": "af082e56-690f-4395-8ef4-31740fd0fd37" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: opustools-pkg in /usr/local/lib/python3.6/dist-packages (0.0.52)\n" ] } ], "source": [ "# Install opus-tools\n", "! 
pip install opustools-pkg" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 697 }, "colab_type": "code", "id": "xq-tDZVks7ZD", "outputId": "161889c4-4f43-4f38-e2dc-53930bd2b59e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-ti.xml.gz not found. The following files are available for downloading:\n", "\n", " 4 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-ti.xml.gz\n", " 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip\n", " 40 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/ti.zip\n", "\n", " 307 MB Total size\n", "./JW300_latest_xml_en-ti.xml.gz ... 100% of 4 MB\n", "./JW300_latest_xml_en.zip ... 100% of 263 MB\n", "./JW300_latest_xml_ti.zip ... 100% of 40 MB\n", "--2020-01-21 10:47:26-- https://object.pouta.csc.fi/OPUS-Tatoeba/v20190709/moses/en-ti.txt.zip\n", "Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19\n", "Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 9408 (9.2K) [application/zip]\n", "Saving to: ‘en-ti.txt.zip’\n", "\n", "en-ti.txt.zip 100%[===================>] 9.19K --.-KB/s in 0s \n", "\n", "2020-01-21 10:47:27 (319 MB/s) - ‘en-ti.txt.zip’ saved [9408/9408]\n", "\n", "Archive: en-ti.txt.zip\n", " inflating: README \n", " inflating: LICENSE \n", "replace Tatoeba.en-ti.en? [y]es, [n]o, [A]ll, [N]one, [r]ename: y\n", " inflating: Tatoeba.en-ti.en \n", "replace Tatoeba.en-ti.ti? 
[y]es, [n]o, [A]ll, [N]one, [r]ename: y\n", " inflating: Tatoeba.en-ti.ti \n", " inflating: Tatoeba.en-ti.xml \n", "Total lines from Ethiopian corpus:\n", "36025 ethiopian.ti\n", "36025 ethiopian.en\n", "cat: tigmix.en: input file is output file\n", "cat: tigmix.ti: input file is output file\n", "Total lines Tigmix:\n", "435546 tigmix.ti\n", "435546 tigmix.en\n" ] } ], "source": [ "# Download and extract JW300\n", "! opus_read -d JW300 -s $src -t $tgt -wm moses -w jw300.$src jw300.$tgt -q\n", "! gunzip JW300_latest_xml_$src-$tgt.xml.gz\n", "\n", "# Download other corpora\n", "# OPUS Tatoeba\n", "! wget https://object.pouta.csc.fi/OPUS-Tatoeba/v20190709/moses/en-ti.txt.zip\n", "! unzip en-ti.txt.zip\n", "! rm *.zip LICENSE README *.xml\n", "\n", "# Ethiopian languages corpus\n", "#! git clone https://github.com/AAUThematic4LT/Parallel-Corpora-for-Ethiopian-Languages.git\n", "! cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Legal/tig_eng/other.eg > ethiopian.en\n", "! cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Legal/tig_eng/other.tg > ethiopian.ti\n", "! cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Legal/tig_eng/p_eng_et.txt >> ethiopian.en\n", "! cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Legal/tig_eng/p_tig_et.txt >> ethiopian.ti\n", "! cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Spritual/jw_bible/tig_eng/p_eng_et.txt >> ethiopian.en\n", "! cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Spritual/jw_bible/tig_eng/p_tig_et.txt >> ethiopian.ti\n", "! cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Spritual/jw_daily/tig_eng/p_eng_et >> ethiopian.en\n", "! 
cat Parallel-Corpora-for-Ethiopian-Languages/Exp\\ I-English\\ to\\ Local\\ Lang/Spritual/jw_daily/tig_eng/p_tig_et >> ethiopian.ti\n", "\n", "print(\"Total lines from Ethiopian corpus:\")\n", "! wc -l ethiopian.ti\n", "! wc -l ethiopian.en\n", "\n", "# Merge all corpora into one dataset: Tigmix\n", "# List the files explicitly: a bare 'cat *.en > tigmix.en' would sweep tigmix.en\n", "# into its own input on re-runs (the 'input file is output file' warning above).\n", "! cat jw300.en Tatoeba.en-ti.en ethiopian.en > tigmix.en\n", "! cat jw300.ti Tatoeba.en-ti.ti ethiopian.ti > tigmix.ti\n", "\n", "print(\"Total lines Tigmix:\")\n", "! wc -l tigmix.ti\n", "! wc -l tigmix.en" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 610 }, "colab_type": "code", "id": "n48GDRnP8y2G", "outputId": "4506c1b1-f0c1-4188-c017-68c23a249570" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2020-01-21 10:49:40-- https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 277791 (271K) [text/plain]\n", "Saving to: ‘test.en-any.en’\n", "\n", "\r", "test.en-any.en 0%[ ] 0 --.-KB/s \r", "test.en-any.en 100%[===================>] 271.28K --.-KB/s in 0.04s \n", "\n", "2020-01-21 10:49:40 (6.62 MB/s) - ‘test.en-any.en’ saved [277791/277791]\n", "\n", "--2020-01-21 10:49:41-- https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-ti.en\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 199625 (195K) [text/plain]\n", "Saving to: ‘test.en-ti.en’\n", "\n", "test.en-ti.en 100%[===================>] 194.95K --.-KB/s in 0.04s \n", "\n", "2020-01-21 10:49:42 (5.35 MB/s) - ‘test.en-ti.en’ saved [199625/199625]\n", "\n", "--2020-01-21 10:49:44-- https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-ti.ti\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 311059 (304K) [text/plain]\n", "Saving to: ‘test.en-ti.ti’\n", "\n", "test.en-ti.ti 100%[===================>] 303.77K --.-KB/s in 0.05s \n", "\n", "2020-01-21 10:49:44 (5.92 MB/s) - ‘test.en-ti.ti’ saved [311059/311059]\n", "\n" ] } ], "source": [ "# Download the global test set.\n", "! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en\n", " \n", "# And the specific test set for this language pair.\n", "os.environ[\"trg\"] = target_language \n", "os.environ[\"src\"] = source_language \n", "\n", "! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$trg.en \n", "! mv test.en-$trg.en test.en\n", "! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$trg.$trg \n", "! 
mv test.en-$trg.$trg test.$trg" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "NqDG-CI28y2L", "outputId": "3e2177ae-2a94-4b9a-8ac5-dcdad6e8c844" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded 3571 global test sentences to filter from the training/dev data.\n" ] } ], "source": [ "# Read the test data to filter from train and dev splits.\n", "# Store the English portion in a set for quick filtering checks.\n", "en_test_sents = set()\n", "filter_test_sents = \"test.en-any.en\"\n", "j = 0\n", "with open(filter_test_sents) as f:\n", "  for line in f:\n", "    en_test_sents.add(line.strip())\n", "    j += 1\n", "print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 196 }, "colab_type": "code", "id": "3CNdwLBCfSIl", "outputId": "b0d30018-76a8-42f5-f8d3-19ad1c9df3aa" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded data and skipped 31869/435545 lines either empty or contained in test set.\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>source_sentence</th>\n", "      <th>target_sentence</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>Health care for asylum seekers and refugees in...</td>\n", "      <td>ሓልዮት ጥዕና ንደለይቲ ዑቕባን ስደተኛታትን ኣብ ስኮትላንድ</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>In Scotland , most health care is provided by ...</td>\n", "      <td>መብዛሕትኡ ክንክን ጥዕና ኣብ ስኮትላንድ ብሃገራዊ ኣገልግሎት ጥዕና ( ሃ...</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>Everyone who lives legally in Scotland has a r...</td>\n", "      <td>ሕጋዊያን ነበርቲ ስኮትላንድ ፤ ዜግነቶም ብዘየገድስ መሰል ናይ ሃ.ኣ.ጥ....</td>\n", "    </tr>\n", "