{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "accelerator": "GPU", "colab": { "name": "English_To_Dendi_BPE_notebook_custom_data.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true, "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.8" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Igc5itf-xMGj" }, "source": [ "# Masakhane - Machine Translation for African Languages (Using JoeyNMT)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "x4fXCKCf36IK" }, "source": [ "## Note before beginning:\n", "### - The idea is that you should be able to make minimal changes to this in order to get SOME result for your own translation corpus. \n", "\n", "### - The tl;dr: Go to the **\"TODO\"** comments which will tell you what to update to get up and running\n", "\n", "### - If you actually want to have a clue what you're doing, read the text and peek at the links\n", "\n", "### - With 100 epochs, it should take around 7 hours to run in Google Colab\n", "\n", "### - Once you've gotten a result for your language, please attach and email your notebook that generated it to masakhanetranslation@gmail.com\n", "\n", "### - If you care enough and get a chance, doing a brief background on your language would be amazing. See examples in [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)\n", "\n", "### - This notebook is intended to be used with custom parallel data. That means that you need two files, where one is in your language, the other English, and the lines in the files are corresponding translations." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "l929HimrxS0a" }, "source": [ "## Pre-process your data\n", "\n", "We assume here that you already have a data set. The format in which we will process it here requires that \n", "1. you have two files, one for each language\n", "2. the files are sentence-aligned, which means that each line should correspond to the same line in the other file.\n" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "oGRmDELn7Az0", "outputId": "0ffd4c96-92c0-4630-ce07-e03bf4bc5f66", "colab": { "base_uri": "https://localhost:8080/", "height": 124 } }, "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "text": [ "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly\n", "\n", "Enter your authorization code:\n", "··········\n", "Mounted at /content/drive\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "Cn3tgQLzUxwn", "colab": {} }, "source": [ "# TODO: Set your source and target languages. 
Keep in mind, these traditionally use language codes as found here:\n", "# These will also become the suffix's of all vocab and corpus files used throughout\n", "import os\n", "source_language = \"en\"\n", "target_language = \"ddn\" \n", "lc = False # If True, lowercase the data.\n", "seed = 42 # Random seed for shuffling.\n", "tag = \"baseline\" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted\n", "\n", "os.environ[\"src\"] = source_language # Sets them in bash as well, since we often use bash scripts\n", "os.environ[\"tgt\"] = target_language\n", "os.environ[\"tag\"] = tag\n", "\n", "# This will save it to a folder in our gdrive instead!\n", "!mkdir -p \"/content/drive/My Drive/masakhane/$src-$tgt-$tag\"\n", "os.environ[\"gdrive_path\"] = \"/content/drive/My Drive/masakhane/%s-%s-%s\" % (source_language, target_language, tag)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "kBSgJHEw7Nvx", "outputId": "daf07487-72e2-425f-fd48-fa07fc942446", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "source": [ "!echo $gdrive_path" ], "execution_count": 3, "outputs": [ { "output_type": "stream", "text": [ "/content/drive/My Drive/masakhane/en-ddn-baseline\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "gA75Fs9ys8Y9", "outputId": "4cd4ac9b-6e20-4212-de29-3c2fd7c02206", "colab": { "base_uri": "https://localhost:8080/", "height": 104 } }, "source": [ "# Install opus-tools\n", "! pip install opustools-pkg" ], "execution_count": 4, "outputs": [ { "output_type": "stream", "text": [ "Collecting opustools-pkg\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/6c/9f/e829a0cceccc603450cd18e1ff80807b6237a88d9a8df2c0bb320796e900/opustools_pkg-0.0.52-py3-none-any.whl (80kB)\n", "\r\u001b[K |████ | 10kB 21.3MB/s eta 0:00:01\r\u001b[K |████████ | 20kB 2.2MB/s eta 0:00:01\r\u001b[K |████████████▏ | 30kB 3.3MB/s eta 0:00:01\r\u001b[K |████████████████▏ | 40kB 2.1MB/s eta 0:00:01\r\u001b[K |████████████████████▎ | 51kB 2.6MB/s eta 0:00:01\r\u001b[K |████████████████████████▎ | 61kB 3.1MB/s eta 0:00:01\r\u001b[K |████████████████████████████▎ | 71kB 3.6MB/s eta 0:00:01\r\u001b[K |████████████████████████████████| 81kB 3.1MB/s \n", "\u001b[?25hInstalling collected packages: opustools-pkg\n", "Successfully installed opustools-pkg-0.0.52\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "xq-tDZVks7ZD", "outputId": "8ef197f9-aab8-48a0-c808-b802103dc366", "colab": { "base_uri": "https://localhost:8080/", "height": 52 } }, "source": [ "# TODO: specify the file paths here\n", "source_file = \"/test.en\"\n", "target_file = \"/test.ddn\"\n", "\n", "# They should both have the same length.\n", "! wc -l $source_file\n", "! wc -l $target_file" ], "execution_count": 6, "outputs": [ { "output_type": "stream", "text": [ "7943 /test.en\n", "7943 /test.ddn\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "BNmkusFfGorx", "colab_type": "code", "colab": {} }, "source": [ "# TODO: Pre-processing! 
(OPTIONAL)\n", "\n", "# If your data contains weird symbols or the like, you might want to do some cleaning and normalization.\n", "# We don't have the code in the notebook for that, but you can use sacremoses \"normalize\" for example for normalization punctuation: https://github.com/alvations/sacremoses.\n", "\n", "# We apply tokenization to separate punctuation marks from the actual words, split words at hyphens etc.\n", "# If your data is already tokenized, that's great! Skip this cell.\n", "# Otherwise we can use sacremoses to do the tokenization for us. \n", "# We need the data to be tokenized such that it matches the global test set.\n", "\n", "#! pip install sacremoses\n", "\n", "#tok_source_file = source_file+\".tok\"\n", "#tok_target_file = target_file+\".tok\"\n", "\n", "# Tokenize the source\n", "#! sacremoses tokenize -l $source_language < $source_file > $tok_source_file\n", "# Tokenize the target\n", "#! sacremoses tokenize -l $target_language < $target_file > $tok_target_file\n", "\n", "# Let's take a look what tokenization did to the text.\n", "#! head $source_file*\n", "#! head $target_file*\n", "\n", "# Change the pointers to our files such that we continue to work with the tokenized data.\n", "#source_file = tok_source_file\n", "#target_file = tok_target_file" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "n48GDRnP8y2G", "colab": {} }, "source": [ "# Download the global test set.\n", "#! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-any.en\n", " \n", "# And the specific test set for this language pair.\n", "#os.environ[\"trg\"] = target_language \n", "#os.environ[\"src\"] = target_language \n", "\n", "#! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-$trg.en \n", "#! mv test.en-$trg.en test.en\n", "#! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-$trg.$trg \n", "#! mv test.en-$trg.$trg test.$trg\n", "\n", "# TODO: if this fails it means that there is NO test set for your language yet. It's on you to create one.\n", "# A good idea would be to take a random subset of your data, and add it to https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-any.en.\n", "# Make a Pull Request and get it approved and merged.\n", "# Then repeat this cell to retrieve the new test set.\n", "# Then proceed to the next cell that will filter out all duplicates from the training set, so that there is no overlap between training and test set." 
], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "NqDG-CI28y2L", "colab": {} }, "source": [ "# Read the test data to filter from train and dev splits.\n", "# Store english portion in set for quick filtering checks.\n", "#en_test_sents = set()\n", "#filter_test_sents = \"test.en-any.en\"\n", "#j = 0\n", "#with open(filter_test_sents) as f:\n", "# for line in f:\n", "# en_test_sents.add(line.strip())\n", "# j += 1\n", "#print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "3CNdwLBCfSIl", "outputId": "dd4274db-a18a-4bcd-e301-7ef2d85fb842", "colab": { "base_uri": "https://localhost:8080/", "height": 161 } }, "source": [ "import pandas as pd\n", "\n", "source = []\n", "target = []\n", "skip_lines = [] # Collect the line numbers of the source portion to skip the same lines for the target portion.\n", "with open(source_file) as f:\n", " for i, line in enumerate(f):\n", " # Skip sentences that are contained in the test set.\n", " #if line.strip() not in en_test_sents:\n", " source.append(line.strip())\n", " #else:\n", " # skip_lines.append(i) \n", "with open(target_file) as f:\n", " for j, line in enumerate(f):\n", " # Only add to corpus if corresponding source was not skipped.\n", " #if j not in skip_lines:\n", " target.append(line.strip())\n", " \n", "print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))\n", " \n", "df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])\n", "df.head(3)" ], "execution_count": 10, "outputs": [ { "output_type": "stream", "text": [ "Loaded data and skipped 0/7942 lines since contained in test set.\n" ], "name": "stdout" }, { "output_type": "execute_result", "data": { "text/html": [ "
\n", " | source_sentence | \n", "target_sentence | \n", "
---|---|---|
0 | \n", "The book of the generation of Jesus Christ, th... | \n", "Yesu Mɛsiyɑ ko ɑ̀ ci Dɑfidi ize, Abulɛmɑ ize, ... | \n", "
1 | \n", "Abraham begat Isaac; and Isaac begat Jacob; an... | \n", "Abulɛmɑ nɑ Isɑɑkɑ hɛi. Isɑɑkɑ nɑ Yɑkɔfu hɛi. Y... | \n", "
2 | \n", "and Judah begat Perez and Zerah of Tamar; and ... | \n", "Yudɑ mo nɑ Fɑresi ndɑ Zerɑ hɛi Tɑmɑrɑ gɑɑ. Fɑr... | \n", "