English to Arabic
Authors:
- Abdallah Bashir
- Amr Muhammad ALAMEEN Khalifa
Data
- The JW300 English-Arabic (bin) dataset.
- The TED-Multilingual-Parallel-Corpus English Arabic dataset
Test Data
The test data files for evaluating the model were not taken from the repo like the rest of the baselines. Instead, they were taken as a portion of the merged datasets, with the same number of entries as test.en-any.en.
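The held-out split described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `hold_out_test_set`, the fixed seed, the random shuffling, and the toy sentence pairs are all assumptions made for the example.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (English sentence, Arabic sentence)


def hold_out_test_set(
    corpora: List[List[Pair]], test_size: int, seed: int = 42
) -> Tuple[List[Pair], List[Pair]]:
    """Merge parallel corpora and split off `test_size` pairs for testing.

    Hypothetical helper: the seed and the shuffle-based sampling are
    assumptions, not taken from the original experiment.
    """
    merged = [pair for corpus in corpora for pair in corpus]
    if test_size > len(merged):
        raise ValueError("test_size exceeds the number of merged pairs")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(merged)
    # First `test_size` shuffled pairs become the test set, the rest train.
    return merged[test_size:], merged[:test_size]


# Toy pairs standing in for the JW300 and TED corpora (made up for the demo).
jw300 = [("hello", "مرحبا"), ("thank you", "شكرا")]
ted = [("good morning", "صباح الخير"), ("welcome", "اهلا")]
train, test = hold_out_test_set([jw300, ted], test_size=1)
```

In practice `test_size` would be set to the number of lines in test.en-any.en, and the English and Arabic sides would be written to separate files so they stay aligned line by line.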
Model
- Default Masakhane Transformer translation model.
- Link to Google Drive folder with models
Analysis
The dataset requires more preprocessing to remove special characters and Scripture chapter/verse names and figures. It is also very small, which is the primary factor limiting how much the model can learn.
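As a rough sketch of the extra preprocessing mentioned above, the snippet below strips chapter:verse references and unusual symbols from a line. The book-name list is a short hypothetical sample, the regular expressions are assumptions about what the noise looks like, and the bare `chapter:verse` pattern will also remove clock times such as `10:30`, so it would need tuning against the real data.

```python
import re

# Hypothetical, incomplete sample of book names; extend it for real use.
BOOK_NAMES = r"(?:genesis|exodus|psalm|matthew|mark|luke|john|acts|romans|revelation)"

# Matches e.g. "john 3:16", "1 john 4 : 8", or a bare "3:16-18".
VERSE_REF = re.compile(
    rf"(?:\b(?:[1-3]\s+)?{BOOK_NAMES}\.?\s+)?\b\d+\s*:\s*\d+(?:\s*[-,]\s*\d+)*",
    flags=re.IGNORECASE,
)

# Anything that is not a letter, digit, underscore, whitespace, or basic
# Latin/Arabic punctuation is treated as a special character (an assumption).
SPECIAL_CHARS = re.compile(r"[^\w\s.,;:!?'\"()\-،؛؟]")


def clean_line(text: str) -> str:
    """Remove verse references and special characters, then tidy spacing."""
    text = VERSE_REF.sub(" ", text)
    text = SPECIAL_CHARS.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_line("as john 3 : 16 says , god loved the world ★")
```

Because `\w` is Unicode-aware for Python `str` patterns, Arabic letters pass through unchanged. When cleaning a parallel corpus, any line that becomes empty on one side should be dropped from both sides so the sentence pairs stay aligned.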
Example 1
Source: at the same time , the police gave free passage to busloads of mkalavishvili's followers , who were bent on destroying the convention site .
Reference: [Arabic text not recoverable: character encoding corrupted]
Hypothesis: [Arabic text not recoverable: character encoding corrupted]
Example 2
Source: a big attraction was the man roland lithoman web - offset press that prints up to 90,000 magazines an hour .
Reference: [Arabic text not recoverable: character encoding corrupted]
Hypothesis: [Arabic text not recoverable: character encoding corrupted]
Results
Tokenization | BLEU dev | BLEU test |
---|---|---|
BPE | 15.45 | 9.28 |