smugri3_14
This is the TartuNLP multilingual neural machine translation model for low-resource Finno-Ugric languages. It can translate between 27 languages, in 702 directions.
Languages Supported
- High and Mid-Resource Languages: Estonian, English, Finnish, Hungarian, Latvian, Norwegian, Russian
- Low-Resource Finno-Ugric Languages: Komi, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Erzya, Moksha, Proper Karelian, Livvi Karelian, Ludian, Võro, Veps, Livonian, Northern Sami, Southern Sami, Inari Sami, Lule Sami, Skolt Sami, Mansi, Khanty
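The direction count follows from the language count: each of the 27 languages can be translated into any of the other 26. As a quick sanity check, the sketch below counts the language codes from the usage script and computes n × (n − 1). The variable names `all_langs`, `n` and `directions` are only illustrative.

```shell
# Language codes as listed in the usage script
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"
all_langs="${nllb_langs},${new_langs}"

# Count the comma-separated codes: one code per line, then count non-empty lines
n=$(printf '%s\n' "$all_langs" | tr ',' '\n' | grep -c .)

# Every ordered (source, target) pair with source != target is one direction
directions=$(( n * (n - 1) ))

echo "${n} languages, ${directions} directions"
# prints "27 languages, 702 directions"
```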
Usage
The model can be tested in our web demo.
To use this model for translation, you need Fairseq v0.12.2.
Bash script example:
```bash
# Define source and target languages
src_lang="eng_Latn"
tgt_lang="kpv_Cyrl"

# Directories and paths
model_path=./smugri3_14-finno-ugric-nmt
checkpoint_path=${model_path}/smugri3_14.pt
sp_path=${model_path}/flores200_sacrebleu_tokenizer_spm.ext.model
dictionary_path=${model_path}/nllb_model_dict.ext.txt

# Language settings for fairseq
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"

# Start fairseq-interactive in interactive mode
fairseq-interactive "${model_path}" \
  -s "${src_lang}" -t "${tgt_lang}" \
  --path "${checkpoint_path}" --max-tokens 20000 --buffer-size 1 \
  --beam 4 --lenpen 1.0 \
  --bpe sentencepiece \
  --remove-bpe \
  --lang-tok-style multilingual \
  --sentencepiece-model "${sp_path}" \
  --fixed-dictionary "${dictionary_path}" \
  --task translation_multi_simple_epoch \
  --decoder-langtok --encoder-langtok src \
  --lang-pairs "${src_lang}-${tgt_lang}" \
  --langs "${nllb_langs},${new_langs}" \
  --cpu
```
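When the command above is used in a pipeline rather than typed into interactively, its output mixes several tab-separated line types: `S-` for the source, `H-` for the tokenized hypothesis, `D-` for the detokenized hypothesis, and `P-` for per-token scores. A minimal sketch of a filter that keeps only the detokenized translations is shown below. The function name `extract_translations` is hypothetical, and the sketch assumes the `D-<id>`, score, text layout that fairseq-interactive prints.

```shell
# Keep only the detokenized hypotheses from fairseq-interactive output.
# Assumes lines of the form: D-<id><TAB><score><TAB><translated text>
extract_translations() {
    # Select D- lines, then drop the id and score columns (fields 1 and 2)
    grep '^D-' | cut -f3-
}

# Usage (assumes the variables and flags from the script above):
#   echo "Hello, world!" | fairseq-interactive "${model_path}" ... | extract_translations
```

Since fairseq-interactive reads source sentences from standard input by default, a text file with one sentence per line can be translated the same way by redirecting it into the command.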
Scores
Average:
to-lang | BLEU | chrF | chrF++ |
---|---|---|---|
ru | 24.82 | 51.81 | 49.08 |
en | 28.24 | 55.91 | 53.73 |
et | 18.66 | 51.72 | 47.69 |
fi | 15.45 | 50.04 | 45.38 |
hun | 16.73 | 47.38 | 44.19 |
lv | 18.15 | 49.04 | 45.54 |
nob | 14.43 | 45.64 | 42.29 |
kpv | 10.73 | 42.34 | 38.50 |
liv | 5.16 | 29.95 | 27.28 |
mdf | 5.27 | 37.66 | 32.99 |
mhr | 8.51 | 43.42 | 38.76 |
mns | 2.45 | 27.75 | 24.03 |
mrj | 7.30 | 40.81 | 36.40 |
myv | 4.72 | 38.74 | 33.80 |
olo | 4.63 | 34.43 | 30.00 |
udm | 7.50 | 40.07 | 35.72 |
krl | 9.39 | 42.74 | 38.24 |
vro | 8.64 | 39.89 | 35.97 |
vep | 6.73 | 38.15 | 33.91 |
lud | 3.11 | 31.50 | 27.30 |
Evaluated on the Smugri Flores test set.