--- license: mit language: - ar - he - vi - id - jv - ms - tl - lv - lt - eu - ml - ta - te - hy - bn - mr - hi - ur - af - da - en - de - sv - fr - it - pt - ro - es - el - os - tg - fa - ja - ka - ko - th - bxr - xal - mn - sw - yo - be - bg - ru - uk - pl - my - uz - ba - kk - ky - tt - az - cv - tr - tk - tyv - sax - et - fi - hu tags: - multilingual - PyTorch - Transformers - gpt3 - gpt2 - transformers --- Original model: https://huggingface.co/ai-forever/mGPT-13B # š» mGPT 13B Multilingual language model. This model was trained on the **61** languages from **25** language families (see the list below). ## Dataset Model was pretrained on a 600Gb of texts, mostly from MC4 and Wikipedia. Training data was deduplicated, the text deduplication includes 64-bit hashing of each text in the corpus for keeping texts with a unique hash. We also filter the documents based on their text compression rate using zlib4. The most strongly and weakly compressing deduplicated texts are discarded. Here is the table with number of tokens for each language in the pretraining corpus on a logarithmic scale: ![](https://i.imgur.com/KSMfVX1.png) ## Languages Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashkir (ba), Belarusian (be), Bengali (bn), Bulgarian (bg), Burmese (my), Buryat (bxr), Chuvash (cv), Danish (da), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Javanese (jv), Kalmyk (xal), Kazakh (kk), Korean (ko), Kyrgyz (ky), Latvian (lv), Lithuanian (lt), Malay (ms), Malayalam (ml), Marathi (mr), Mongolian (mn), Ossetian (os), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Spanish (es), Swedish (sv), Swahili (sw), Tatar (tt), Telugu (te), Thai (th), Turkish (tr), Turkmen (tk), Tuvan (tyv), Ukrainian (uk), Uzbek (uz), Vietnamese (vi), Yakut (sax), Yoruba (yo) #### By language family
Language Family | Languages |
---|---|
Afro-Asiatic | Arabic (ar), Hebrew (he) |
Austro-Asiatic | Vietnamese (vi) |
Austronesian | Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl) |
Baltic | Latvian (lv), Lithuanian (lt) |
Basque | Basque (eu) |
Dravidian | Malayalam (ml), Tamil (ta), Telugu (te) |
Indo-European (Armenian) | Armenian (hy) |
Indo-European (Indo-Aryan) | Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur) |
Indo-European (Germanic) | Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv) |
Indo-European (Romance) | French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es) |
Indo-European (Greek) | Greek (el) |
Indo-European (Iranian) | Ossetian (os), Tajik (tg), Persian (fa) |
Japonic | Japanese (ja) |
Kartvelian | Georgian (ka) |
Koreanic | Korean (ko) |
Kra-Dai | Thai (th) |
Mongolic | Buryat (bxr), Kalmyk (xal), Mongolian (mn) |
Niger-Congo | Swahili (sw), Yoruba (yo) |
Slavic | Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl) |
Sino-Tibetan | Burmese (my) |
Turkic (Karluk) | Uzbek (uz) |
Turkic (Kipchak) | Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt) |
Turkic (Oghuz) | Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk) |
Turkic (Siberian) | Tuvan (tyv), Yakut (sax) |
Uralic | Estonian (et), Finnish (fi), Hungarian (hu) |