Add model card
README.md (changed)
- zh
license: mit
---
# xmod-base

X-MOD is a multilingual masked language model trained on filtered CommonCrawl data containing 81 languages. It was introduced in the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) (Pfeiffer et al., NAACL 2022) and first released in [this repository](https://github.com/facebookresearch/fairseq/tree/main/examples/xmod).

Because it has been pre-trained with language-specific modular components (_language adapters_), X-MOD differs from previous multilingual models like [XLM-R](https://huggingface.co/xlm-roberta-base). For fine-tuning, the language adapters in each transformer layer are frozen.

# Usage

## Tokenizer
This model reuses the tokenizer of [XLM-R](https://huggingface.co/xlm-roberta-base), so you can load the tokenizer as follows:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
```
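If you want to verify that the tokenizer works as expected, it can be applied directly to a sentence; a minimal sketch (the example sentence is arbitrary):

```python
# Tokenize an example sentence into PyTorch tensors.
inputs = tokenizer("Hello, world!", return_tensors="pt")
print(inputs["input_ids"].shape)  # e.g. torch.Size([1, sequence_length])
```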
## Input Language
Because this model uses language adapters, you need to specify the language of your input so that the correct adapter can be activated:

```python
from transformers import XMODModel

model = XMODModel.from_pretrained("jvamvas/xmod-base")
model.set_default_language("en_XX")
```
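Once the input language is set, the model can be used like any other encoder. A minimal sketch of a forward pass, reusing the tokenizer loaded above (the example sentence is arbitrary):

```python
import torch

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Final-layer hidden states, shape (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```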
A directory of the language adapters in this model is found at the bottom of this model card.

## Fine-tuning
The paper recommends that the embedding layer and the language adapters be frozen during fine-tuning. A method for doing this is provided in the code:

```python
model.freeze_embeddings_and_language_adapters()
# Fine-tune the model ...
```
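As an illustration of what this implies for a training setup, the sketch below passes only the remaining trainable parameters to the optimizer; the task-specific head, data loading, and training loop are omitted, and the learning rate is an arbitrary placeholder:

```python
import torch

model.freeze_embeddings_and_language_adapters()

# The frozen parameters (embeddings and language adapters) no longer require
# gradients, so only the remaining parameters are handed to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=2e-5)

# ... run a standard fine-tuning loop with a task-specific head on top ...
```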
## Cross-lingual Transfer
After fine-tuning, zero-shot cross-lingual transfer can be tested by activating the language adapter of the target language:
```python
model.set_default_language("de_DE")
# Evaluate the model on German examples ...
```
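To make the adapter switch concrete, here is a minimal sketch that encodes an English sentence with the English adapter and its German translation with the German adapter, then compares the mean-pooled sentence representations (the sentences and the pooling choice are illustrative only, not part of the original card):

```python
import torch

def encode(text, lang):
    # Activate the language adapter for `lang` before encoding.
    model.set_default_language(lang)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1)  # simple mean pooling over tokens

english = encode("The weather is nice today.", "en_XX")
german = encode("Das Wetter ist heute schön.", "de_DE")
print(torch.nn.functional.cosine_similarity(english, german).item())
```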
# Bias, Risks, and Limitations

Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), because X-MOD has a similar architecture and has been trained on similar training data.

# Citation

**BibTeX:**

```bibtex
@inproceedings{pfeiffer-etal-2022-lifting,
    title = "Lifting the Curse of Multilinguality by Pre-training Modular Transformers",
    author = "Pfeiffer, Jonas  and
      Goyal, Naman  and
      Lin, Xi  and
      Li, Xian  and
      Cross, James  and
      Riedel, Sebastian  and
      Artetxe, Mikel",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.255",
    doi = "10.18653/v1/2022.naacl-main.255",
    pages = "3479--3495"
}
```
# Languages

This model contains the following language adapters:

| Language code | Language |
|---------------|-----------------------|
| af_ZA | Afrikaans |
| am_ET | Amharic |
| ar_AR | Arabic |
| az_AZ | Azerbaijani |
| be_BY | Belarusian |
| bg_BG | Bulgarian |
| bn_IN | Bengali |
| ca_ES | Catalan |
| cs_CZ | Czech |
| cy_GB | Welsh |
| da_DK | Danish |
| de_DE | German |
| el_GR | Greek |
| en_XX | English |
| eo_EO | Esperanto |
| es_XX | Spanish |
| et_EE | Estonian |
| eu_ES | Basque |
| fa_IR | Persian |
| fi_FI | Finnish |
| fr_XX | French |
| ga_IE | Irish |
| gl_ES | Galician |
| gu_IN | Gujarati |
| ha_NG | Hausa |
| he_IL | Hebrew |
| hi_IN | Hindi |
| hr_HR | Croatian |
| hu_HU | Hungarian |
| hy_AM | Armenian |
| id_ID | Indonesian |
| is_IS | Icelandic |
| it_IT | Italian |
| ja_XX | Japanese |
| ka_GE | Georgian |
| kk_KZ | Kazakh |
| km_KH | Central Khmer |
| kn_IN | Kannada |
| ko_KR | Korean |
| ku_TR | Kurdish |
| ky_KG | Kirghiz |
| la_VA | Latin |
| lo_LA | Lao |
| lt_LT | Lithuanian |
| lv_LV | Latvian |
| mk_MK | Macedonian |
| ml_IN | Malayalam |
| mn_MN | Mongolian |
| mr_IN | Marathi |
| ms_MY | Malay |
| my_MM | Burmese |
| ne_NP | Nepali |
| nl_XX | Dutch |
| no_XX | Norwegian |
| or_IN | Oriya |
| pa_IN | Punjabi |
| pl_PL | Polish |
| ps_AF | Pashto |
| pt_XX | Portuguese |
| ro_RO | Romanian |
| ru_RU | Russian |
| sa_IN | Sanskrit |
| si_LK | Sinhala |
| sk_SK | Slovak |
| sl_SI | Slovenian |
| so_SO | Somali |
| sq_AL | Albanian |
| sr_RS | Serbian |
| sv_SE | Swedish |
| sw_KE | Swahili |
| ta_IN | Tamil |
| te_IN | Telugu |
| th_TH | Thai |
| tl_XX | Tagalog |
| tr_TR | Turkish |
| uk_UA | Ukrainian |
| ur_PK | Urdu |
| uz_UZ | Uzbek |
| vi_VN | Vietnamese |
| zh_CN | Chinese (simplified) |
| zh_TW | Chinese (traditional) |