---
license: mit
datasets:
  - pkupie/mc2_corpus
language:
  - bo
  - ug
  - mn
  - kk
---

# MC^2XLMR-large

Github Repo

We continually pretrained XLM-RoBERTa-large on MC^2. The resulting model supports Tibetan, Uyghur, Kazakh (in the Kazakh Arabic script), and Mongolian (in the traditional Mongolian script).
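
Since the model keeps the XLM-RoBERTa architecture and tokenizer, it can be loaded with Hugging Face Transformers for fill-mask inference out of the box. Below is a minimal sketch; the model ID `pkupie/mc2-xlmr-large` is inferred from this repo's name and the MC^2 dataset namespace, so verify it on the Hub before use.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed repo ID; adjust if the model is hosted under a different namespace.
model_id = "pkupie/mc2-xlmr-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# XLM-RoBERTa-style models use <mask> as the mask token. Real inputs
# should be in one of the supported languages (bo, ug, mn, kk); the
# English sentence below is only a placeholder.
for prediction in fill_mask(f"Hello {tokenizer.mask_token}!"):
    print(prediction["token_str"], round(prediction["score"], 4))
```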

See details in the paper (arXiv:2311.08348).

We have also released another model trained on MC^2: MC^2Llama-13B.

## Citation

```bibtex
@misc{zhang2023mc2,
      title={MC^2: A Multilingual Corpus of Minority Languages in China},
      author={Chen Zhang and Mingxu Tao and Quzhe Huang and Jiuheng Lin and Zhibin Chen and Yansong Feng},
      year={2023},
      eprint={2311.08348},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```