---
language:
- tr
tags:
- roberta
license: cc-by-nc-sa-4.0
datasets:
- oscar
---

# RoBERTa Turkish medium WordPiece 44k (uncased)

A model pretrained on Turkish using a masked language modeling (MLM) objective. The model is uncased.
The pretraining corpus is the Turkish split of OSCAR, further filtered and cleaned.

The model architecture is similar to bert-medium (8 layers, 8 attention heads, and a hidden size of 512). The tokenization algorithm is WordPiece, and the vocabulary size is 44.5k.

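To give a feel for what these dimensions imply, the following is a rough back-of-the-envelope sketch that counts encoder parameters for a standard Transformer encoder. Only the layer count, hidden size, and approximate vocabulary size come from this card; the feed-forward size (2048), the 514 positions, the single token-type embedding, and the exact vocabulary of 44,500 are assumptions made for illustration, and the function `estimate_encoder_params` is hypothetical. The pooler and the MLM head are not counted.

```python
def estimate_encoder_params(vocab_size, hidden, layers, intermediate,
                            max_positions, type_vocab=1):
    """Rough parameter count for a BERT/RoBERTa-style encoder (no pooler, no LM head)."""
    # Embeddings: token, position, and token-type tables, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections, each hidden x hidden with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: hidden -> intermediate -> hidden, with biases.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * 2 * hidden
    return embeddings + layers * (attention + feed_forward + layer_norms)


# From the card: 8 layers, hidden size 512, vocabulary of about 44.5k.
# Assumed for illustration: intermediate size 2048, 514 positions, vocab of exactly 44,500.
total = estimate_encoder_params(
    vocab_size=44_500, hidden=512, layers=8, intermediate=2048, max_positions=514
)
print(f"Estimated encoder parameters: {total:,}")
print(f"Per-head dimension: {512 // 8}")  # hidden size divided by the 8 attention heads
```

The number of attention heads does not change the parameter count; it only determines how the hidden dimension is split across heads. The result is an estimate under the stated assumptions, not a reported figure for this model.
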
Details can be found in this paper:
https://arxiv.org/abs/2204.08832

The following code can be used to load the model and the tokenizer. The example maximum length (514) can be changed:

```python
from transformers import AutoModel, PreTrainedTokenizerFast

# Placeholders: replace with the model path and the tokenizer file path.
model_path = "[model_path]"
file_path = "[file_path]"

model = AutoModel.from_pretrained(model_path)
# For sequence classification:
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=[num_classes])

# Load the tokenizer file and register the special tokens.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=file_path)
tokenizer.mask_token = "[MASK]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"
tokenizer.model_max_length = 514
```

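For readers unfamiliar with WordPiece, the sketch below illustrates the greedy longest-match-first splitting that WordPiece tokenizers generally apply to a single word at inference time. It is a simplified illustration, not this model's tokenizer: the toy vocabulary, the `##` continuation prefix, and the function name `wordpiece_tokenize` are assumptions, and the real tokenizer file also handles normalization and pre-tokenization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one lowercase word into WordPiece tokens by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"kitap", "##lar", "##ım", "##da", "ev"}
print(wordpiece_tokenize("kitaplarımda", toy_vocab))
print(wordpiece_tokenize("evler", toy_vocab))
```

With this toy vocabulary, the first word splits into a stem followed by suffix pieces, while the second falls back to `[UNK]` because no piece covers `ler`.
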
### BibTeX entry and citation info

```bibtex
@misc{https://doi.org/10.48550/arxiv.2204.08832,
  doi = {10.48550/ARXIV.2204.08832},
  url = {https://arxiv.org/abs/2204.08832},
  author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
```