---
language:
- tr
tags:
- roberta
license: cc-by-nc-sa-4.0
---

# RoBERTweetTurkCovid (uncased)

A model pretrained on Turkish text with a masked language modeling (MLM) objective. The model is uncased.

The pretraining corpus is a collection of Turkish tweets related to COVID-19. Details of the data can be found in this paper:

https://arxiv.org/...

The model architecture is similar to RoBERTa-base (12 layers, 12 attention heads, and a hidden size of 768). The tokenization algorithm is WordPiece, and the vocabulary size is 30k.

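
As a rough illustration of how WordPiece splits a word into subwords, the sketch below implements greedy longest-match-first segmentation in plain Python. The `wordpiece_tokenize` helper and the toy vocabulary are hypothetical and exist only for this example; the model's actual 30k vocabulary is not reproduced here, and the `##` continuation prefix is assumed from the common BERT convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one lowercased word into subwords by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark non-initial pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from the model.
toy_vocab = {"maske", "test", "##ler", "##i", "[UNK]"}

print(wordpiece_tokenize("maskeler", toy_vocab))   # ['maske', '##ler']
print(wordpiece_tokenize("testleri", toy_vocab))   # ['test', '##ler', '##i']
print(wordpiece_tokenize("karantina", toy_vocab))  # ['[UNK]']
```

Because the model is uncased, input would be lowercased before this step; whether the released tokenizer file does that itself is not stated here, so treat it as something to verify.
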
Details of pretraining can be found in this paper:

https://arxiv.org/...

The following code can be used for model loading and tokenization; the example max length (768) can be changed:
```python
from transformers import AutoModel, AutoModelForSequenceClassification, PreTrainedTokenizerFast

# Placeholders: point these at your model directory and tokenizer file.
model_path = "path/to/model"
file_path = "path/to/tokenizer_file"

model = AutoModel.from_pretrained(model_path)
# For sequence classification (num_classes is the number of target labels):
# model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_classes)

tokenizer = PreTrainedTokenizerFast(tokenizer_file=file_path)
tokenizer.mask_token = "[MASK]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"
tokenizer.model_max_length = 768  # example value; can be changed
```

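
Once the tokenizer is set up, a batch of tweets can be encoded with padding and truncation. The snippet below is a minimal sketch, not part of the original card: `encode_tweets` is a hypothetical helper name, and it only relies on the standard `PreTrainedTokenizerFast` call arguments `padding`, `truncation`, and `max_length`.

```python
from transformers import PreTrainedTokenizerFast


def encode_tweets(tokenizer: PreTrainedTokenizerFast, tweets, max_length=768):
    """Tokenize a batch of tweets, padding to the longest one in the batch."""
    return tokenizer(
        list(tweets),
        padding=True,           # pad shorter tweets with tokenizer.pad_token
        truncation=True,        # cut tweets longer than max_length tokens
        max_length=max_length,  # hypothetical default mirroring the example value
    )
```

The result holds `input_ids` and `attention_mask` as Python lists. To feed the model directly, the same call can be made with `return_tensors="pt"`, assuming PyTorch is installed.
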
### BibTeX entry and citation info

```bibtex
@article{}
```