---

language: sk
tags:
- SlovakBERT
license: mit
datasets:
- wikipedia
- opensubtitles
- oscar
- gerulatawebcrawl
- gerulatamonitoring
- blbec.online
---


# SlovakBERT (base-sized model)
SlovakBERT is a model pretrained on the Slovak language using a masked language modeling (MLM) objective. This model is case-sensitive: it distinguishes between slovensko and Slovensko.

## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
**IMPORTANT**: The model was not trained on the “ and ” (curly quotation mark) characters, so before tokenizing the text it is advised to replace all “ and ” characters with the plain " (straight double quote) character.
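For example, a minimal normalization step (an illustrative sketch, not part of the model's official preprocessing code) could look like this:

```python
def normalize_quotes(text: str) -> str:
    # Replace typographic double quotes with the plain " character
    # before passing the text to the tokenizer.
    return text.replace('“', '"').replace('”', '"')

normalize_quotes('Povedal: “ahoj”')  # -> 'Povedal: "ahoj"'
```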

### How to use
You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Deti sa <mask> na ihrisku.")

[{'sequence': 'Deti sa hrali na ihrisku.',
  'score': 0.6355380415916443,
  'token': 5949,
  'token_str': ' hrali'},
 {'sequence': 'Deti sa hrajú na ihrisku.',
  'score': 0.14731724560260773,
  'token': 9081,
  'token_str': ' hrajú'},
 {'sequence': 'Deti sa zahrali na ihrisku.',
  'score': 0.05016357824206352,
  'token': 32553,
  'token_str': ' zahrali'},
 {'sequence': 'Deti sa stretli na ihrisku.',
  'score': 0.041727423667907715,
  'token': 5964,
  'token_str': ' stretli'},
 {'sequence': 'Deti sa učia na ihrisku.',
  'score': 0.01886524073779583,
  'token': 18099,
  'token_str': ' učia'}]
```

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = RobertaModel.from_pretrained('gerulata/slovakbert')

text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
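The returned `output` holds token-level features in `output.last_hidden_state`. If a single sentence vector is needed, one common (though not officially prescribed) choice is mean pooling over the non-padding tokens:

```python
# Mean-pool the last hidden states into one sentence embedding
# (an assumed pooling strategy, not specified by the model authors).
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```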
and in TensorFlow:
```python
from transformers import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = TFRobertaModel.from_pretrained('gerulata/slovakbert')

text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
Or extract information from the model like this:
```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Slovenské národne povstanie sa uskutočnilo v roku <mask>.")

[{'sequence': 'Slovenske narodne povstanie sa uskutočnilo v roku 1944.',
  'score': 0.7383289933204651,
  'token': 16621,
  'token_str': ' 1944'},...]
```

# Training data
The SlovakBERT model was pretrained on these datasets:

- Wikipedia (326MB of text),
- OpenSubtitles (415MB of text),
- Oscar (4.6GB of text),
- Gerulata WebCrawl (12.7GB of text),
- Gerulata Monitoring (214MB of text),
- blbec.online (4.5GB of text).

The text was then processed with the following steps:
- URL and email addresses were replaced with special tokens ("url", "email").
- Elongated punctuation was reduced (e.g. -- to -).
- Markdown syntax was removed.
- All text content in braces {} was removed to reduce the amount of markup and programming-language text.

We segmented the resulting corpus into sentences and removed duplicates to get 181.6M unique sentences. In total, the final corpus has 19.35GB of text.
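A minimal, illustrative sketch of this cleaning and deduplication pipeline is shown below; the regular expressions and the naive sentence splitter are assumptions for illustration, not the exact code used for SlovakBERT (markdown stripping is omitted for brevity):

```python
import re

def clean_text(text: str) -> str:
    # Replace URLs and e-mail addresses with special tokens.
    text = re.sub(r'https?://\S+|www\.\S+', 'url', text)
    text = re.sub(r'\S+@\S+\.\S+', 'email', text)
    # Reduce elongated punctuation, e.g. "--" -> "-".
    text = re.sub(r'([.!?,\-])\1+', r'\1', text)
    # Drop content in braces {} to cut markup and programming-language text.
    text = re.sub(r'\{[^{}]*\}', '', text)
    return text

def unique_sentences(documents):
    # Naive sentence segmentation followed by exact deduplication.
    seen, result = set(), []
    for doc in documents:
        for sent in re.split(r'(?<=[.!?])\s+', clean_text(doc)):
            sent = sent.strip()
            if sent and sent not in seen:
                seen.add(sent)
                result.append(sent)
    return result
```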

# Pretraining
The model was trained with **fairseq** on 4 × Nvidia A100 GPUs for 300K steps with a batch size of 512 and a sequence length of 512. The optimizer was Adam with a learning rate of 5e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 10^{-6}\\), a weight decay of 0.01, a dropout rate of 0.1, and learning rate warmup for 10k steps followed by linear decay of the learning rate. We used 16-bit float precision.
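As a worked illustration of this schedule (assuming the decay runs linearly from the peak learning rate to zero at the final step, which is not stated explicitly above):

```python
def lr_at_step(step: int, peak_lr: float = 5e-4,
               warmup_steps: int = 10_000, total_steps: int = 300_000) -> float:
    """Linear warmup to peak_lr, then linear decay (assumed to reach 0 at total_steps)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

lr_at_step(5_000)    # 2.5e-4, halfway through warmup
lr_at_step(10_000)   # 5e-4, peak
lr_at_step(155_000)  # 2.5e-4, halfway through decay
```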

## About us
<a href="https://www.gerulata.com/">
	<img width="300px" src="https://www.gerulata.com/images/gerulata-logo-blue.png">
</a>


Gerulata uses near real-time monitoring, advanced analytics and machine learning to help create a safer, more productive and enjoyable online environment for everyone.

### BibTeX entry and citation info
If you find our resource or paper useful, please consider including the following citation in your paper.
- https://arxiv.org/abs/2109.15254

```
@misc{pikuliak2021slovakbert,
      title={SlovakBERT: Slovak Masked Language Model},
      author={Matúš Pikuliak and Štefan Grivalský and Martin Konôpka and Miroslav Blšták and Martin Tamajka and Viktor Bachratý and Marián Šimko and Pavol Balážik and Michal Trnka and Filip Uhlárik},
      year={2021},
      eprint={2109.15254},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```