burakaytan committed
Commit ffd6166 (1 parent: 04a2ff4)
Update README.md
README.md CHANGED
@@ -1,3 +1,58 @@
---
license: mit
language:
- tr
---
# 🇹🇷 RoBERTaTurkish

## Model description
This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, the Turkish portion of OSCAR, and some news websites.

The final training corpus is 38 GB in size and contains 329,720,508 sentences.

At Turkcell, we trained the model for 2.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256 GB RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.

## Usage
Load the tokenizer and model with the transformers library:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Download (or load from the local cache) the tokenizer and the masked-LM weights.
tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")
```
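
The loaded `tokenizer` and `model` can also be queried directly, without the pipeline helper. The following is a minimal sketch rather than the authors' code: it assumes PyTorch is installed and that the input contains exactly one `<mask>` token, and the helper names `softmax_top_k` and `predict_mask` are hypothetical. It takes the logits at the mask position, applies a softmax, and keeps the `k` most probable tokens.

```python
import math


def softmax_top_k(logits, k=5):
    """Return the k (index, probability) pairs with the highest softmax probability."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def predict_mask(text, tokenizer, model, k=5):
    """Rank candidate tokens for the single <mask> in `text` (hypothetical helper)."""
    import torch  # imported lazily so softmax_top_k stays dependency-free

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # shape: (sequence_length, vocab_size)

    # Position of the mask token; raises ValueError if the text has no <mask>.
    mask_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

    top = softmax_top_k(logits[mask_index].tolist(), k)
    return [(tokenizer.decode([token_id]), prob) for token_id, prob in top]
```

With the objects loaded above, a call would look like `predict_mask("iki ülke arasında <mask> başladı", tokenizer, model)`, which returns a list of `(token, probability)` pairs.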

## Fill Mask Usage

```python
from transformers import pipeline

# Build a fill-mask pipeline backed by the Turkish RoBERTa model.
fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-base-turkish-uncased",
    tokenizer="burakaytan/roberta-base-turkish-uncased"
)

# Input sentence: "<mask> started between the two countries".
fill_mask("iki ülke arasında <mask> başladı")

# Output:
[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.3013845384120941,
  'token': 1359,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında müzakereler başladı',
  'score': 0.1058429479598999,
  'token': 30439,
  'token_str': ' müzakereler'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.07718811184167862,
  'token': 4916,
  'token_str': ' görüşmeler'},
 {'sequence': 'iki ülke arasında kriz başladı',
  'score': 0.07174749672412872,
  'token': 3908,
  'token_str': ' kriz'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.05678590387105942,
  'token': 19346,
  'token_str': ' çatışmalar'}]
```
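
Each entry in the result is a plain dictionary, so the predictions are easy to post-process. The sketch below is illustrative only: `results` is hard-coded from the output shown above so that it runs without downloading the model, and the helper name `summarize` is hypothetical. It strips the leading space that the tokenizer attaches to `token_str` and sums the scores of the returned candidates.

```python
# Copied from the output above; in practice this would be the return value of
# fill_mask("iki ülke arasında <mask> başladı").
results = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.3013845384120941, "token": 1359, "token_str": " savaş"},
    {"sequence": "iki ülke arasında müzakereler başladı",
     "score": 0.1058429479598999, "token": 30439, "token_str": " müzakereler"},
    {"sequence": "iki ülke arasında görüşmeler başladı",
     "score": 0.07718811184167862, "token": 4916, "token_str": " görüşmeler"},
    {"sequence": "iki ülke arasında kriz başladı",
     "score": 0.07174749672412872, "token": 3908, "token_str": " kriz"},
    {"sequence": "iki ülke arasında çatışmalar başladı",
     "score": 0.05678590387105942, "token": 19346, "token_str": " çatışmalar"},
]


def summarize(results):
    """Return (candidate, score) pairs without leading spaces, plus the summed score."""
    pairs = [(r["token_str"].strip(), r["score"]) for r in results]
    total = sum(score for _, score in pairs)
    return pairs, total


pairs, covered = summarize(results)
for word, score in pairs:
    print(f"{word:<12} {score:.3f}")
print(f"probability mass of the top {len(pairs)} candidates: {covered:.3f}")
```

For this example the five candidates together account for roughly 0.61 of the probability mass, so the remaining mass is spread over the rest of the vocabulary.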