---
language: "tr"
tags:
- turkish
- tr
- gpt2-tr
- gpt2-turkish
---
# 🇹🇷 Turkish GPT-2 Model

In this repository I release a GPT-2 model that was trained on various Turkish texts.

The model is meant to be an entry point for fine-tuning on other texts.

## Training corpora

I used a Turkish corpus taken from the OSCAR corpus.

With the Hugging Face Tokenizers library, I created a 52K byte-level BPE vocabulary from the training corpus.
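The tokenizer step can be sketched roughly as follows. This is a minimal illustration, not the exact script used for this model: the helper name `train_byte_level_bpe`, the `min_frequency` value, the `<|endoftext|>` special token, and the tiny in-memory sample are all assumptions. Only the 52K vocabulary size and the byte-level BPE scheme come from the description above.

```python
from tokenizers import ByteLevelBPETokenizer


def train_byte_level_bpe(texts, vocab_size=52_000, min_frequency=2):
    """Train a byte-level BPE tokenizer on an iterable of strings.

    Hypothetical helper for illustration; the values below are assumptions,
    except vocab_size=52_000, which matches the 52K vocab of this model.
    """
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train_from_iterator(
        texts,
        vocab_size=vocab_size,
        min_frequency=min_frequency,       # assumed, not stated in this README
        show_progress=False,
        special_tokens=["<|endoftext|>"],  # usual GPT-2 end-of-text token (assumed)
    )
    return tokenizer


if __name__ == "__main__":
    # Tiny stand-in sample; the real training used the full Turkish corpus.
    sample = ["Merhaba dünya!", "Bugün hava çok güzel.", "Merhaba, nasılsın?"]
    tok = train_byte_level_bpe(sample)
    print(tok.encode("Merhaba dünya!").tokens)
```

On a sample this small the learned vocabulary stays far below 52K entries, because there are not enough distinct pairs to merge. Calling `tok.save_model(directory)` on an existing directory writes the `vocab.json` and `merges.txt` files that a GPT-2 tokenizer loads.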

After creating the vocab, I trained the GPT-2 model for Turkish on two 2080 Ti GPUs over the complete training corpus for five epochs.

Logs during training:
https://tensorboard.dev/experiment/3AWKv8bBTaqcqZP5frtGkw/#scalars

## Model weights

Both PyTorch and TensorFlow compatible weights are available.

| Model                             | Downloads
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------
| `redrussianarmy/gpt2-turkish-cased`   | [`config.json`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/config.json) • [`merges.txt`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/merges.txt) • [`pytorch_model.bin`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/pytorch_model.bin) • [`special_tokens_map.json`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/special_tokens_map.json) • [`tf_model.h5`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/tf_model.h5) • [`tokenizer_config.json`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/tokenizer_config.json) • [`training_args.bin`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/training_args.bin) • [`vocab.json`](https://huggingface.co/redrussianarmy/gpt2-turkish-cased/resolve/main/vocab.json)

## Using the model

The model itself can be used in this way:

``` python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the causal language model (GPT-2 with LM head)
tokenizer = AutoTokenizer.from_pretrained("redrussianarmy/gpt2-turkish-cased")
model = AutoModelForCausalLM.from_pretrained("redrussianarmy/gpt2-turkish-cased")
```

Here's an example that shows how to use the great Transformers Pipelines for generating text:

``` python
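from transformers import pipeline

# Build a text-generation pipeline; the tokenizer comes from the same repository
pipe = pipeline("text-generation", model="redrussianarmy/gpt2-turkish-cased",
                tokenizer="redrussianarmy/gpt2-turkish-cased")
# max_length caps the total length (prompt plus generated tokens)
text = pipe("Akşamüstü yolda ilerlerken, ", max_length=800)[0]["generated_text"]
print(text)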
```

### How to clone the model repo?
```
git lfs install
git clone https://huggingface.co/redrussianarmy/gpt2-turkish-cased
```

## Contact (Bugs, Feedback, Contribution and more)
For questions about the GPT2-Turkish model, just open an issue [here](https://github.com/redrussianarmy/gpt2-turkish/issues) 🤗