---
license: cc-by-nc-sa-4.0
datasets:
- wikipedia
- cc100
language:
- ja
library_name: transformers
pipeline_tag: fill-mask
---

BERT-base (Nothing + Unigram)
===

## How to load the tokenizer
Please download the dictionary file for Nothing + Unigram from [our GitHub repository](https://github.com/hitachi-nlp/compare-ja-tokenizer/blob/public/data/dict/nothing_unigram.json).
Then you can load the tokenizer by setting `dict_path` to the path of the downloaded dictionary file.

```python
from tokenizers import Tokenizer
from tokenizers.processors import BertProcessing
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the dictionary file
dict_path = "/path/to/nothing_unigram.json"
tokenizer = Tokenizer.from_file(dict_path)
tokenizer.post_processor = BertProcessing(
    cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
    sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)

# Convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    mask_token='[MASK]'
)
```

```python
# Test the tokenizer on a Japanese sentence
# ("Hello. I am researching morphological analyzers.")
test_str = "こんにちは。私は形態素解析器について研究をしています。"
tokenizer.convert_ids_to_tokens(tokenizer(test_str).input_ids)
# -> ['[CLS]','こん','に','ち','は','。','私','は','形態','素','解析','器','について','研究','をして','います','。','[SEP]']
```

## How to load the model
```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("hitachi-nlp/bert-base_nothing-unigram")
```
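
Putting the tokenizer and model together, here is a minimal fill-mask sketch. It is our example rather than part of the original card, and it assumes the `tokenizer` built above, that `[MASK]` is kept as a single token by the dictionary, and that `torch` is installed.

```python
import torch

# Mask one token in a sentence and predict it
masked_str = "私は[MASK]について研究をしています。"  # "I am researching [MASK]."
inputs = tokenizer(masked_str, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of [MASK] and take the highest-scoring vocabulary entry
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```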

**See [our repository](https://github.com/hitachi-nlp/compare-ja-tokenizer) for more details!**