---
language: ja
license: apache-2.0
tags:
- SudachiTra
- Sudachi
- SudachiPy
- bert
- Japanese
- NWJC
datasets:
- NWJC
---

# bert-base-sudachitra-v11

This model is a variant of the chiTra v1.1 model from SudachiTra.
The differences between the original `chiTra v1.1` and `bert-base-sudachitra-v11` are:
- `word_form_type` was changed from `normalized_nouns` to `surface` (see the sketch below)
- Two consecutive empty lines in `vocab.txt` were replaced with a dummy entry followed by a single empty line
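`word_form_type` controls which word form the Sudachi tokenizer emits for each token before subword splitting: `surface` keeps the text exactly as written, while `normalized_nouns` (the original chiTra v1.1 setting) maps noun spelling variants to a canonical form. A minimal sketch of the effect, assuming SudachiTra's `BertSudachipyTokenizer` and that `word_form_type` can be overridden at load time (check the SudachiTra repository for the exact API):

```python
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

# Assumed Hugging Face model id for this repository; adjust if it differs.
model_name = "WorksApplications/bert-base-sudachitra-v11"

# This variant ships with word_form_type="surface": tokens keep their
# written form, so spelling variants such as 引越し / 引っ越し stay distinct.
surface_tok = BertSudachipyTokenizer.from_pretrained(model_name)

# For comparison, the original chiTra v1.1 setting: noun variants are
# normalized to one canonical spelling before subword splitting.
normalized_tok = BertSudachipyTokenizer.from_pretrained(
    model_name, word_form_type="normalized_nouns"
)

print(surface_tok.tokenize("お引越し"))
print(normalized_tok.tokenize("お引越し"))
```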

Please also read the original `README.md`, reproduced below.

*(See [GitHub - WorksApplications/SudachiTra](https://github.com/WorksApplications/SudachiTra) for the latest README)*

# Sudachi Transformers (chiTra)

chiTra provides pre-trained language models and a Japanese tokenizer for [Transformers](https://github.com/huggingface/transformers).
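For example, encoding a sentence with a chiTra model might look like the following. This is a sketch based on the SudachiTra README; the model id `WorksApplications/bert-base-sudachitra-v11` is an assumption and should be adjusted to the actual repository name.

```python
import torch
from transformers import BertModel
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

model_name = "WorksApplications/bert-base-sudachitra-v11"  # assumed model id

tokenizer = BertSudachipyTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

inputs = tokenizer("徳島は阿波踊りで有名です。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings: (batch, sequence_length, hidden_size); 768 for BERT-base.
print(outputs.last_hidden_state.shape)
```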

## chiTra pretrained language model

We used the [NINJAL Web Japanese Corpus (NWJC)](https://pj.ninjal.ac.jp/corpus_center/nwjc/) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.

NWJC was cleaned to remove unnecessary sentences before it was used for training.
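The exact cleaning procedure is documented in the SudachiTra repository; the following is only a hypothetical illustration of the kind of sentence filtering such cleaning typically involves. The thresholds, file names, and rules here are invented for the example.

```python
import re

def keep_sentence(line: str) -> bool:
    """Hypothetical filter: drop lines unlikely to be useful training text."""
    line = line.strip()
    if len(line) < 10:  # too short to be a full sentence
        return False
    # Require some Japanese characters (hiragana, katakana, or kanji).
    if not re.search(r"[\u3040-\u30ff\u4e00-\u9fff]", line):
        return False
    return True

# Invented file names, for illustration only.
with open("nwjc_raw.txt", encoding="utf-8") as src, \
     open("nwjc_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_sentence(line):
            dst.write(line)
```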

The model was trained with the BERT pre-training scripts implemented by [NVIDIA](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/LanguageModeling/BERT).

## License

Copyright (c) 2022 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved.

"chiTra" is distributed by [National Institute for Japanese Langauge and Linguistics](https://www.ninjal.ac.jp/) and [Works Applications Co.,Ltd.](https://www.worksap.co.jp/) under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## Citation

```
@INPROCEEDINGS{katsuta2022chitra,
    author    = {勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸},
    title     = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
    booktitle = {言語処理学会第28回年次大会(NLP2022)},
    year      = {2022},
    pages     = {},
    publisher = {言語処理学会},
}
```