|
--- |
|
language: ja |
|
license: apache-2.0 |
|
tags: |
|
- SudachiTra |
|
- Sudachi |
|
- SudachiPy |
|
- bert |
|
- Japanese |
|
- NWJC |
|
datasets: |
|
- NWJC |
|
--- |
|
|
|
# bert-base-sudachitra-v11 |
|
|
|
This model is a variant of the SudachiTra pre-trained BERT model `chiTra v1.1`.
|
The differences between the original `chiTra v1.1` and `bert-base-sudachitra-v11` are: |
|
- `word_form_type` was changed from `normalized_nouns` to `surface` (see the tokenizer sketch below)
|
- The two consecutive empty lines in `vocab.txt` were replaced with a dummy entry followed by an empty line
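
For illustration, the following minimal sketch shows where the `word_form_type` setting enters when constructing the tokenizer directly. It assumes the `BertSudachipyTokenizer` constructor accepts a `word_form_type` keyword, as described in the SudachiTra documentation; the `vocab.txt` path is illustrative.

```python
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

# chiTra v1.1 sets word_form_type="normalized_nouns"; this variant uses
# "surface", so tokens keep their written (surface) forms instead of being
# normalized before vocabulary lookup.
tokenizer = BertSudachipyTokenizer(
    vocab_file="vocab.txt",  # illustrative path to this model's vocab.txt
    word_form_type="surface",
)

# Tokens come back in their surface forms.
print(tokenizer.tokenize("表記ゆれに頑健なモデルを構築する。"))
```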
|
|
|
See also the original `README.md` description reproduced below.
|
|
|
*(See [GitHub - WorksApplications/SudachiTra](https://github.com/WorksApplications/SudachiTra) for the latest README)* |
|
|
|
# Sudachi Transformers (chiTra) |
|
|
|
chiTra provides pre-trained language models and a Japanese tokenizer for [Transformers](https://github.com/huggingface/transformers).
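
For example, loading this model through Transformers could look like the following minimal sketch. The model ID is illustrative (replace it with this repository's actual ID), and it assumes `sudachitra` and a Sudachi dictionary package such as `sudachidict_core` are installed.

```python
from transformers import BertModel
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

# Illustrative model ID; replace with this repository's actual ID.
MODEL_ID = "WorksApplications/bert-base-sudachitra-v11"

tokenizer = BertSudachipyTokenizer.from_pretrained(MODEL_ID)
model = BertModel.from_pretrained(MODEL_ID)

# Encode a Japanese sentence and run it through the encoder.
inputs = tokenizer("すだちの木を植えた。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```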
|
|
|
## chiTra pretrained language model |
|
|
|
We used the [NINJAL Web Japanese Corpus (NWJC)](https://pj.ninjal.ac.jp/corpus_center/nwjc/) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.
|
|
|
NWJC was cleaned to remove unnecessary sentences before being used for pre-training.
|
|
|
This model was trained using the BERT pre-training scripts implemented by [NVIDIA](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/LanguageModeling/BERT).
|
|
|
## License |
|
|
|
Copyright (c) 2022 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved. |
|
|
|
"chiTra" is distributed by [National Institute for Japanese Langauge and Linguistics](https://www.ninjal.ac.jp/) and [Works Applications Co.,Ltd.](https://www.worksap.co.jp/) under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). |
|
|
|
## Citation |
|
|
|
```bibtex
@INPROCEEDINGS{katsuta2022chitra,
    author    = {勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸},
    title     = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
    booktitle = "言語処理学会第28回年次大会(NLP2022)",
    year      = "2022",
    pages     = "",
    publisher = "言語処理学会",
}
```