hiroshi-matsuda-rit/bert-base-sudachitra-v11

	---
	language: ja
	license: apache-2.0
	tags:
	- SudachiTra
	- Sudachi
	- SudachiPy
	- bert
	- Japanese
	- NWJC
	datasets:
	- NWJC
	---

	# bert-base-sudachitra-v11

	This model is a variant of SudachiTra.
	The differences between the original `chiTra v1.1` and `bert-base-sudachitra-v11` are:
	- `word_form_type` was changed from `normalized_nouns` to `surface`
	- In `vocab.txt`, two consecutive empty lines were replaced with a dummy entry followed by an empty line
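
The `vocab.txt` change can be pictured with a minimal sketch, assuming the vocabulary is handled as a list of lines with one entry per line. The function name `replace_double_empty_lines` and the placeholder string `[DUMMY]` are hypothetical; this card does not state which dummy entry was actually written to the file.

```python
from typing import List


def replace_double_empty_lines(lines: List[str], dummy: str = "[DUMMY]") -> List[str]:
    """Return a copy of `lines` in which each pair of consecutive empty
    entries becomes `dummy` followed by a single empty entry.

    `dummy` is a hypothetical placeholder, not the entry used in the real file.
    """
    patched: List[str] = []
    i = 0
    while i < len(lines):
        is_pair = lines[i] == "" and i + 1 < len(lines) and lines[i + 1] == ""
        if is_pair:
            # Two empty lines in a row: keep two lines, but make the first non-empty.
            patched.extend([dummy, ""])
            i += 2
        else:
            patched.append(lines[i])
            i += 1
    return patched


# Toy vocabulary, hand-written for illustration only.
vocab = ["[PAD]", "[UNK]", "", "", "の"]
patched = replace_double_empty_lines(vocab)

# The number of lines is unchanged, so later entries keep their line positions.
assert len(patched) == len(vocab)
print(patched)
```

Because two lines are replaced by two lines, the line position of every following entry is preserved in this sketch.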

	See also the description from the original `README.md`, reproduced below.

	(See [GitHub - WorksApplications/SudachiTra](https://github.com/WorksApplications/SudachiTra) for the latest README)

	# Sudachi Transformers (chiTra)

	chiTra provides pre-trained language models and a Japanese tokenizer for [Transformers](https://github.com/huggingface/transformers).

	## chiTra pretrained language model

	We used the [NINJAL Web Japanese Corpus (NWJC)](https://pj.ninjal.ac.jp/corpus_center/nwjc/) from the National Institute for Japanese Language and Linguistics, which contains text from around 100 million web pages.

	Before training, NWJC was cleaned to remove unnecessary sentences.

	The BERT model was trained using a pre-training script implemented by [NVIDIA](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/LanguageModeling/BERT).

	## License

	Copyright (c) 2022 National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. All rights reserved.

	"chiTra" is distributed by the [National Institute for Japanese Language and Linguistics](https://www.ninjal.ac.jp/) and [Works Applications Co., Ltd.](https://www.worksap.co.jp/) under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

	## Citation

	```
	@INPROCEEDINGS{katsuta2022chitra,
	author = {勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸},
	title = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
	booktitle = "言語処理学会第28回年次大会(NLP2022)",
	year = "2022",
	pages = "",
	publisher = "言語処理学会",
	}
	```