metadata

language: ja
license: cc-by-sa-4.0
datasets:
  - wikipedia
  - cc100
  - oscar
mask_token: '[MASK]'
widget:
  - text: '[MASK] 大学 で 自然 言語 処理 を 学ぶ 。'

nlp-waseda/bigbird-base-japanese

Model description

This is a Japanese BigBird base model pretrained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

How to use

You can use this model for masked language modeling as follows:

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/bigbird-base-japanese")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/bigbird-base-japanese")

sentence = '[MASK] 大学 で 自然 言語 処理 を 学ぶ 。' # input should be segmented into words by Juman++ in advance
encoding = tokenizer(sentence, return_tensors='pt')
...
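
The snippet above stops after encoding the sentence. As a hedged sketch of one way to continue (not necessarily what the elided code does), the following reads the logits at each [MASK] position and lists the highest-scoring tokens. The helper names `mask_positions` and `top_tokens_at_masks` are hypothetical, and the heavy imports sit inside the function so that defining it does not load the model.

```python
def mask_positions(input_ids, mask_token_id):
    """Return the indices in input_ids that hold the mask token id."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]


def top_tokens_at_masks(sentence, k=5, name="nlp-waseda/bigbird-base-japanese"):
    """For each [MASK] in a pre-segmented sentence, return the k best tokens."""
    # Imported lazily: torch and transformers are only needed when this runs.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    model.eval()

    encoding = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, seq_len, vocab_size)

    input_ids = encoding["input_ids"][0].tolist()
    results = []
    for pos in mask_positions(input_ids, tokenizer.mask_token_id):
        # Indices of the k largest logits at this masked position.
        top_ids = torch.topk(logits[0, pos], k).indices.tolist()
        results.append(tokenizer.convert_ids_to_tokens(top_ids))
    return results
```

Calling `top_tokens_at_masks('[MASK] 大学 で 自然 言語 処理 を 学ぶ 。')` would download the model and return one list of candidate tokens per mask.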

You can fine-tune this model on downstream tasks.

Tokenization

The input text should be segmented into words by Juman++ in advance. Juman++ 2.0.0-rc3 was used for pretraining. Each word is then split into subword tokens by sentencepiece.
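
The pre-segmentation step could be scripted roughly as below. This is a minimal sketch, assuming a `jumanpp` executable is on PATH and prints one analysis line per word (surface form first, fields separated by spaces) followed by `EOS`. The function names are hypothetical. The parser is deliberately simplified: it skips lines starting with `@`, which Juman++ uses for alternative analyses, so a literal `@` in the input would be dropped.

```python
import subprocess


def surfaces_from_jumanpp(output):
    """Extract surface forms from Juman++ text output and join them with spaces."""
    words = []
    for line in output.splitlines():
        # "EOS" closes a sentence; lines starting with "@" are alternative
        # analyses of the previous word and would duplicate its surface form.
        if not line or line == "EOS" or line.startswith("@"):
            continue
        # The surface form is the first space-separated field.
        words.append(line.split(" ", 1)[0])
    return " ".join(words)


def segment(text, command="jumanpp"):
    """Run the Juman++ CLI on one sentence (assumes `jumanpp` is on PATH)."""
    completed = subprocess.run(
        [command],
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return surfaces_from_jumanpp(completed.stdout)
```

The string returned by `segment(...)` can then be passed to the tokenizer in place of a hand-segmented sentence.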

Vocabulary

The vocabulary consists of 32000 tokens including words (JumanDIC) and subwords induced by the unigram language model of sentencepiece.

Training procedure

This model was trained on Japanese Wikipedia (as of 20221101), the Japanese portion of CC-100, and the Japanese portion of OSCAR. Training took two weeks on 16 NVIDIA A100 GPUs with transformers and DeepSpeed.

The following hyperparameters were used during pretraining:

learning_rate: 1e-4
per_device_train_batch_size: 6
gradient_accumulation_steps: 2
total_train_batch_size: 192
max_seq_length: 4096
training_steps: 600000
warmup_steps: 6000
bf16: true
deepspeed: ds_config.json
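
As a quick consistency check on the values above, the total batch size follows from the per-device batch size, the gradient accumulation steps, and the 16 GPUs mentioned in the training description. The tokens-per-step figure below is an upper bound that assumes every sequence is filled to `max_seq_length`; that assumption is not stated in the source.

```python
per_device_train_batch_size = 6
gradient_accumulation_steps = 2
num_gpus = 16  # from the training description: 16 NVIDIA A100 GPUs
max_seq_length = 4096
training_steps = 600_000
warmup_steps = 6_000

# Sequences consumed per optimizer step across all devices.
total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Upper bound on tokens per step (assumes all sequences are full length).
max_tokens_per_step = total_train_batch_size * max_seq_length

# Share of training spent in learning-rate warmup.
warmup_fraction = warmup_steps / training_steps

print(total_train_batch_size)  # 192
print(max_tokens_per_step)     # 786432
print(warmup_fraction)         # 0.01
```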

Performance on JGLUE

We fine-tuned the following models and evaluated them on the dev set of JGLUE. We tuned the learning rate and the number of training epochs for each model and task, following the JGLUE paper.

For tasks other than MARC-ja, the maximum sequence length is short, so we fine-tuned with attention_type set to "original_full". For MARC-ja, we used both "block_sparse" and "original_full".
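
The attention type can be chosen when the model is loaded, since `attention_type` is a BigBird config field in transformers. The sketch below is an illustration, not the authors' fine-tuning script. The helper names are hypothetical, and the length rule is an assumption: it mirrors the point at which, as far as we understand, the transformers implementation itself falls back from block-sparse to full attention, using the library defaults `block_size=64` and `num_random_blocks=3`, which may differ from this model's config.

```python
def pick_attention_type(max_length, block_size=64, num_random_blocks=3):
    """Choose an attention_type for a given maximum input length.

    Assumption: block-sparse attention is only used for sequences longer
    than (5 + 2 * num_random_blocks) * block_size tokens; shorter inputs
    are handled with full attention.
    """
    min_sparse_length = (5 + 2 * num_random_blocks) * block_size
    return "block_sparse" if max_length > min_sparse_length else "original_full"


def load_classifier(max_length, num_labels, name="nlp-waseda/bigbird-base-japanese"):
    """Load the model for sequence classification with a chosen attention type."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        name,
        num_labels=num_labels,
        attention_type=pick_attention_type(max_length),
    )
```

With these defaults, a 512-token task would load with "original_full" and a 4096-token task with "block_sparse". Passing `attention_type` explicitly instead of using the helper reproduces either setting from the paragraph above.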

Model	MARC-ja/acc	JSTS/pearson	JSTS/spearman	JNLI/acc	JSQuAD/EM	JSQuAD/F1	JComQA/acc
Waseda RoBERTa base	0.965	0.913	0.876	0.905	0.853	0.916	0.853
Waseda RoBERTa large (seq512)	0.969	0.925	0.890	0.928	0.910	0.955	0.900
BigBird base (original_full)	0.959	0.888	0.846	0.896	0.884	0.933	0.787
BigBird base (block_sparse)	0.959	-	-	-	-	-	-

Acknowledgments

This work was supported by AI Bridging Cloud Infrastructure (ABCI) through the "Construction of a Japanese Large-Scale General-Purpose Language Model that Handles Long Sequences" at the 3rd ABCI Grand Challenge 2022.