language: ja
license: cc-by-sa-4.0
datasets:
- wikipedia
- cc100
- oscar
mask_token: '[MASK]'
widget:
- text: '[MASK] 大学 で 自然 言語 処理 を 学ぶ 。'
nlp-waseda/bigbird-base-japanese
Model description
This is a Japanese BigBird base model pretrained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.
How to use
You can use this model for masked language modeling as follows:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/bigbird-base-japanese")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/bigbird-base-japanese")
sentence = '[MASK] 大学 で 自然 言語 処理 を 学ぶ 。' # input should be segmented into words by Juman++ in advance
encoding = tokenizer(sentence, return_tensors='pt')
...
You can fine-tune this model on downstream tasks.
Tokenization
The input text should be segmented into words by Juman++ in advance. Juman++ 2.0.0-rc3 was used for pretraining. Each word is tokenized into tokens by sentencepiece.
Vocabulary
The vocabulary consists of 32000 tokens including words (JumanDIC) and subwords induced by the unigram language model of sentencepiece.
Training procedure
This model was trained on Japanese Wikipedia (as of 20221101), the Japanese portion of CC-100, and the and the Japanese portion of OSCAR. It took two weeks using 16 NVIDIA A100 GPUs using transformers and DeepSpeed.
The following hyperparameters were used during pretraining:
- learning_rate: 1e-4
- per_device_train_batch_size: 6
- gradient_accumulation_steps: 2
- total_train_batch_size: 192
- max_seq_length: 4096
- training_steps: 600000
- warmup_steps: 6000
- bf16: true
- deepspeed: ds_config.json
Performance on JGLUE
We fine-tuned the following models and evaluated them on the dev set of JGLUE. We tuned learning rate and training epochs for each model and task following the JGLUE paper.
For the tasks other than MARC-ja, the maximum length is short, so the attention_type was set to "original_full", and fine-tuning was performed. For MARC-ja, both "block_sparse" and "original_full" were used.
Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
---|---|---|---|---|---|---|---|
Waseda RoBERTa base | 0.965 | 0.913 | 0.876 | 0.905 | 0.853 | 0.916 | 0.853 |
Waseda RoBERTa large (seq512) | 0.969 | 0.925 | 0.890 | 0.928 | 0.910 | 0.955 | 0.900 |
BigBird base (original_full) | 0.959 | 0.888 | 0.846 | 0.896 | 0.884 | 0.933 | 0.787 |
BigBird base (block_sparse) | 0.959 | - | - | - | - | - | - |
Acknowledgments
This work was supported by AI Bridging Cloud Infrastructure (ABCI) through the "Construction of a Japanese Large-Scale General-Purpose Language Model that Handles Long Sequences" at the 3rd ABCI Grand Challenge 2022.