metadata
language: ja
tags:
- t5
- text2text-generation
- seq2seq
license: apache-2.0
datasets:
- mc4
- wiki40b
t5-base-japanese-web-8k (with Byte-fallback, 8K)
Description
megagonlabs/t5-base-japanese-web-8k is a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts.
The training code is available on GitHub.
The vocabulary size of this model is 8K. A 32K version is also available.
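As a minimal sketch of how the checkpoint might be loaded, the helper below uses the Hugging Face `transformers` library. It assumes that `transformers`, `sentencepiece`, and PyTorch are installed and that the model is published on the Hub under the ID given above. The helper name `load_pretrained` is hypothetical and is not part of the original training code.

```python
MODEL_NAME = "megagonlabs/t5-base-japanese-web-8k"


def load_pretrained(model_name: str = MODEL_NAME):
    """Return (tokenizer, model) for a Hub model ID; downloads weights on first call.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining the helper does not require the libraries.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # The SentencePiece-backed (slow) tokenizer class is used here on the
    # assumption that it handles the byte-fallback pieces of this vocabulary.
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    return tokenizer, model
```

Calling `tokenizer, model = load_pretrained()` fetches the files over the network. Because this is a pre-trained checkpoint, it would typically be fine-tuned on a downstream task before use.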
Corpora
We used the following corpora for pre-training.
- Japanese in mC4/3.0.1 (we used the TensorFlow native format)
  - 87,425,304 pages
  - 782 GB in TFRecord format
- Japanese in wiki40b/1.3.0
  - 828,236 articles (2,073,584 examples)
  - 2 GB in TFRecord format
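To put the two corpora in proportion, the following sketch computes each corpus's share of the total TFRecord size and a rough average size per document from the figures listed above. It treats 1 GB as 10^9 bytes, which is an assumption, and the reported sizes are rounded, so the results are only approximate. It says nothing about how the corpora were mixed during training.

```python
# Figures copied from the corpus list above.
CORPORA = {
    "mC4/3.0.1 (ja)": {"documents": 87_425_304, "size_gb": 782},
    "wiki40b/1.3.0 (ja)": {"documents": 828_236, "size_gb": 2},
}
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes

total_gb = sum(c["size_gb"] for c in CORPORA.values())

stats = {}
for name, c in CORPORA.items():
    stats[name] = {
        # Fraction of the combined TFRecord size taken by this corpus.
        "size_share": c["size_gb"] / total_gb,
        # Rough average TFRecord bytes per page or article.
        "avg_bytes_per_doc": c["size_gb"] * BYTES_PER_GB / c["documents"],
    }

for name, s in stats.items():
    print(f"{name}: {s['size_share']:.2%} of bytes, "
          f"~{s['avg_bytes_per_doc']:.0f} bytes per document")
```

By size, mC4 accounts for nearly all of the pre-training data, and wiki40b for well under one percent.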
Tokenizer
We trained the SentencePiece tokenizer on Japanese Wikipedia.
- Vocabulary size: 8,000
- Byte-fallback: Enabled
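With byte-fallback enabled, SentencePiece can represent a character that is missing from the vocabulary as a sequence of UTF-8 byte pieces such as `<0xF0>` rather than as an unknown token. The toy encoder below illustrates only that idea. It is a character-level simplification with a hypothetical three-character vocabulary, not SentencePiece's actual segmentation algorithm.

```python
from typing import List, Set


def byte_fallback_pieces(text: str) -> List[str]:
    """Spell out text as one <0xXX> piece per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def toy_encode(text: str, vocab: Set[str]) -> List[str]:
    """Toy encoder: known characters pass through, unknown ones become byte pieces."""
    pieces: List[str] = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            pieces.extend(byte_fallback_pieces(ch))
    return pieces


toy_vocab = {"日", "本", "語"}  # hypothetical vocabulary, for illustration only
pieces = toy_encode("日本語😀", toy_vocab)
print(pieces)
# ['日', '本', '語', '<0xF0>', '<0x9F>', '<0x98>', '<0x80>']
```

Because every string can be spelled out as bytes, a small 8K vocabulary can still encode rare kanji, emoji, and symbols without loss, at the cost of longer token sequences for such characters.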
Parameters
- T5 model: models/t5.1.1.base.gin
- Training steps: 1,000,000
Pre-training took about 126 hours on a TPU v3-8.
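The reported step count and wall-clock time imply an approximate training throughput, which the short calculation below works out. It assumes the 126 hours cover all 1,000,000 steps and ignores any time spent on checkpointing or evaluation, so the numbers are rough estimates.

```python
TRAINING_STEPS = 1_000_000
WALL_CLOCK_HOURS = 126  # "about 126 hours", so results are approximate

total_seconds = WALL_CLOCK_HOURS * 3600
steps_per_second = TRAINING_STEPS / total_seconds
seconds_per_step = total_seconds / TRAINING_STEPS
wall_clock_days = WALL_CLOCK_HOURS / 24

print(f"{steps_per_second:.2f} steps/s")    # 2.20 steps/s
print(f"{seconds_per_step:.3f} s/step")     # 0.454 s/step
print(f"{wall_clock_days:.2f} days")        # 5.25 days
```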
Related models
License
Apache License 2.0
Citations
- mC4
Contains information from mC4, which is made available under the ODC Attribution License.
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}
- wiki40b
@inproceedings{49029,
title = {Wiki-40B: Multilingual Language Model Dataset},
author = {Mandy Guo and Zihang Dai and Denny Vrandecic and Rami Al-Rfou},
year = {2020},
booktitle = {LREC 2020}
}