	---
	language:
	- ja
	license: cc-by-sa-4.0
	datasets:
	- wikipedia
	- cc100
	widget:
	- text: "早稲田大学で自然言語処理を"
	---

	# nlp-waseda/gpt2-small-japanese

	This model is a Japanese GPT-2 pretrained on Japanese Wikipedia and the Japanese portion of CC-100.

	## Intended uses & limitations

	You can use the raw model for text generation or fine-tune it on a downstream task.

	Note that input texts should be segmented into words with Juman++ before they are passed to the tokenizer.
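The segmentation step can be sketched as follows. This is a minimal sketch, not the authors' preprocessing script. It assumes the `pyknp` wrapper and a `jumanpp` binary on the `PATH`, and it assumes the model expects words joined by single spaces, which this card does not state explicitly. The helper names `jumanpp_words`, `join_words`, and `prepare_input` are hypothetical.

```python
from typing import Callable, Iterable, List


def jumanpp_words(text: str) -> List[str]:
    """Segment text into surface forms with Juman++ through pyknp.

    Assumption: pyknp is installed and the `jumanpp` binary is on PATH.
    The import is lazy so the other helpers work without either.
    """
    from pyknp import Juman

    return [m.midasi for m in Juman().analysis(text).mrph_list()]


def join_words(words: Iterable[str]) -> str:
    """Join segmented words with single spaces (assumed input format)."""
    return " ".join(w for w in words if w.strip())


def prepare_input(
    text: str, segment: Callable[[str], List[str]] = jumanpp_words
) -> str:
    """Segment raw text, then join the words for the tokenizer."""
    return join_words(segment(text))
```

With Juman++ available, `prepare_input(text)` could be applied to a prompt before it is handed to the pipeline or tokenizer. Passing a different `segment` callable makes it possible to try the joining logic without Juman++ installed.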

	### How to use

	You can use this model directly with a pipeline for text generation. Because generation with sampling is random, we set a seed for reproducibility:

	```python
	>>> from transformers import pipeline, set_seed
	>>> generator = pipeline('text-generation', model='nlp-waseda/gpt2-small-japanese')
	>>> set_seed(42)
	>>> generator("早稲田大学で自然言語処理を", max_length=30, do_sample=True, pad_token_id=2, num_return_sequences=5)
	[{'generated_text': '早稲田大学で自然言語処理を学び、帰国後、早稲田大学理工学部に入学します。卒業後、早稲田大学工学研究科、'},
	{'generated_text': '早稲田大学で自然言語処理を学び、アメリカの大学で学士号を取得、修士の取得で博士号を取得。 2008 年'},
	{'generated_text': '早稲田大学で自然言語処理を勉強しています。学部は日本語学科を専攻しています。英語が話せるという'},
	{'generated_text': '早稲田大学で自然言語処理を専攻していた。 2011 年に第 26 回日本化学会学生委員会奨励賞 ( 第 2 年次審査'},
	{'generated_text': '早稲田大学で自然言語処理を中心とする言語学研究を行っている。東京都・豊島区のお見合い相手。'}]
	```

	Here is how to use this model to get the features of a given text in PyTorch:

	```python
	from transformers import ReformerTokenizer, GPT2Model

	# The SentencePiece vocabulary is loaded through ReformerTokenizer.
	tokenizer = ReformerTokenizer.from_pretrained('nlp-waseda/gpt2-small-japanese')
	model = GPT2Model.from_pretrained('nlp-waseda/gpt2-small-japanese')

	text = "早稲田大学で自然言語処理を"
	encoded_input = tokenizer(text, return_tensors='pt')
	# output.last_hidden_state holds one feature vector per token.
	output = model(**encoded_input)
	```

	## Training data

	The GPT-2 model was pretrained on the 2022-03-20 dump of Japanese Wikipedia and the Japanese portion of CC-100.

	## Training procedure

	### Preprocessing

	The texts are normalized using zenhan, segmented into words using Juman++, and tokenized using SentencePiece. Juman++ 2.0.0-rc3 was used for pretraining.
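To illustrate what width normalization does, here is a standard-library stand-in. It is not zenhan itself, and it covers ASCII characters and the space only, whereas zenhan also handles katakana. The card does not say which direction (half-width to full-width or the reverse) was applied, so both are shown. The function names `to_fullwidth` and `to_halfwidth` are hypothetical.

```python
# Full-width forms of ASCII '!'..'~' sit at a fixed offset in Unicode.
_OFFSET = 0xFEE0

# Half-width -> full-width table; the space maps to the ideographic space.
_H2Z = {code: code + _OFFSET for code in range(0x21, 0x7F)}
_H2Z[0x20] = 0x3000

# Inverse table for full-width -> half-width.
_Z2H = {full: half for half, full in _H2Z.items()}


def to_fullwidth(text: str) -> str:
    """Convert half-width ASCII characters to their full-width forms."""
    return text.translate(_H2Z)


def to_halfwidth(text: str) -> str:
    """Convert full-width ASCII forms back to half-width characters."""
    return text.translate(_Z2H)
```

Characters outside the two tables, including kanji and hiragana, pass through unchanged, so either function can be applied to mixed Japanese and ASCII text.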

	The model was trained on 8 NVIDIA A100 GPUs.