---
license: apache-2.0
tags:
  - gpt2
language: ko
---

# KoGPT2-small

| Model | Batch Size | Tokenizer | Vocab Size | Max Length | Parameter Size |
|-------|------------|-----------|------------|------------|----------------|
| GPT2  | 64         | BPE       | 30,000     | 1024       | 108M           |
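The 108M parameter figure can be sanity-checked from the table. This sketch assumes the standard GPT-2 small architecture (12 layers, hidden size 768, tied embedding/LM-head weights) — those values are not stated in this card; vocab size 30,000 and max length 1024 come from the table.

```python
# Approximate GPT-2 parameter count from the table's hyperparameters.
# d_model=768 and n_layers=12 are assumptions (standard GPT-2 small).
def gpt2_param_count(vocab_size, max_len, d_model=768, n_layers=12):
    # Token + position embeddings (LM head is weight-tied, so it adds nothing).
    emb = vocab_size * d_model + max_len * d_model
    per_layer = (
        d_model * 3 * d_model + 3 * d_model  # fused Q/K/V projection
        + d_model * d_model + d_model        # attention output projection
        + d_model * 4 * d_model + 4 * d_model  # MLP up-projection
        + 4 * d_model * d_model + d_model      # MLP down-projection
        + 2 * 2 * d_model                      # two LayerNorms (scale + bias)
    )
    final_ln = 2 * d_model
    return emb + n_layers * per_layer + final_ln

total = gpt2_param_count(30_000, 1024)
print(f"{total:,} parameters")  # roughly 108M, matching the table
```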

## Dataset

- AIHub – Korean corpus built from web data (4.8M)
- KoWiki dump 230701 (1.4M)

## Inference Example

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

text = "์ถœ๊ทผ์ด ํž˜๋“ค๋ฉด"  # prompt: "If commuting to work is tiring"

tokenizer = AutoTokenizer.from_pretrained('Datascience-Lab/GPT2-small')
model = GPT2LMHeadModel.from_pretrained('Datascience-Lab/GPT2-small')

inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=False)

outputs = model.generate(inputs['input_ids'], max_length=128,
                         repetition_penalty=2.0,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         bos_token_id=tokenizer.bos_token_id,
                         use_cache=True,
                         do_sample=True,  # temperature only takes effect when sampling
                         temperature=0.5)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)

# Sample output:
# '์ถœ๊ทผ์ด ํž˜๋“ค๋ฉด ์ถœ๊ทผ์„ ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ด ์ข‹๋‹ค. ํ•˜์ง€๋งŒ ์ถœํ‡ด๊ทผ ์‹œ๊ฐ„์„ ๋Šฆ์ถ”๋Š” ๊ฒƒ์€ ์˜คํžˆ๋ ค ๊ฑด๊ฐ•์— ์ข‹์ง€ ์•Š๋‹ค. ํŠนํžˆ๋‚˜ ์žฅ์‹œ๊ฐ„์˜ ์—…๋ฌด๋กœ ์ธํ•ด ํ”ผ๋กœ๊ฐ€ ์Œ“์ด๊ณ  ๋ฉด์—ญ๋ ฅ์ด ๋–จ์–ด์ง€๋ฉด, ํ”ผ๋กœ๊ฐ์ด ์‹ฌํ•ด์ ธ์„œ ์ž ๋“ค๊ธฐ ์–ด๋ ค์šด ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. ์ด๋Ÿฐ ๊ฒฝ์šฐ๋ผ๋ฉด ํ‰์†Œ๋ณด๋‹ค ๋” ๋งŽ์€ ์–‘์œผ๋กœ ๊ณผ์‹์„ ํ•˜๊ฑฐ๋‚˜ ๋ฌด๋ฆฌํ•œ ๋‹ค์ด์–ดํŠธ๋ฅผ ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์‹๋‹จ ์กฐ์ ˆ๊ณผ ํ•จ๊ป˜ ์˜์–‘ ๋ณด์ถฉ์— ์‹ ๊ฒฝ ์จ์•ผ ํ•œ๋‹ค. ๋˜ํ•œ ๊ณผ๋„ํ•œ ์Œ์‹์ด ์ฒด์ค‘ ๊ฐ๋Ÿ‰์— ๋„์›€์„ ์ฃผ๋ฏ€๋กœ ์ ์ ˆํ•œ ์šด๋™๋Ÿ‰์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ๋„ ์ค‘์š”ํ•˜๋‹ค.'
```