
Bad_text_classifier

Model Introduction

We are releasing a model that determines whether comments and chat messages found across the internet contain sensitive content. The model was fine-tuned on a dataset built from public data by revising the labels and merging the datasets. Please understand that the model cannot always judge every sentence correctly.

NOTE)
Due to copyright issues with the public data, the modified data used to train the model cannot be released.
Also, please note in advance that the model's opinions have nothing to do with my own.

Dataset

data label

  • 0 : bad sentence
  • 1 : not bad sentence

Datasets used

Dataset processing

The two datasets, which were not originally binary classification datasets, were first relabeled into binary form. Then only the label 1 (not bad sentence) examples from the Korean HateSpeech Dataset were selected and merged into the processed Korean Unsmile Dataset.
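The relabel-and-merge step described above can be sketched roughly as follows. This is only an illustration, not the author's actual preprocessing script: the field names `text`, `clean`, and `hate`, and the value `"none"` for non-hateful HateSpeech rows, are assumptions about how the two datasets are laid out.

```python
def unsmile_to_binary(row):
    # Assumption: the Unsmile row has a "clean" flag that is 1 when no
    # hate category applies. 1 = not bad sentence, 0 = bad sentence.
    return 1 if row["clean"] == 1 else 0


def hatespeech_to_binary(row):
    # Assumption: the HateSpeech row has a "hate" field whose value is
    # "none" for non-hateful comments.
    return 1 if row["hate"] == "none" else 0


def build_dataset(unsmile_rows, hatespeech_rows):
    # Every Unsmile row is kept, with its label reduced to binary form.
    merged = [
        {"text": r["text"], "label": unsmile_to_binary(r)}
        for r in unsmile_rows
    ]
    # Only label 1 (not bad sentence) rows are taken from HateSpeech.
    merged += [
        {"text": r["text"], "label": 1}
        for r in hatespeech_rows
        if hatespeech_to_binary(r) == 1
    ]
    return merged
```

Because only label 1 rows are taken from the second dataset, the merge adds "not bad" examples without adding any new "bad" ones.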

Some of the examples labeled clean in the Korean Unsmile Dataset were changed to 0 (bad sentence).

  • Among sentences containing "~노", those that also contain "이기" or "노무" were changed to 0 (bad sentence)
  • Examples containing sexually suggestive terms such as "좆" or "봊" were changed to 0 (bad sentence)
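The two relabeling rules above could be expressed as a small function like the one below. It is a minimal sketch under assumptions: plain substring matching is used because the exact matching method is not described, and `SEXUAL_TERMS` holds only the two terms named above, although the list is described as open-ended.

```python
# Hypothetical term list: only the two examples named in the rules above.
SEXUAL_TERMS = ("좆", "봊")


def relabel_clean(text, label):
    """Return the corrected label for one example (0 = bad, 1 = not bad)."""
    # Only examples currently labeled 1 (clean / not bad) are reconsidered.
    if label != 1:
        return label
    # Rule 1: contains "노" and also "이기" or "노무".
    if "노" in text and ("이기" in text or "노무" in text):
        return 0
    # Rule 2: contains a sexually suggestive term.
    if any(term in text for term in SEXUAL_TERMS):
        return 0
    return label
```

Since "노무" itself contains "노", any sentence with "노무" satisfies the first rule under substring matching. Whether the original processing behaved the same way is not stated.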

Model Training

  • Fine-tuning was performed with ElectraForSequenceClassification from Hugging Face transformers.
  • Three of the publicly available Korean Electra models were each trained separately.

Models used

How to use model?

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('JminJ/koElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/koElectra_base_Bad_Sentence_Classifier')
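Once `model` and `tokenizer` are loaded as above, a single sentence might be classified along these lines. This is a sketch, not an official usage example: the helper names are hypothetical, and it assumes the checkpoint's output indices follow the label convention in the Dataset section (0 = bad sentence, 1 = not bad sentence).

```python
import math

# Assumed to match the "data label" section of this card.
ID2LABEL = {0: "bad sentence", 1: "not bad sentence"}


def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    # Pick the most probable class and return (label name, probability).
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]


def classify(text, model, tokenizer):
    # torch is imported lazily so the helpers above work without it.
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return label_from_logits(logits)
```

Calling `classify("some sentence", model, tokenizer)` returns a label name together with the probability assigned to it.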

Model Valid Accuracy

model                                       accuracy
kcElectra_base_fp16_wd_custom_dataset       0.8849
tunibElectra_base_fp16_wd_custom_dataset    0.8726
koElectra_base_fp16_wd_custom_dataset       0.8434
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
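For anyone trying to reproduce a comparable run, the shared hyperparameters could be collected in one place, as in the sketch below. Several parts are assumptions rather than facts from this card: the seed value is not given (42 is a placeholder), `fp16=True` is only inferred from the "fp16" in the model names, and it is not stated whether the Hugging Face `Trainer` was used or whether 128 is a per-device or global batch size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    learning_rate: float = 3e-06  # from the note above
    weight_decay: float = 0.001   # from the note above
    batch_size: int = 128         # from the note above
    fp16: bool = True             # assumption, inferred from model names
    seed: int = 42                # hypothetical; actual seed is not given
    num_labels: int = 2           # labels 0 and 1

    def to_training_arguments_kwargs(self, output_dir):
        # Keyword names follow transformers.TrainingArguments.
        return {
            "output_dir": output_dir,
            "learning_rate": self.learning_rate,
            "weight_decay": self.weight_decay,
            "per_device_train_batch_size": self.batch_size,
            "fp16": self.fp16,
            "seed": self.seed,
        }
```

The returned dictionary can be unpacked into `TrainingArguments(**kwargs)` if the `Trainer` API is used. Holding the same configuration fixed across all three base models keeps the accuracy comparison between them meaningful.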

Contact

Github

Reference