---
language:
  - ko
---

# KoRean based Bert pre-trained (KR-BERT)

This is a release of Korean-specific, small-scale BERT models with comparable or better performance, developed by the Computational Linguistics Lab at Seoul National University and described in [KR-BERT: A Small-Scale Korean-Specific Language Model](https://arxiv.org/abs/2008.03979).


## Vocab, Parameters and Data

|                | Multilingual BERT (Google) | KorBERT (ETRI) | KoBERT (SKT) | KR-BERT character | KR-BERT sub-character |
| -------------- | -------------------------- | -------------- | ------------ | ----------------- | --------------------- |
| vocab size     | 119,547 | 30,797 | 8,002 | 16,424 | 12,367 |
| parameter size | 167,356,416 | 109,973,391 | 92,186,880 | 99,265,066 | 96,145,233 |
| data size      | - (Wikipedia data for 104 languages) | 23GB, 4.7B morphemes | - (25M sentences, 233M words) | 2.47GB, 20M sentences, 233M words | 2.47GB, 20M sentences, 233M words |
| Model                                        | Masked LM Accuracy |
| -------------------------------------------- | ------------------ |
| KoBERT                                       | 0.750              |
| KR-BERT character BidirectionalWordPiece     | 0.779              |
| KR-BERT sub-character BidirectionalWordPiece | 0.769              |

## Sub-character

Korean text is basically represented with Hangul syllable characters, which can be decomposed into sub-characters, or graphemes. To accommodate such characteristics, we trained a new vocabulary and BERT model on two different representations of a corpus: syllable characters and sub-characters.

If you use our sub-character model, preprocess your data with the code below.

```python
import torch
from transformers import BertConfig, BertModel, BertForPreTraining, BertTokenizer
from unicodedata import normalize

tokenizer_krbert = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)

# convert a string into sub-characters (NFKD decomposes each Hangul syllable into its jamo)
def to_subchar(string):
    return normalize('NFKD', string)

sentence = '토크나이저 예시입니다.'
print(tokenizer_krbert.tokenize(to_subchar(sentence)))
```
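As a quick illustration (not part of the original scripts), NFKD normalization splits each precomposed Hangul syllable into its conjoining jamo, which is the representation the sub-character vocabulary is trained on:

```python
from unicodedata import normalize

# '한' (one precomposed syllable) decomposes into three jamo code points:
# choseong ㅎ, jungseong ㅏ, jongseong ㄴ.
decomposed = normalize('NFKD', '한')
print(len('한'), len(decomposed))  # 1 3
```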

## Tokenization

### BidirectionalWordPiece Tokenizer

We use the BidirectionalWordPiece model to reduce search costs while maintaining the possibility of choice. This model applies BPE in both forward and backward directions to obtain two candidates and chooses the one that has a higher frequency.
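The toy sketch below (not the released tokenizer code; the vocabulary and its frequencies are hypothetical) illustrates the idea: segment a word greedily from the left and from the right, then keep the candidate whose pieces are more frequent in the vocabulary.

```python
def greedy_segment(word, vocab, from_right=False):
    """Greedy longest-match segmentation, from the left or from the right end."""
    pieces, s = [], word
    while s:
        if from_right:
            # take the longest suffix found in the vocabulary (fall back to one character)
            for i in range(len(s)):
                if s[i:] in vocab or i == len(s) - 1:
                    pieces.insert(0, s[i:])
                    s = s[:i]
                    break
        else:
            # take the longest prefix found in the vocabulary (fall back to one character)
            for i in range(len(s), 0, -1):
                if s[:i] in vocab or i == 1:
                    pieces.append(s[:i])
                    s = s[i:]
                    break
    return pieces

def bidirectional_wordpiece(word, vocab):
    """Compare the forward and backward candidates and keep the more frequent one."""
    forward = greedy_segment(word, vocab)
    backward = greedy_segment(word, vocab, from_right=True)
    frequency = lambda pieces: sum(vocab.get(p, 0) for p in pieces)
    return forward if frequency(forward) >= frequency(backward) else backward

# hypothetical subword frequencies
vocab = {'냉장고': 120, '냉': 40, '장고': 15, '고': 60}
print(bidirectional_wordpiece('냉장고', vocab))  # ['냉장고']
```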

| word | Multilingual BERT | KorBERT character | KoBERT | KR-BERT character WordPiece | KR-BERT character BidirectionalWordPiece | KR-BERT sub-character WordPiece | KR-BERT sub-character BidirectionalWordPiece |
| ---- | ----------------- | ----------------- | ------ | --------------------------- | ---------------------------------------- | ------------------------------- | -------------------------------------------- |
| 냉장고 nayngcangko "refrigerator" | 냉#장#고 (nayng#cang#ko) | 냉#장#고 (nayng#cang#ko) | 냉#장#고 (nayng#cang#ko) | 냉장고 (nayngcangko) | 냉장고 (nayngcangko) | 냉장고 (nayngcangko) | 냉장고 (nayngcangko) |
| 춥다 chwupta "cold" | [UNK] | 춥#다 (chwup#ta) | 춥#다 (chwup#ta) | 춥#다 (chwup#ta) | 춥#다 (chwup#ta) | 추#ㅂ다 (chwu#pta) | 추#ㅂ다 (chwu#pta) |
| 뱃사람 paytsalam "seaman" | [UNK] | 뱃#사람 (payt#salam) | 뱃#사람 (payt#salam) | 뱃#사람 (payt#salam) | 뱃#사람 (payt#salam) | 배#ㅅ#사람 (pay#t#salam) | 배#ㅅ#사람 (pay#t#salam) |
| 마이크 maikhu "microphone" | 마#이#크 (ma#i#khu) | 마이#크 (mai#khu) | 마#이#크 (ma#i#khu) | 마이크 (maikhu) | 마이크 (maikhu) | 마이크 (maikhu) | 마이크 (maikhu) |

## Models

|                                  | TensorFlow character | TensorFlow sub-character | PyTorch character | PyTorch sub-character |
| -------------------------------- | -------------------- | ------------------------ | ----------------- | --------------------- |
| WordPiece tokenizer              | WP char              | WP subchar               | WP char           | WP subchar            |
| BidirectionalWordPiece tokenizer | BiWP char            | BiWP subchar             | BiWP char         | BiWP subchar          |
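For the character model hosted in this repository, loading with a recent version of 🤗 Transformers might look like the sketch below. The repository id `snunlp/KR-BERT-char16424` is assumed from this model card, and the API shown here is newer than the transformers 2.1.1 pinned for the training scripts further down.

```python
from transformers import AutoTokenizer, AutoModel

# Repository id assumed from this model card; use the matching repository
# for the sub-character or BidirectionalWordPiece checkpoints.
model_name = "snunlp/KR-BERT-char16424"

tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("토크나이저 예시입니다.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```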

## Requirements

- transformers == 2.1.1
- tensorflow < 2.0

## Downstream tasks

### Naver Sentiment Movie Corpus (NSMC)

- If you want to use the sub-character version of our models, set the `subchar` argument to `True`.
- You can use the original BERT WordPiece tokenizer by passing `bert` to the `tokenizer` argument, or our BidirectionalWordPiece tokenizer by passing `ranked`.
- tensorflow: after downloading our pretrained models, put them in a `models` directory inside the `krbert_tensorflow` directory.
- pytorch: after downloading our pretrained models, put them in a `pretrained` directory inside the `krbert_pytorch` directory.

```sh
# pytorch
python3 train.py --subchar {True, False} --tokenizer {bert, ranked}

# tensorflow
python3 run_classifier.py \
  --task_name=NSMC \
  --subchar={True, False} \
  --tokenizer={bert, ranked} \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --do_lower_case=False \
  --max_seq_length=128 \
  --train_batch_size=128 \
  --learning_rate=5e-05 \
  --num_train_epochs=5.0 \
  --output_dir={output_dir}
```

The PyTorch code structure is based on that of https://github.com/aisolab/nlp_implementation.


### NSMC Acc.

|            | multilingual BERT | KorBERT | KoBERT | KR-BERT character WordPiece | KR-BERT character BidirectionalWordPiece | KR-BERT sub-character WordPiece | KR-BERT sub-character BidirectionalWordPiece |
| ---------- | ----------------- | ------- | ------ | --------------------------- | ---------------------------------------- | ------------------------------- | -------------------------------------------- |
| pytorch    | -     | 89.84 | 89.01 | 89.34 | 89.38 | 89.20 | 89.34 |
| tensorflow | 87.08 | 85.94 | n/a   | 89.86 | 90.10 | 89.76 | 89.86 |

## Citation

If you use these models, please cite the following paper:

```bibtex
@article{lee2020krbert,
    title={KR-BERT: A Small-Scale Korean-Specific Language Model},
    author={Sangah Lee and Hansol Jang and Yunmee Baik and Suzi Park and Hyopil Shin},
    year={2020},
    journal={ArXiv},
    volume={abs/2008.03979}
}
```

## Contacts

[email protected]