GPT2 - Persian

Scripts

Normalizer

from src.normalizer import normalize

input_text = "ὑ蕉Ұ제ṅ尘̲改座◦花芝秀黄天자埃澤ಿ ˈazbab اینجا ایران خانه‌شما است؟!۱۲۳۱۲۳۱۳۱۲ اَلْحُرُوفُ ٱلْعَرَبِیَّة"
print(normalize(input_text))

Output:

azbab اینجا ایران خانه‌شما است ؟ ! 1231231312 الحروف لعربیه

Training tokenizer

python train_tokenizer.py --dataset_name oscar --dataset_config_name unshuffled_deduplicated_als --vocab_size 42000

Configuration

python create_config.py --name_or_path gpt2-medium --params '{"vocab_size": 42000}'
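
The command above passes overrides as a JSON string. The internals of create_config.py are not shown here, so the following is only a minimal sketch of one way such a `--params` argument could be merged into a base configuration. The `BASE_CONFIGS` table and the `build_config` helper are hypothetical names introduced for illustration; the base values are the published GPT-2 medium hyperparameters.

```python
import argparse
import json

# Stand-in for loading a pretrained config (assumption: the real script
# presumably fetches this from the model hub rather than a local table).
BASE_CONFIGS = {
    "gpt2-medium": {"vocab_size": 50257, "n_embd": 1024, "n_layer": 24, "n_head": 16},
}


def build_config(argv):
    """Parse CLI-style arguments and apply JSON overrides to the base config."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--name_or_path", required=True)
    parser.add_argument("--params", default="{}")  # JSON object of overrides
    args = parser.parse_args(argv)

    config = dict(BASE_CONFIGS[args.name_or_path])  # copy, do not mutate the base
    config.update(json.loads(args.params))          # overrides win over base values
    return config


config = build_config(
    ["--name_or_path", "gpt2-medium", "--params", '{"vocab_size": 42000}']
)
print(config["vocab_size"])
```

Overriding `vocab_size` keeps the model's embedding table consistent with the tokenizer trained with `--vocab_size 42000`.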

Normalization steps

Steps:

Remove stretched words such as ســــــــــلام
Remove links and user mentions (such as @jane_doe)
Remove Telegram and Instagram advertisements or posts (drop the whole record)
Remove advertisement records
Remove isolated words (or the whole record) that show up as individual records but are really just the tags at the end of a post (such as بلاب ... بلاب ... ورزشی، خبری، سیاسی، اجتماعی، خانوده)
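
A few of the steps above can be sketched with the standard library alone. This is an illustrative approximation, not the code in src/normalizer: `clean_record`, the regular expressions, and the `AD_MARKERS` list are all assumptions made for the example. It interprets "stretched words" as words elongated with the Arabic tatweel character (U+0640) and un-stretches them rather than dropping them. It does not cover stretching by repeated letters or the tag-only records from the last step.

```python
import re

TATWEEL = "\u0640"  # Arabic kashida, the character used to stretch words
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

# Hypothetical markers for Telegram/Instagram promotion; a real list
# would need to be tuned on the actual corpus.
AD_MARKERS = ("t.me/", "instagram.com/")


def clean_record(text):
    """Return the cleaned text, or None when the whole record should be dropped."""
    lowered = text.lower()
    if any(marker in lowered for marker in AD_MARKERS):
        return None  # advertisement-like record: drop it entirely

    text = text.replace(TATWEEL, "")         # un-stretch words
    text = URL_RE.sub(" ", text)             # remove links
    text = MENTION_RE.sub(" ", text)         # remove user mentions
    text = re.sub(r"\s+", " ", text).strip() # tidy leftover whitespace
    return text or None                      # empty records are dropped too


stretched = "س" + TATWEEL * 10 + "لام @jane_doe https://example.com"
print(clean_record(stretched))
```

Returning None for dropped records lets a caller filter a dataset with a single pass, for example `[c for c in map(clean_record, records) if c is not None]`.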