flax-community/gpt2-medium-persian


m3hrdadfi committed on Jul 8, 2021

Commit 74e88fc • 1 Parent(s): ec2c00e

Add normalization steps

Files changed (1)

README.md +7 -0


@@ -29,5 +29,12 @@ python train_tokenizer.py --dataset_name oscar --dataset_config_name unshuffled_
 python create_config.py --name_or_path gpt2-medium --params '{"vocab_size": 42000}'
 ```

+### Normalization steps
+Steps:
+- [ ] Remove stretched words such as ســــــــــلام
+- [ ] Remove links and user mentions (such as @jane_doe)
+- [ ] Remove Telegram and Instagram advertisements or posts (drop the whole record)
+- [ ] Remove advertisement records
+- [ ] Remove isolated words (or the whole record) that show up as an individual record when they are just the tags at the end of a post (such as بلاب ... بلاب ... ورزشی، خبری، سیاسی، اجتماعی، خانوده)
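
The checklist above could be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual normalizer. It assumes that stretched words are produced with the Arabic tatweel character (U+0640), that advertisements can be spotted by a small marker list, and that a tag-only record is a short comma-separated run of one- or two-word items. The names `normalize_text`, `is_ad_record`, `is_tag_only_record`, `clean_records`, and `AD_MARKERS` are hypothetical.

```python
import re

TATWEEL = "\u0640"  # Arabic kashida, assumed to be what stretches words
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # skips e-mail-like "a@b"
# Hypothetical markers for Telegram/Instagram promotion; extend as needed.
AD_MARKERS = ("t.me/", "instagram.com/")


def normalize_text(text):
    """Drop tatweel stretching, links, and user mentions from one record."""
    text = text.replace(TATWEEL, "")
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_ad_record(text):
    """Heuristic: treat a record containing any ad marker as an advertisement."""
    lowered = text.lower()
    return any(marker in lowered for marker in AD_MARKERS)


def is_tag_only_record(text, min_tags=3, max_words_per_tag=2):
    """Heuristic: a record made only of short comma-separated tags."""
    parts = [p.strip() for p in re.split(r"[،,]", text) if p.strip()]
    return len(parts) >= min_tags and all(
        len(p.split()) <= max_words_per_tag for p in parts
    )


def clean_records(records):
    """Apply the steps in order and drop records that end up empty."""
    cleaned = []
    for record in records:
        if is_ad_record(record):  # checked before links are stripped
            continue
        text = normalize_text(record)
        if text and not is_tag_only_record(text):
            cleaned.append(text)
    return cleaned
```

Advertisement detection runs on the raw record because the marker list relies on link fragments that `normalize_text` would otherwise remove. Stretching by repeated letters rather than tatweel is not handled in this sketch.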