 parameters: Namespace(corpus_dir='../datasets/online_novel//data.txt', output_dir='../models/baby-chinese-llama2', model_type='bpe', max_sentence_length=4096, vocab_size=32000, max_lines=1000000, shuffle_lines=True, pad_id=3, normalization_rule_name='identity', character_coverage=0.9995, action='export')
 
trainer_interface.cc(428) LOG(INFO) Normalizing sentences...
trainer_interface.cc(537) LOG(INFO) all chars count=796343461
trainer_interface.cc(548) LOG(INFO) Done: 99.95% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=5013
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.9995
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 800000 sentences.
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 800000
trainer_interface.cc(608) LOG(INFO) Done! 1021909
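
The parameters line above looks like an argparse Namespace passed to a SentencePiece training script. The following is a minimal sketch of how those values might map onto `sentencepiece.SentencePieceTrainer.train`, not the script that produced this log. The helper names `build_trainer_kwargs` and `train` are hypothetical, and mapping `output_dir` to `model_prefix`, `max_lines` to `input_sentence_size`, and `shuffle_lines` to `shuffle_input_sentence` is an assumption.

```python
from argparse import Namespace


def build_trainer_kwargs(args: Namespace) -> dict:
    """Translate the logged Namespace into SentencePiece trainer options.

    The key names are real SentencePiece trainer flags; which Namespace
    field feeds which flag is an assumption made for illustration.
    """
    return {
        "input": args.corpus_dir,
        "model_prefix": args.output_dir,           # assumed mapping
        "model_type": args.model_type,
        "vocab_size": args.vocab_size,
        "max_sentence_length": args.max_sentence_length,
        "input_sentence_size": args.max_lines,     # assumed mapping
        "shuffle_input_sentence": args.shuffle_lines,  # assumed mapping
        "pad_id": args.pad_id,
        "normalization_rule_name": args.normalization_rule_name,
        "character_coverage": args.character_coverage,
    }


def train(args: Namespace) -> None:
    """Run training; sentencepiece is imported lazily so that the kwargs
    helper can be inspected without the package installed."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(args))


# Values copied from the parameters line of the log.
ARGS = Namespace(
    corpus_dir="../datasets/online_novel//data.txt",
    output_dir="../models/baby-chinese-llama2",
    model_type="bpe",
    max_sentence_length=4096,
    vocab_size=32000,
    max_lines=1000000,
    shuffle_lines=True,
    pad_id=3,
    normalization_rule_name="identity",
    character_coverage=0.9995,
    action="export",
)
```

Calling `train(ARGS)` would need the `sentencepiece` package and the corpus file at the logged path; `build_trainer_kwargs(ARGS)` can be inspected on its own.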

Raw corpus
Total lines: 2995508
Total tokens: 1827.13MB
Mean: 610, Median: 606.0,
5th percentile: 546.0,
25th percentile: 580.0,
75th percentile: 636.0,
95th percentile: 686.0,
99th percentile: 722.0,
max: 2657
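
A summary like the one above could be reproduced along the following lines. This is a sketch under assumptions: the log does not say what is being measured, so the input here is simply a list of per-line lengths, and the percentile uses linear interpolation between neighbouring ranks, which may differ from what the original script used. The names `percentile` and `summarize` are hypothetical.

```python
import statistics


def percentile(sorted_vals, q):
    """Return the q-th percentile (0-100) of an already sorted list,
    interpolating linearly between the two nearest ranks."""
    if not sorted_vals:
        raise ValueError("percentile of empty data")
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def summarize(lengths):
    """Compute the same fields as the log: count, mean, median,
    selected percentiles, and max."""
    vals = sorted(lengths)
    summary = {
        "lines": len(vals),
        "mean": statistics.fmean(vals),
        "median": statistics.median(vals),
        "max": vals[-1],
    }
    for q in (5, 25, 75, 95, 99):
        summary[f"p{q}"] = percentile(vals, q)
    return summary


if __name__ == "__main__":
    # Toy input; real use would pass one length per corpus line.
    for key, value in summarize([546, 580, 606, 636, 686]).items():
        print(f"{key}: {value}")
```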