Why is config.vocab_size != tokenizer.vocab_size?
#18 opened by Qubitium
@abhi-db @hanlintang @srowen
Why is there such a large discrepancy between the model's config.vocab_size (100352) and the actual tokenizer.vocab_size (100277)? This seems very strange.
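For reference, a minimal sketch of how the two values can be compared with transformers; the model id below is a placeholder, not one stated in this thread:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "model-id"  # placeholder for the checkpoint this thread is about

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.vocab_size)     # embedding/output size stored in the model config
print(tokenizer.vocab_size)  # number of tokens the tokenizer actually defines
```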
The model vocab size is padded to a larger value to 1) improve matmul efficiency and 2) leave space for extra tokens in case folks would like to finetune the model with special token ids.
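As a rough illustration of the padding arithmetic (assuming the pad target is the next multiple of 128, which is not stated above but is consistent with the two numbers): rounding 100277 up to a multiple of 128 gives exactly 100352.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab_size(100277))  # 100352, matching config.vocab_size
```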
abhi-db changed discussion status to closed