Why len(processor.tokenizer) != model_vocab_size?
In 'llava-hf/llava-1.5-7b-hf',
len(processor.tokenizer) is 32002, while the model vocab size (model.language_model.model.embed_tokens) is 32064.
Why are the sizes different? Shouldn't they be the same?
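Here is a minimal way to reproduce the check (a sketch; I use get_input_embeddings() since the exact attribute path can vary between transformers versions):

```python
# Minimal repro: load the processor and model, then compare vocab sizes.
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

print(len(processor.tokenizer))                     # 32002
print(model.get_input_embeddings().num_embeddings)  # 32064
```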
Why should they be the same?
The tokenizer has the exact number of tokens we need, but the lm head needs to be padded to a multiple of the number of SMs on your machine for performance reasons.
The assumption that they should be equal is wrong.
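As a rough sketch of the padding arithmetic (64 is an assumed multiple here; it happens to take 32002 up to exactly 32064):

```python
import math

def pad_vocab(n_tokens: int, multiple: int = 64) -> int:
    # Round the vocab size up to the next multiple, as done for GPU efficiency.
    return math.ceil(n_tokens / multiple) * multiple

print(pad_vocab(32002))  # 32064, matching the embedding / lm_head size

# transformers exposes the same idea when resizing embeddings, e.g.:
# model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)
```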
I thought they should be the same because if the model generates token id 32060, the tokenizer can't decode it (there is no token id 32060 in the tokenizer).
Yeah, but the model would not generate these tokens because it was never trained to do so. It's common practice to reserve vocabulary slots for users when they finetune, etc.
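For example, something along these lines at finetune time (building on the processor/model from the snippet above; the token string is made up for illustration):

```python
# New tokens added at finetune time can reuse the already-padded rows.
new_tokens = ["<my_new_token>"]  # hypothetical token, for illustration
processor.tokenizer.add_tokens(new_tokens)

embedding_size = model.get_input_embeddings().num_embeddings
if len(processor.tokenizer) > embedding_size:
    # Only grow the embedding matrix if the padded slots are exhausted.
    model.resize_token_embeddings(len(processor.tokenizer))
```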