Why len(processor.tokenizer) != model_vocab_size?
In 'llava-hf/llava-1.5-7b-hf',
len(processor.tokenizer) is 32002, while the model vocab size (model.language_model.model.embed_tokens) is 32064.
Why are the sizes different? Shouldn't they be the same?
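Here is a minimal way to reproduce the check (a sketch; I use get_input_embeddings() since the exact attribute path can vary between transformers versions):

```python
# Minimal repro: load the processor and model, then compare vocab sizes.
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

print(len(processor.tokenizer))                     # 32002
print(model.get_input_embeddings().num_embeddings)  # 32064
```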
Why should they be the same?
The tokenizer has the exact number of tokens we need, but the lm head needs to be padded to a multiple of the number of SMs on your machine for performance reasons.
The assumption that they should be equal is wrong.
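As a rough sketch of the padding arithmetic (64 is an assumed multiple here; it happens to take 32002 up to exactly 32064):

```python
import math

def pad_vocab(n_tokens: int, multiple: int = 64) -> int:
    # Round the vocab size up to the next multiple, as done for GPU efficiency.
    return math.ceil(n_tokens / multiple) * multiple

print(pad_vocab(32002))  # 32064, matching the embedding / lm_head size

# transformers exposes the same idea when resizing embeddings, e.g.:
# model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)
```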
I thought they should be the same because if the model generates token id 32060, the tokenizer can't decode it (there is no token id 32060 in the tokenizer).
Yeah, but the model would not generate these tokens because it was never trained to do so. It's common practice to reserve vocabulary slots for users when they finetune, etc.
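For example, something along these lines at finetune time (building on the processor/model from the snippet above; the token string is made up for illustration):

```python
# New tokens added at finetune time can reuse the already-padded rows.
new_tokens = ["<my_new_token>"]  # hypothetical token, for illustration
processor.tokenizer.add_tokens(new_tokens)

embedding_size = model.get_input_embeddings().num_embeddings
if len(processor.tokenizer) > embedding_size:
    # Only grow the embedding matrix if the padded slots are exhausted.
    model.resize_token_embeddings(len(processor.tokenizer))
```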