Tokenizer and Model Config Mismatch

#10
by keremturgutlu - opened

The config and the tokenizer have different special token ids, which can be a problem for fine-tuning.

from transformers import AutoConfig, AutoTokenizer

pretrained_config = AutoConfig.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

(pretrained_config.eos_token_id, tokenizer.eos_token_id, 
pretrained_config.bos_token_id, tokenizer.bos_token_id)
>>
(2, 11, 1, None)
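
For anyone who wants to run this check on their own checkpoint, here is a minimal sketch of a comparison helper. The function name `special_token_mismatches` is made up for illustration, and the `SimpleNamespace` objects are only stand-ins carrying the four values reported above, so the snippet runs without downloading anything. It should work the same way on a real config and tokenizer, since it only reads attributes.

```python
from types import SimpleNamespace


def special_token_mismatches(config, tokenizer, attrs=("bos_token_id", "eos_token_id")):
    """Return {attr: (config_value, tokenizer_value)} for each attr that differs.

    Hypothetical helper: it only reads attributes, so any pair of objects
    exposing these names (e.g. a loaded config and tokenizer) can be passed.
    """
    mismatches = {}
    for attr in attrs:
        config_value = getattr(config, attr, None)
        tokenizer_value = getattr(tokenizer, attr, None)
        if config_value != tokenizer_value:
            mismatches[attr] = (config_value, tokenizer_value)
    return mismatches


# Stand-ins holding the values reported in this post (not loaded from the Hub).
config = SimpleNamespace(bos_token_id=1, eos_token_id=2)
tokenizer = SimpleNamespace(bos_token_id=None, eos_token_id=11)

print(special_token_mismatches(config, tokenizer))
# {'bos_token_id': (1, None), 'eos_token_id': (2, 11)}
```

An empty dict would mean the two sources agree on the ids that were checked.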

Yes, this is really ridiculous.

I agree, and I don't actually understand which one we are supposed to use.

@tiiuae Please avoid uploading a model with the wrong tokenizer; this will mislead a lot of people.

FalconLLM changed discussion status to closed

@FalconLLM Please fix the issue, or at least post an explanation; otherwise your behaviour might go against the Hugging Face community rules.
Users might get confused by the model you uploaded, and that is not good for you either.

@lucasjin they fixed config.json

  "bos_token_id": 11,
  "eos_token_id": 11,

@dimaischenko OK, but this still confuses me. Why is bos 11? Very strange.

And bos is the same as eos... Very strange...
