Question about the PR [update additional_special_tokens (#8)]

#10
by Qingyun - opened

This PR added additional_special_tokens, which seems to result in a mismatch between the tokenizer length and the vocabulary size in my transformers==4.31.0 environment.

  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|action_start|>",
    "<|action_end|>",
    "<|interpreter|>",
    "<|plugin|>"
  ],
ipdb> tokenizer
InternLM2Tokenizer(name_or_path='internlm/internlm2-chat-7b', vocab_size=92544, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|action_start|>', '<|action_end|>', '<|interpreter|>', '<|plugin|>']}, clean_up_tokenization_spaces=False)
ipdb> len(tokenizer)
92550

It seems that the additional special tokens are assigned new ids beyond vocab_size, which mismatches the input_embeddings. But this PR seems to resolve the bug in 4.33.2, as described in this issue.
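To make the mismatch concrete, here is a minimal toy sketch (plain Python, not the real InternLM2Tokenizer or transformers code; the class and variable names are hypothetical) of why added special tokens push len(tokenizer) past the embedding matrix, and the usual resize fix:

```python
# Toy model of the bug reported above: the base vocab has 92544 entries,
# but adding 6 special tokens gives them ids 92544..92549, so any lookup
# of those ids would index past the end of the input_embeddings matrix.

class ToyTokenizer:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size   # size baked into the base vocab file
        self.added_tokens = []         # special tokens appended afterwards

    def add_special_tokens(self, tokens):
        # New tokens receive ids vocab_size, vocab_size + 1, ...
        self.added_tokens.extend(tokens)

    def __len__(self):
        return self.vocab_size + len(self.added_tokens)


tok = ToyTokenizer(vocab_size=92544)
tok.add_special_tokens(["<|im_start|>", "<|im_end|>", "<|action_start|>",
                        "<|action_end|>", "<|interpreter|>", "<|plugin|>"])

embedding_rows = 92544          # rows in the input_embeddings shipped with the model
assert len(tok) == 92550        # 6 ids now lie past the end of the embedding

# In transformers, the standard workaround is to grow the embedding to match:
#     model.resize_token_embeddings(len(tokenizer))
embedding_rows = len(tok)       # after the resize, 92544 -> 92550
assert embedding_rows == len(tok)
```

On older transformers versions (like the 4.31.0 in the report), calling `model.resize_token_embeddings(len(tokenizer))` after loading is the common way to avoid out-of-range ids until upgrading to a version where the behavior is fixed.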

Qingyun changed discussion status to closed
