eos_token clarification
I found that the `chat_template` in `tokenizer_config.json` is a standard ChatML template, but when I checked the model, it seems to use `<|endoftext|>` as the eos_token in some cases.
Inference code:

```python
# tokenizer and text_model are assumed to be loaded beforehand
# (e.g. via AutoTokenizer / AutoModelForCausalLM).
messages = [
    {"role": "user", "content": "你好"}  # "你好" = "Hello"
]
# Build the ChatML prompt from the tokenizer's chat template.
input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, return_tensors='pt')
# Stop generation on the tokenizer's eos_token (<|im_end|>).
output_ids = text_model.generate(input_ids.to('cuda'), eos_token_id=tokenizer.eos_token_id, max_length=256)
response = tokenizer.decode(output_ids[0], skip_special_tokens=False)
# response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Result:

```
<|im_start|>user
你好<|im_end|>
<|im_start|>assistant
你好!有什么我可以帮助你的吗?<|endoftext|>你好!有什么我可以帮助你的吗?
如果你有任何问题或需要信息,请随时告诉我!我在这里帮助你。<|im_end|>
```

Note that `<|endoftext|>` appears in the middle of the response and generation continues past it, repeating the greeting (roughly, "Hello! Is there anything I can help you with?").
But for the prompt "hi", the output seems normal.
Could you kindly check this problem?
Same problem.
Hello, any update on this?
Hi 👋 We tried to reproduce the issue, but we didn't encounter any errors during inference. Could you please try again using this inference code: https://github.com/01-ai/Yi/blob/main/Cookbook/en/opensource/Inference/Inference_using_transformers.ipynb
Without running the model, you can also see the issue is a misconfiguration:
- The `tokenizer_config.json` `chat_template` and `eos_token` define it as `<|im_end|>`.
- But the model's `config.json` defines `eos_token_id=2`, which maps to `<|endoftext|>`.
- The tokenizer config defines token ID 2 as `<|endoftext|>`.

There's a mismatch between the model's `config.json` `eos_token_id` and the `tokenizer_config.json` chat template and `eos_token`.
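For reference, a minimal sketch of how to check this yourself. The model id below is a placeholder (substitute the checkpoint you are actually using), and the printed ids may differ by checkpoint:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "01-ai/Yi-1.5-9B-Chat"  # placeholder -- use your actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# EOS according to tokenizer_config.json, e.g. <|im_end|>
print(tokenizer.eos_token, tokenizer.eos_token_id)

# EOS according to config.json, e.g. 2 -> <|endoftext|>
print(config.eos_token_id, tokenizer.convert_ids_to_tokens(config.eos_token_id))
```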
My guess for why @Starlento's example returns inconsistent responses: `eos_token_id=tokenizer.eos_token_id` is passed, so generation only stops on `<|im_end|>`, while the model sometimes emits `<|endoftext|>` and keeps generating past it (as seen in the output above). Passing the tokenizer's `eos_token_id` should work, though, and the fact that it doesn't behave consistently points to the misconfiguration between `config.json` and `tokenizer_config.json`.
```python
output_ids = text_model.generate(input_ids.to('cuda'), eos_token_id=tokenizer.eos_token_id, max_length=256)
```
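As a possible workaround until the configs are aligned (just a sketch, not an official fix): `generate` accepts a list of stop token ids, so you can ask it to stop on either token:

```python
# Stop on either <|im_end|> (tokenizer_config.json) or <|endoftext|> (config.json).
stop_ids = [
    tokenizer.convert_tokens_to_ids("<|im_end|>"),
    tokenizer.convert_tokens_to_ids("<|endoftext|>"),
]
output_ids = text_model.generate(input_ids.to('cuda'), eos_token_id=stop_ids, max_length=256)
```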