Tokenized Chat template seems weird
#12, opened by PierreLepagnol
I find it odd that the role markers in the chat template are split into several pieces when tokenized.
Consider the main example:
<|user|>
Which famous math number begins with 1.6 ...?<|endoftext|>
<|assistant|>
The number you are referring to is 1.618033988749895. This is the famous value known as the golden ratio<|endoftext|>
The markers <|user|> and <|assistant|> are tokenized as ['<', '|', 'user', '|', '>\n'] and ['<', '|', 'assistant', '|', '>\n'] respectively.
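To make the question concrete, here is a toy sketch of the difference I would expect between a marker that is registered as a special token and one that is not. This is not the model's real tokenizer: `toy_tokenize` and its `special_tokens` parameter are hypothetical names invented for illustration, and the splitting rule (words, whitespace runs, single punctuation characters) is an assumption that only roughly mimics subword pre-tokenization. A real BPE vocabulary may merge pieces differently, as with the '>\n' piece above.

```python
import re

def toy_tokenize(text, special_tokens=()):
    """Toy tokenizer (hypothetical, for illustration only).

    Registered special tokens are kept as single atomic units.
    Everything else is split into word runs, whitespace runs,
    and individual punctuation characters.
    """
    if special_tokens:
        # Match longer markers first so overlapping markers resolve predictably.
        ordered = sorted(special_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
        # The capturing group makes re.split keep the markers in the result.
        chunks = re.split(pattern, text)
    else:
        chunks = [text]

    tokens = []
    for chunk in chunks:
        if chunk in special_tokens:
            tokens.append(chunk)  # atomic: never split further
        else:
            tokens.extend(re.findall(r"\w+|\s+|[^\w\s]", chunk))
    return tokens

prompt = "<|user|>\nHi"

# Marker not registered: it falls apart into punctuation and word pieces.
plain = toy_tokenize(prompt)

# Marker registered as special: it survives as one token.
with_special = toy_tokenize(prompt, special_tokens=("<|user|>",))

print(plain)
print(with_special)
```

If the real tokenizer behaves like the first call, one possible explanation is that the markers are not registered as special or added tokens in its vocabulary, so they go through the ordinary subword rules like any other text.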
Is this expected behavior?