No special tokens?
I noticed that <|prompt|> is not added as a special token. Was that an error in training, or is it intentional?
What are the advantages of this model compared to the official OpenAssistant 7B Falcon variant? I'm not familiar with the differences between the datasets; this one appears to use one of the early variants?
Yes, these are not trained as additional tokens, but rather as text that gets tokenized into multiple tokens.
With a larger training corpus, it may be better to actually train new tokens and we do support this option in H2O LLM Studio. For the OASST dataset, just using words was usually superior.
The way regex-based BPE works, it creates a ton of tokens out of the "special token" if it's not pre-tokenized:
```
39    -> '<'
103   -> '|'
18269 -> 'prom'
444   -> 'pt'
54146 -> '|>'
11    -> '<|endoftext|>'
39    -> '<'
103   -> '|'
46617 -> 'answer'
54146 -> '|>'
```
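The fragmentation can be reproduced with a toy tokenizer. The vocabulary below reuses the ids from the dump above, and greedy longest-match is a simplification of real BPE merging, but the mechanic is the same: a string that exists in the vocabulary as a registered special token maps to one id, while the same string left unregistered shatters into whatever sub-pieces happen to exist (the id 65024 for a registered `<|prompt|>` is hypothetical).

```python
# Toy longest-match tokenizer; ids taken from the dump above.
VOCAB = {
    "<": 39,
    "|": 103,
    "prom": 18269,
    "pt": 444,
    "|>": 54146,
    "answer": 46617,
    "<|endoftext|>": 11,  # a genuinely registered special token
}

def tokenize(text: str, vocab: dict) -> list:
    """Greedily match the longest vocab entry at each position."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocab piece matches at position {i}")
    return ids

# Unregistered, "<|prompt|>" fragments into 5 pieces, "<|answer|>" into 4:
print(tokenize("<|prompt|>", VOCAB))  # [39, 103, 18269, 444, 54146]
print(tokenize("<|answer|>", VOCAB))  # [39, 103, 46617, 54146]

# Registered as a special token (65024 is a made-up new id),
# the same string maps to a single id instead:
print(tokenize("<|prompt|>", {**VOCAB, "<|prompt|>": 65024}))  # [65024]
```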
Using "<|XXX|>" could be harmful to the model: it has never seen these tokens following each other anywhere in its regular training. It makes sense to use that syntax when creating an actual special token, because it's the style TII chose, but when not registering them as tokens I'd recommend using a normal word instead.
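Concretely, the atomicity of a registered special token comes from pre-tokenization: the input is split on the registered markers before BPE ever runs, so only the remaining plain-text chunks get merged into sub-word pieces. A minimal sketch of that splitting step (illustrative only, not the actual internals of any tokenizer library):

```python
import re

def split_on_specials(text, special_tokens):
    # Alternation of registered special tokens, longest first so that
    # overlapping markers match greedily; the capturing group makes
    # re.split keep the delimiters in the output.
    pattern = "(" + "|".join(
        re.escape(t) for t in sorted(special_tokens, key=len, reverse=True)
    ) + ")"
    return [part for part in re.split(pattern, text) if part]

# "<|endoftext|>" is registered; "<|prompt|>" is not, so it stays inside
# a plain-text chunk and is fragmented by the regular BPE pass afterwards:
print(split_on_specials("<|prompt|>Hi<|endoftext|>", ["<|endoftext|>"]))
# ['<|prompt|>Hi', '<|endoftext|>']
```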
You are more than welcome to test out normal words instead. The full training config is public in this repository. We saw that using special "words" was slightly better in our evals. I guess the model learns to react differently if the word has never been seen before.
Note, though, that <|prompt|> is 5 tokens while <|answer|> is 4.