Issue with the tokenizer for French texts?
I have observed some strange behavior from the tokenizer when dealing with French texts. In particular, unlike previous models, it seems to consistently remove the space before "!" or "?", e.g.
tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))
becomes
"Ah? Eh bien!"
(i.e. it defaults to English punctuation rules, which differ from the French ones). I understand this might seem unimportant to some, but it does matter for my use case.
I can work around this by adding two spaces instead of one (see the sketch at the end of this post), but that does not feel like an elegant solution. Is there something I'm missing or doing wrong?
For context, I am using the transformers library (with the aim of fine-tuning the model).
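For reference, here is a rough sketch of the double-space hack. My assumption is that one space gets stripped during decoding, so doubling it leaves one behind:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-Nemo-Instruct-2407')
text = "Ah ? Eh bien !"
# Double each space before "?" and "!" so that one survives the decode step.
padded = text.replace(" ?", "  ?").replace(" !", "  !")
tokenizer.decode(tokenizer.encode(padded))
# <s>Ah ? Eh bien !  (one of the two spaces is removed, the other remains)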
Thanks! Should be fixed by https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/discussions/13.
Can confirm this fixes it :)
from transformers import AutoTokenizer

# Load the patched tokenizer from the open pull request (revision 'refs/pr/2').
tokenizer = AutoTokenizer.from_pretrained('Xenova/Mistral-Nemo-Instruct-Tokenizer', revision='refs/pr/2')
tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))
# <s>Ah ? Eh bien !  (the spaces before "?" and "!" are preserved)
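As a stopgap for the unpatched tokenizer, it also seems possible to skip the decode-time cleanup that collapses spaces before punctuation, assuming that cleanup is indeed the culprit here:

from transformers import AutoTokenizer
# Keep the original tokenizer but disable the cleanup pass in decode().
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-Nemo-Instruct-2407')
tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"), clean_up_tokenization_spaces=False)
# <s>Ah ? Eh bien !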
Great, many thanks!
Will the tokenizer in the official repo change accordingly? In the meantime, I'll use yours :)
Just got merged! :) The fix is now in the official repo, so you can use it directly.