Is there any way to increase the vocabulary of the tokenizer and use it to fine-tune the model on a new language?
Hi, I'm trying to fine-tune Mistral on my mother tongue, Tamil, but when I fine-tune it the output doesn't make any sense. I've come to understand that the tokenizer is not able to handle Tamil. So, is there any way to increase the vocab of the tokenizer?
Hey @Tejaswi006 ,
I just tried the base Mistral-Instruct model on some text from Wikipedia, and looking at the results it seems it doesn't understand the language much.
However, since it's able to generate text in the Tamil script, the tokenizer should ideally work as-is. I think the model needs more training on Tamil corpora rather than tokenizer modifications.
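As a quick sanity check (a minimal sketch, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint; substitute whichever model you are using), you can look at how the tokenizer splits a short Tamil sentence. A very high token count per word would suggest the vocabulary has poor Tamil coverage:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
text = "வணக்கம் உலகம்"  # "Hello, world" in Tamil
tokens = tokenizer.tokenize(text)
# Many byte-level fragments per word would indicate weak Tamil coverage
print(len(tokens), tokens)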
If you still want to add new tokens to the tokenizer, you should be able to do so as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; substitute the model you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
new_tokens = ["new_tok1", "my_new-tok2"]
num_added_toks = tokenizer.add_tokens(new_tokens)
print("We have added", num_added_toks, "tokens")
# Note: resize_token_embeddings expects the full size of the new vocabulary,
# i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
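Keep in mind that the embedding rows for the newly added tokens are freshly initialized, so they only become meaningful after further fine-tuning. A quick check (using the placeholder token names from the snippet above) confirms each added token now maps to a single ID at the end of the vocabulary:

print(tokenizer.convert_tokens_to_ids("new_tok1"))  # single ID near len(tokenizer) - 1
print(tokenizer.tokenize("my_new-tok2"))            # ['my_new-tok2'] (no longer split)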
Thanks, I will look into it. Was the method you used instruct fine-tuning?
I was trying to train the model on the Amharic language, but it generates text which doesn't make any sense.