Is there any way to increase the vocabulary of the tokenizer and use it to fine-tune the model on a new language?
Hi, I'm trying to fine-tune Mistral on my mother tongue, Tamil, but when I fine-tune it the output doesn't make any sense. I've come to understand that the tokenizer is not able to handle Tamil. So, is there any way to increase the vocab of the tokenizer?
Hey @Tejaswi006 ,
I just tried the base Mistral-Instruct model on some text from Wikipedia, and looking at the results it seems it doesn't understand the language much.
However, since it's able to generate text in the Tamil script, the tokenizer should ideally work as-is. I think the model needs more training on Tamil corpora rather than tokenizer modifications.
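As a quick sanity check (a minimal sketch, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint; substitute whichever model you are using), you can look at how the tokenizer splits a short Tamil sentence. A very high token count per word would suggest the vocabulary has poor Tamil coverage:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
text = "வணக்கம் உலகம்"  # "Hello, world" in Tamil
tokens = tokenizer.tokenize(text)
# Many byte-level fragments per word would indicate weak Tamil coverage
print(len(tokens), tokens)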
If you still want to add new tokens to the tokenizer, you should be able to do so as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; substitute the model you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
new_tokens = ["new_tok1", "my_new-tok2"]
num_added_toks = tokenizer.add_tokens(new_tokens)
print("We have added", num_added_toks, "tokens")
# Note: resize_token_embeddings expects the full size of the new vocabulary,
# i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
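Keep in mind that the embedding rows for the newly added tokens are freshly initialized, so they only become meaningful after further fine-tuning. A quick check (using the placeholder token names from the snippet above) confirms each added token now maps to a single ID at the end of the vocabulary:

print(tokenizer.convert_tokens_to_ids("new_tok1"))  # single ID near len(tokenizer) - 1
print(tokenizer.tokenize("my_new-tok2"))            # ['my_new-tok2'] (no longer split)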
Thanks, I will look into it. Was the method you used instruct fine-tuning?
I was trying to train the model on the Amharic language, but it generates text which doesn't make any sense.