Multi-Lingual?

#10
by dejanseo - opened

The tokenizer suggests a multilingual vocabulary. It would be interesting to hear more details about how much of your training data was non-English, and whether the vocabulary is simply identical to the original Mistral one. I will put it to the test soon on a large multilingual website to find related pages for internal link recommendations.
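For anyone who wants to try the same kind of test, here is a minimal sketch of the related-pages step, assuming each page has already been embedded by whichever model is being evaluated. The URLs, the 3-dimensional toy vectors, and the helper names `cosine` and `related_pages` are all hypothetical and only illustrate the ranking; real embeddings would have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors to avoid division by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related_pages(embeddings, url, top_k=3):
    """Return the top_k (url, score) pairs most similar to the given page."""
    query = embeddings[url]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != url  # a page should not recommend itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real model output (assumed, not measured).
pages = {
    "/en/pricing": [0.9, 0.1, 0.0],
    "/de/preise": [0.8, 0.2, 0.1],
    "/en/blog/seo-tips": [0.1, 0.9, 0.2],
    "/fr/blog/conseils-seo": [0.2, 0.8, 0.3],
}

for candidate, score in related_pages(pages, "/en/pricing", top_k=2):
    print(f"{candidate}\t{score:.3f}")
```

If the model is genuinely multilingual, translated versions of the same page should rank close to each other, as the German pricing page does in this toy data. Running the same ranking over embeddings from two different models gives a rough side-by-side comparison.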

I would also like to hear some info about the multilingual and code capabilities. @dejanseo, have you got any updates yet?

I did test it, but honestly I can't tell the difference in embedding quality between NV-Embed-v1 and LaBSE. In fact, I think LaBSE is a little better at similarity mapping.
