Multi-Lingual?
#10, opened by dejanseo
The tokenizer suggests a multilingual vocabulary. It would be interesting to hear more details about how much of your training data was non-English, and whether the tokenizer is simply identical to the original Mistral one. I will put it to the test soon on a large multilingual website to find related pages for internal link recommendations.
I did test it, but honestly I can't tell the difference in embedding quality between NV-Embed-v1 and LaBSE. In fact, I think LaBSE is a little better at similarity mapping.
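For anyone who wants to try the same kind of comparison, here is a minimal sketch of the related-pages step, assuming each page has already been embedded by whichever model is under test. The URLs and vectors are made-up toy values, not output from NV-Embed-v1 or LaBSE, and `related_pages` is a hypothetical helper name, not part of either model's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def related_pages(embeddings, url, k=3):
    """Return the k pages most similar to `url` as (url, score) pairs."""
    query = embeddings[url]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != url  # never recommend a page to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (assumption: real ones would come from
# the model being evaluated and have far more dimensions).
toy = {
    "/en/pricing": [0.9, 0.1, 0.0],
    "/de/preise": [0.8, 0.2, 0.1],
    "/en/blog/seo-tips": [0.1, 0.9, 0.2],
    "/fr/blog/conseils-seo": [0.2, 0.8, 0.3],
}

top = related_pages(toy, "/en/pricing", k=2)
for page, score in top:
    print(f"{page}\t{score:.3f}")
```

Running the same function over two sets of embeddings for the same pages, one per model, and comparing the ranked lists by hand is one rough way to judge which model maps cross-language similarity better.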