Indonesian Sentence Embedding
Indonesian sentence embedding models trained with supervised and unsupervised techniques. https://github.com/lazarusnlp/indonesian-sentence-embeddings/
Viewer • Note: Machine-translated STS-B, translated using the Google Translate API.
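This machine-translated STS-B set is the benchmark the models below are evaluated on. As a rough sketch of how such an evaluation can be run with the sentence-transformers library; the dataset id and column names here are placeholders, not confirmed identifiers, so substitute the actual ones from the collection:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Hypothetical dataset id and column names for the machine-translated STS-B;
# replace them with the actual identifiers from the collection.
sts = load_dataset("LazarusNLP/stsb_mt_id", split="test")

model = SentenceTransformer("LazarusNLP/all-indo-e5-small-v4")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sts["text_1"],
    sentences2=sts["text_2"],
    scores=[s / 5.0 for s in sts["score"]],  # STS-B gold scores are 0-5; rescale to 0-1
    name="stsb-mt-id",
)
print(evaluator(model))  # correlation of embedding similarity vs. gold scores
```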
LazarusNLP/all-indo-e5-small-v4
Sentence Similarity • Note: Our current best model for Indonesian sentence embeddings: `intfloat/multilingual-e5-small` fine-tuned on all available supervised Indonesian datasets (v4).
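For reference, a minimal usage sketch with the sentence-transformers library, loading this model and computing cosine similarities between a few placeholder sentences:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("LazarusNLP/all-indo-e5-small-v4")

sentences = [
    "Seorang pria sedang memotong roti.",    # "A man is cutting bread."
    "Seorang pria mengiris sepotong roti.",  # "A man is slicing a piece of bread."
    "Seekor kucing tidur di sofa.",          # "A cat is sleeping on the sofa."
]

# Normalized embeddings so that the dot product equals cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```

Depending on how the E5 base model was fine-tuned, the `query: `/`passage: ` prefixes used by the original `intfloat/multilingual-e5-small` may or may not be needed; check the model card.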
LazarusNLP/all-indo-e5-small-v3
Sentence Similarity • Note: `intfloat/multilingual-e5-small` fine-tuned on all available supervised Indonesian datasets (v3).
LazarusNLP/all-indo-e5-small-v2
Sentence Similarity • Note: `intfloat/multilingual-e5-small` fine-tuned on all available supervised Indonesian datasets (v2). Similar performance to the model above.
LazarusNLP/all-nusabert-base-v4
Sentence Similarity • Note: `LazarusNLP/NusaBERT-base` fine-tuned on all available supervised Indonesian datasets (v4). A significant improvement over its `LazarusNLP/all-indobert-base-v4` counterpart.
LazarusNLP/all-nusabert-large-v4
Sentence Similarity • Note: `LazarusNLP/NusaBERT-large` fine-tuned on all available supervised Indonesian datasets (v4).
LazarusNLP/all-indobert-base-v4
Sentence Similarity • Note: `indobenchmark/indobert-base-p1` fine-tuned on all available supervised Indonesian datasets (v4).
LazarusNLP/all-indobert-base-v2
Sentence Similarity • Note: `indobenchmark/indobert-base-p1` fine-tuned on all available supervised Indonesian datasets (v2).
LazarusNLP/all-indobert-base
Sentence Similarity • Note: Same as above, except with v1 of all supervised Indonesian datasets.
LazarusNLP/simcse-indobert-base
Sentence Similarity • Note: `indobenchmark/indobert-base-p1` fine-tuned using unsupervised SimCSE on Wikipedia texts. This model served as the initial baseline for the other unsupervised training setups.
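A minimal sketch of unsupervised SimCSE training with sentence-transformers, assuming a list of unlabeled Wikipedia sentences: each sentence is paired with itself, so dropout inside the encoder provides the two views and in-batch examples act as negatives. The sentences and hyperparameters below are illustrative, not the ones used for this model.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base encoder; mean pooling is added automatically.
model = SentenceTransformer("indobenchmark/indobert-base-p1")

# Unlabeled Wikipedia sentences (placeholders).
wiki_sentences = [
    "Jakarta adalah ibu kota Indonesia.",
    "Komodo adalah kadal terbesar di dunia.",
    # ...
]

# Unsupervised SimCSE: each sentence is its own positive pair; dropout noise
# produces two different embeddings of the same sentence.
train_examples = [InputExample(texts=[s, s]) for s in wiki_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    show_progress_bar=True,
)
```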
LazarusNLP/congen-indobert-base
Sentence Similarity • Note: `indobenchmark/indobert-base-p1` fine-tuned using unsupervised ConGen on Wikipedia texts, with `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` as the teacher model for distillation. An improvement over the SimCSE model above.
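ConGen's full objective is more involved (it adds an instance queue and consistency terms), but the basic teacher-student setup can be sketched with plain embedding distillation in sentence-transformers, where the student regresses onto precomputed teacher embeddings via MSE. This is a simplified illustration under those assumptions, not the actual ConGen loss.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

teacher = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
student = SentenceTransformer("indobenchmark/indobert-base-p1")  # mean pooling added automatically

# Unlabeled Wikipedia sentences (placeholders).
wiki_sentences = [
    "Borobudur terletak di Jawa Tengah.",
    "Bahasa Indonesia adalah bahasa resmi Indonesia.",
    # ...
]

# Precompute teacher embeddings; both encoders output 768-dimensional vectors here,
# so the student can be trained to match the teacher directly.
teacher_embeddings = teacher.encode(wiki_sentences, convert_to_numpy=True)
train_examples = [
    InputExample(texts=[s], label=emb)
    for s, emb in zip(wiki_sentences, teacher_embeddings)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)

student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```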
LazarusNLP/congen-indobert-lite-base
Sentence Similarity • Note: Same setup as above, except with `indobenchmark/indobert-lite-base-p1` as the student model. Achieves surprisingly decent performance despite its small size (11M parameters, versus 127M above).
LazarusNLP/congen-simcse-indobert-base
Sentence Similarity • Note: Further applies ConGen to `LazarusNLP/simcse-indobert-base`, again using `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` as the teacher model for distillation. Only slightly improves on the student model's initial results.
LazarusNLP/congen-indo-e5-small
Sentence Similarity • Note: `intfloat/multilingual-e5-small` fine-tuned using unsupervised ConGen on Wikipedia texts, with `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` as the teacher model for distillation. Since the student model already outperforms the teacher on certain tasks, this method slightly degrades its initial performance.
LazarusNLP/sct-indobert-base
Sentence Similarity • Note: `indobenchmark/indobert-base-p1` fine-tuned using unsupervised SCT on Wikipedia texts, with `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` as the teacher model for distillation. Performs worse than all ConGen setups thus far; further experiments are necessary.