Could you please disclose the full list of training data for embedding supervised finetuning?
#32 opened by kwang2049
Although the paper mentions this in general terms:
"Dataset with annotated negatives: We have prepared retrieval datasets, such as MSMarco [Bajaj et al., 2016] and Natural Questions (NQ) [Kwiatkowski et al., 2019], in addition to multiple non-retrieval datasets like the Natural Language Inference (NLI) dataset [Bowman et al., 2015]."
could you please disclose the full list of dataset names? This is very important for research work that wants to use Jina or build on it. Thanks in advance.
hi @kwang2049 yes, we used:
- SNLI data from SimCSE (https://github.com/princeton-nlp/SimCSE#training), with 1 hard negative + random negatives.
- MS MARCO, NQ, Quora-QA, HotpotQA and FEVER, with mined hard negatives (a rough sketch of the mining step is right after this list).
- CC-News title-description pairs with random negatives: https://huggingface.co/datasets/cc_news
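Roughly, the mining step works like this. This is only a minimal sketch with an off-the-shelf encoder; the model name, the toy corpus, `top_k` and the cut-off at 15 are placeholders for illustration, not the exact pipeline we ran:

```python
# Minimal sketch of hard-negative mining: retrieve top-ranked passages for each
# training query and keep the ones that are NOT the annotated positive.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative encoder

corpus = ["passage 1 ...", "passage 2 ...", "passage 3 ..."]   # retrieval collection
queries = ["example query"]                                    # training queries
gold = {0: {0}}                                                # query index -> set of gold passage indices

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=20)
hard_negatives = []
for qid, q_hits in enumerate(hits):
    # high-ranking but non-gold passages serve as hard negatives
    negs = [h["corpus_id"] for h in q_hits if h["corpus_id"] not in gold[qid]]
    hard_negatives.append(negs[:15])
```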
Each row consists of 17 items: 1 anchor, 1 positive and 15 negatives.
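For concreteness, a single row and the way it feeds a contrastive (InfoNCE-style) loss could look like the following. The field names, dummy encoder and temperature are illustrative assumptions, not our exact training code:

```python
# Minimal sketch of the 17-item row format and an InfoNCE-style loss over
# 1 anchor, 1 positive and 15 negatives.
import torch
import torch.nn.functional as F

row = {
    "anchor": "how do jellyfish reproduce?",
    "positive": "Jellyfish reproduce both sexually and asexually ...",
    "negatives": [f"unrelated or hard-negative passage {i}" for i in range(15)],
}  # 1 + 1 + 15 = 17 items per row

def info_nce(encode, row, temperature=0.05):
    """Cross-entropy of the positive against the 15 negatives."""
    anchor = encode([row["anchor"]])                            # (1, d)
    candidates = encode([row["positive"]] + row["negatives"])   # (16, d)
    sims = F.cosine_similarity(anchor, candidates) / temperature  # (16,)
    labels = torch.zeros(1, dtype=torch.long)                   # positive sits at index 0
    return F.cross_entropy(sims.unsqueeze(0), labels)

# dummy encoder just to show the shapes; a real run would use the embedding model
encode = lambda texts: torch.randn(len(texts), 768)
loss = info_nce(encode, row)
```

The positive is always placed at index 0 of the candidate list, so the classification label for every row is simply 0.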
Thanks❤️!
kwang2049 changed discussion status to closed