Could you please disclose the full list of training data for embedding supervised finetuning?
#32 opened by kwang2049
Although the paper mentions this in general terms:
"Dataset with annotated negatives: We have prepared retrieval datasets, such as MSMarco [Bajaj et al., 2016] and Natural Questions (NQ) [Kwiatkowski et al., 2019], in addition to multiple non-retrieval datasets like the Natural Language Inference (NLI) dataset [Bowman et al., 2015]."
could you please disclose the full list of dataset names? This is very important for research work that wants to use Jina or build on it. Thanks in advance.
hi @kwang2049 yes, we used:
- SNLI data from SimCSE (https://github.com/princeton-nlp/SimCSE#training), with 1 hard negative + random negatives.
- MS MARCO, NQ, Quora-QA, HotpotQA and FEVER, with mined hard negatives (a rough sketch of the mining step is right after this list).
- CC-News title-description pairs with random negatives: https://huggingface.co/datasets/cc_news
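Roughly, the mining step works like this. This is only a minimal sketch with an off-the-shelf encoder; the model name, the toy corpus, `top_k` and the cut-off at 15 are placeholders for illustration, not the exact pipeline we ran:

```python
# Minimal sketch of hard-negative mining: retrieve top-ranked passages for each
# training query and keep the ones that are NOT the annotated positive.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative encoder

corpus = ["passage 1 ...", "passage 2 ...", "passage 3 ..."]   # retrieval collection
queries = ["example query"]                                    # training queries
gold = {0: {0}}                                                # query index -> set of gold passage indices

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=20)
hard_negatives = []
for qid, q_hits in enumerate(hits):
    # high-ranking but non-gold passages serve as hard negatives
    negs = [h["corpus_id"] for h in q_hits if h["corpus_id"] not in gold[qid]]
    hard_negatives.append(negs[:15])
```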
Each row consists of 17 items: 1 anchor, 1 positive and 15 negatives.
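For concreteness, a single row and the way it feeds a contrastive (InfoNCE-style) loss could look like the following. The field names, dummy encoder and temperature are illustrative assumptions, not our exact training code:

```python
# Minimal sketch of the 17-item row format and an InfoNCE-style loss over
# 1 anchor, 1 positive and 15 negatives.
import torch
import torch.nn.functional as F

row = {
    "anchor": "how do jellyfish reproduce?",
    "positive": "Jellyfish reproduce both sexually and asexually ...",
    "negatives": [f"unrelated or hard-negative passage {i}" for i in range(15)],
}  # 1 + 1 + 15 = 17 items per row

def info_nce(encode, row, temperature=0.05):
    """Cross-entropy of the positive against the 15 negatives."""
    anchor = encode([row["anchor"]])                            # (1, d)
    candidates = encode([row["positive"]] + row["negatives"])   # (16, d)
    sims = F.cosine_similarity(anchor, candidates) / temperature  # (16,)
    labels = torch.zeros(1, dtype=torch.long)                   # positive sits at index 0
    return F.cross_entropy(sims.unsqueeze(0), labels)

# dummy encoder just to show the shapes; a real run would use the embedding model
encode = lambda texts: torch.randn(len(texts), 768)
loss = info_nce(encode, row)
```

The positive is always placed at index 0 of the candidate list, so the classification label for every row is simply 0.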
Thanks❤️!
kwang2049 changed discussion status to closed