Abstract
In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
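For readers who want a concrete picture of the training objective, below is a minimal PyTorch sketch of a standard contrastive (InfoNCE) loss with in-batch negatives, the kind of loss the abstract refers to; the temperature value and the absence of hard negatives are simplifying assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): standard contrastive loss with
# in-batch negatives for embedding fine-tuning. The temperature value is an
# illustrative assumption.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     passage_emb: torch.Tensor,
                     temperature: float = 0.02) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim); row i of each forms a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```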
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Making Large Language Models A Better Foundation For Dense Retrieval (2023)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval (2023)
- JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report (2023)
- RankingGPT: Empowering Large Language Models in Text Ranking with Progressive Enhancement (2023)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval (2023)
Hi there @intfloat! Great work, are you planning to release the full prompts and sampling values used, in order to ease reproduction for other languages, scenarios, lengths, etc.? Thanks in advance 🤗
Super cool stuff!
Hi @alvarobartt,
Thanks for asking! We will release the full prompts and sampling values in the coming revision.
@intfloat Thanks for the contribution. Is there already a timeline for the public release of the prompts?
Hi @intfloat,
Thanks for your fast response. I looked at Table 7 to Table 12.
I calculated all possible prompts that can be created from the prompt templates (unit, src_lang, ...).
That is 10822 unique prompts per language. For just one single language I would like to have more generations than just 10K.
Do you use the same prompts multiple times to generate multiple examples?
Or do you use each specific prompt just once?
@florianhoenicke For each GPT-4 call, we randomly sample all slot values, so the same prompt will be reused multiple times given our data volume. Since we sample the GPT-4 outputs, we still get different data points even for the same prompt.
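For illustration, here is a minimal sketch of what this slot sampling could look like; the template text and slot values below are placeholders, not the actual prompts from Tables 7 to 12.

```python
# Hypothetical illustration of slot sampling: the template and slot values are
# placeholders, not the paper's actual prompts.
import random

TEMPLATE = (
    "Brainstorm a retrieval task, then write a {query_length} query in {query_language} "
    "and a positive document of about {num_words} words."
)

SLOTS = {
    "query_length": ["short", "long"],
    "query_language": ["English", "German", "Japanese"],
    "num_words": ["50", "100", "200"],
}

def sample_prompt(rng: random.Random) -> str:
    # Every slot is drawn independently per call, so a given filled-in prompt
    # recurs across many calls, while sampling the GPT-4 output still yields
    # different data points for the same prompt.
    return TEMPLATE.format(**{k: rng.choice(v) for k, v in SLOTS.items()})

print(sample_prompt(random.Random(0)))
```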
Hey @intfloat,
is the reported performance really fair? BEIR is meant to be zero-shot, but the model is trained on many of the train sets of the BEIR benchmarks. Quora Duplicates, which is also one of the BEIR benchmark datasets, is used for training as well. It would be really interesting to see how much of the performance actually comes from the synthetic data. Have you tried training in the "full data, without synthetic data" setting?
We managed to replicate your work with open-source LLMs but found it too problematic to publish due to the overlap of the training data with BEIR, which is a bit misleading here, especially regarding the ranking on MTEB. I believe the approach is really promising.
Hi @angygraycat,
We agree that our "full data" setting does not strictly follow the original BEIR setting (zero-shot, trained only on MS-MARCO). That's why we also report results for a "synthetic data + MS MARCO" setting, where the only supervision comes from the MS-MARCO passage ranking dataset.
We have included a "full-data, without synthetic data" setting in the latest version ("w/o synthetic data" in Table 15). A "Test Set Contamination Analysis" section is also included in Appendix B. Please check them out if you are interested.
Our released model is intended to be a strong embedding model by utilizing as much supervised data as possible. For academic purposes, you may compare only against the "synthetic data + MS MARCO" setting.
And we want to point out that most top-performing models on the MTEB leaderboard use as much (if not more) supervised data as we do.
Thank you for your reply @intfloat. I think for the leaderboard only the synthetic + MS MARCO setting should count, since that is how the benchmark is intended to be used. And I strongly agree that most of the top-performing models on MTEB are more than problematic in this regard. Moving away from zero-shot prediction on BEIR makes the benchmark almost useless. The intention was to have "locked up" datasets that we can use to evaluate performance on out-of-distribution data, as in real life, without the pain of collecting and curating the data. However, I really liked the paper, just asked for clarification :).
@intfloat Thanks for the clear answer. I finished the implementation of your paper and generated examples. I found out that the examples cluster quite heavily because the task generation is quite deterministic, even with temperature 1.
For instance, after running the task generation prompt for short-long retrieval twice, I got the following task descriptions as the first element of the 20 tasks:
- Retrieve articles and research papers that explain the latest advancements in renewable energy technology.
- Retrieve scientific articles matching a query about the latest advancements in renewable energy technologies.
Did you experience the same problem?
I have some suggestions and would be interested in your opinion:
a) maintain a list of tasks generated in the past and filter out duplicates by semantic matching (a minimal sketch of this follows below)
b) when creating new tasks, provide a sliding-window list of the last x generated tasks as part of the prompt and instruct the model to create task descriptions that are semantically orthogonal to the ones that were already created
c) increase the temperature even more and introduce an LLM-based filter step for low-quality task descriptions
d) give more inspiration by passing a random word from the English dictionary :D
I'm curious to hear what you think.
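For concreteness, here is a minimal sketch of suggestion a), assuming a sentence-transformers model for the semantic match; the model name and similarity threshold are arbitrary choices for illustration.

```python
# Sketch of suggestion a): drop a newly generated task description if it is
# too close (cosine similarity) to anything already kept. The model and the
# threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_dedup(tasks: list[str], threshold: float = 0.85) -> list[str]:
    kept: list[str] = []
    kept_embs: list[np.ndarray] = []
    for task, emb in zip(tasks, model.encode(tasks, normalize_embeddings=True)):
        # Embeddings are L2-normalized, so the dot product is cosine similarity.
        if kept_embs and max(float(emb @ e) for e in kept_embs) >= threshold:
            continue  # near-duplicate of an earlier task, skip it
        kept.append(task)
        kept_embs.append(emb)
    return kept

print(semantic_dedup([
    "Retrieve articles and research papers that explain the latest advancements in renewable energy technology.",
    "Retrieve scientific articles matching a query about the latest advancements in renewable energy technologies.",
    "Find cooking recipes that match a given list of ingredients.",
]))
```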
Similar to your findings, I also observe that GPT-4 tends to favor certain topics such as climate change / cooking recipes / historical events, etc. But the overall diversity is acceptable, so I only apply minimal deduplication based on exact string matching.
Your suggestions totally make sense, a better prompting strategy will help generate better synthetic data.
Regarding b), our current prompt, which asks GPT-4 to generate 20 tasks in a single output, may already have a similar effect.
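For illustration, here is a hedged sketch of such a single brainstorming call that requests 20 task descriptions at once; the prompt wording is a paraphrase and the API parameters are assumptions, not the exact setup from the paper.

```python
# Illustrative only: one brainstorming call asking for 20 task descriptions at
# once, so the model avoids repeating items within the same output. The prompt
# is a paraphrase, not the paper's exact prompt.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BRAINSTORM_PROMPT = (
    "Brainstorm a list of 20 potentially useful text retrieval tasks. "
    "Make the tasks as diverse as possible and cover different domains. "
    "Return a JSON array of strings, one task description per element."
)

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": BRAINSTORM_PROMPT}],
    temperature=1.0,
)
tasks = json.loads(resp.choices[0].message.content)
print(len(tasks), tasks[:3])
```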