jupyterjazz committed
Merge branch 'main' into pr/19

Files changed:
- README.md +20 -63
- custom_st.py +6 -6
- model.safetensors +3 -0
- modules.json +1 -1
README.md CHANGED

````diff
@@ -21528,7 +21528,7 @@ model-index:
 </p>
 
 <p align="center">
-<b>
+<b>jina-embeddings-v3: Multilingual Embeddings With Task LoRA</b>
 </p>
 
 ## Quick Start
@@ -21541,12 +21541,12 @@ The easiest way to start using `jina-embeddings-v3` is with the [Jina Embedding
 
 `jina-embeddings-v3` is a **multilingual multi-task text embedding model** designed for a variety of NLP applications.
 Based on the [Jina-XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
-this model supports
-Additionally, it features 5
+this model supports Rotary Position Embeddings to handle long input sequences up to **8192 tokens**.
+Additionally, it features 5 LoRA adapters to generate task-specific embeddings efficiently.
 
 ### Key Features:
 - **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
-- **Task-Specific Embedding:** Customize embeddings through the `
+- **Task-Specific Embedding:** Customize embeddings through the `task` argument with the following options:
   - `retrieval.query`: Used for query embeddings in asymmetric retrieval tasks
   - `retrieval.passage`: Used for passage embeddings in asymmetric retrieval tasks
   - `separation`: Used for embeddings in clustering and re-ranking applications
@@ -21560,11 +21560,6 @@ While the foundation model supports 89 languages, we've focused our tuning effor
 Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
 Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
 
-
-## Data & Parameters
-
-The data and training details are described in the technical report (coming soon).
-
 ## Usage
 
 **<details><summary>Apply mean pooling when integrating the model.</summary>**
@@ -21605,7 +21600,7 @@ model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
 
 with torch.no_grad():
-    model_output = model(**encoded_input,
+    model_output = model(**encoded_input, task='retrieval.query')
 
 embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
 embeddings = F.normalize(embeddings, p=2, dim=1)
@@ -21643,10 +21638,10 @@ texts = [
     "Folge dem weißen Kaninchen.",  # German
 ]
 
-# When calling the `encode` function, you can choose a `
+# When calling the `encode` function, you can choose a `task` based on the use case:
 # 'retrieval.query', 'retrieval.passage', 'separation', 'classification', 'text-matching'
-# Alternatively, you can choose not to pass a `
-embeddings = model.encode(texts,
+# Alternatively, you can choose not to pass a `task`, and no specific LoRA adapter will be used.
+embeddings = model.encode(texts, task="text-matching")
 
 # Compute similarities
 print(embeddings[0] @ embeddings[1].T)
@@ -21680,11 +21675,11 @@ from sentence_transformers import SentenceTransformer
 
 model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
 
-
+task = "retrieval.query"
 embeddings = model.encode(
     ["What is the weather like in Berlin today?"],
-
-    prompt_name=
+    task=task,
+    prompt_name=task,
 )
 ```
 
@@ -21720,53 +21715,6 @@ outputs = session.run(None, inputs)
 
 
 
-
-## Performance
-
-### English MTEB
-| Model | Dimension | Average | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
-|:------------------------------:|:---------:|:---------:|:--------------:|:----------:|:-------------------:|:---------:|:---------:|:--------:|:-------------:|
-| jina-embeddings-v3 | 1024 | **65.60** | **82.58** | 45.27 | 84.01 | 58.13 | 53.87 | **85.8** | 30.98 |
-| jina-embeddings-v2-en | 768 | 58.12 | 68.82 | 40.08 | 84.44 | 55.09 | 45.64 | 80.00 | 30.56 |
-| text-embedding-3-large | 3072 | 62.03 | 75.45 | 49.01 | 84.22 | 59.16 | 55.44 | 81.04 | 29.92 |
-| multilingual-e5-large-instruct | 1024 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
-| Cohere-embed-multilingual-v3.0 | 1024 | 60.08 | 64.01 | 46.6 | 86.15 | 57.86 | 53.84 | 83.15 | 30.99 |
-
-### Multilingual MTEB
-
-| Model | Dimension | Average | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
-|:------------------------------:|:---------:|:---------:|:--------------:|:----------:|:-------------------:|:---------:|:---------:|:---------:|:-------------:|
-| jina-embeddings-v3 | 1024 | **64.44** | **71.46** | 46.71 | 76.91 | 63.98 | 57.98 | **69.83** | - |
-| multilingual-e5-large | 1024 | 59.58 | 65.22 | 42.12 | 76.95 | 63.4 | 52.37 | 64.65 | - |
-| multilingual-e5-large-instruct | 1024 | 64.25 | 67.45 | **52.12** | 77.79 | **69.02** | **58.38** | 68.77 | - |
-
-
-### Long Context Tasks (LongEmbed)
-
-| Model | Dimension | Average | NarrativeQA | Needle | Passkey | QMSum | SummScreen | WikiQA |
-|:----------------------:|:---------:|:---------:|:-----------:|:---------:|:----------:|:---------:|:----------:|:---------:|
-| jina-embeddings-v3* | 1024 | **70.39** | 33.32 | **84.00** | **100.00** | **39.75** | 92.78 | 72.46 |
-| jina-embeddings-v2 | 768 | 58.12 | 37.89 | 54.25 | 50.25 | 38.87 | 93.48 | 73.99 |
-| text-embedding-3-large | 3072 | 51.30 | 44.09 | 29.25 | 63.00 | 32.49 | 84.80 | 54.16 |
-| baai-bge-m3 | 1024 | 56.56 | **45.76** | 40.25 | 46.00 | 35.54 | **94.09** | **77.73** |
-
-Notes: `*`, use the text-matching adapter
-
-
-#### Matryoshka Embeddings
-
-| Dimension | Retrieval | STS |
-|:---------:|:---------:|:-----:|
-| 32 | 52.54 | 76.35 |
-| 64 | 58.54 | 77.03 |
-| 128 | 61.64 | 77.43 |
-| 256 | 62.72 | 77.56 |
-| 512 | 63.16 | 77.59 |
-| 768 | 63.3 | 77.59 |
-| 1024 | 63.35 | 77.58 |
-
-For a comprehensive evaluation and detailed metrics, please refer to the full paper available here (coming soon).
-
 ## Contact
 
 Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
@@ -21776,5 +21724,14 @@ Join our [Discord community](https://discord.jina.ai) and chat with other commun
 If you find `jina-embeddings-v3` useful in your research, please cite the following paper:
 
 ```bibtex
+@misc{sturua2024jinaembeddingsv3multilingualembeddingstask,
+      title={jina-embeddings-v3: Multilingual Embeddings With Task LoRA},
+      author={Saba Sturua and Isabelle Mohr and Mohammad Kalim Akram and Michael Günther and Bo Wang and Markus Krimmel and Feng Wang and Georgios Mastrapas and Andreas Koukounas and Andreas Koukounas and Nan Wang and Han Xiao},
+      year={2024},
+      eprint={2409.10173},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2409.10173},
+}
 
 ```
````
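The README hunks above show only fragments of the transformers-based example, so here is a minimal end-to-end sketch of that path with the new `task` keyword. The `mean_pooling` helper is the standard formulation and is an assumption here, since the diff does not reproduce the README's full `<details>` block:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

sentences = ["What is the weather like in Berlin today?"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # After this commit, the LoRA adapter is selected with `task` (one of the five names listed above).
    model_output = model(**encoded_input, task="retrieval.query")

embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
```

Omitting `task` runs the model without any LoRA adapter, as the updated comments in the diff note.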
custom_st.py CHANGED

```diff
@@ -91,19 +91,19 @@ class Transformer(nn.Module):
         self.auto_model.config.tokenizer_class = self.tokenizer.__class__.__name__
 
     def forward(
-        self, features: Dict[str, torch.Tensor],
+        self, features: Dict[str, torch.Tensor], task: Optional[str] = None
     ) -> Dict[str, torch.Tensor]:
         """Returns token_embeddings, cls_token"""
-        if
+        if task and task not in self._lora_adaptations:
             raise ValueError(
-                f"Unsupported task '{
+                f"Unsupported task '{task}'. "
                 f"Supported tasks are: {', '.join(self.config.lora_adaptations)}."
-                f"Alternatively, don't pass the `
+                f"Alternatively, don't pass the `task` argument to disable LoRA."
             )
 
         adapter_mask = None
-        if
-            task_id = self._adaptation_map[
+        if task:
+            task_id = self._adaptation_map[task]
             num_examples = features['input_ids'].size(0)
             adapter_mask = torch.full(
                 (num_examples,), task_id, dtype=torch.int32, device=features['input_ids'].device
```
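To make the adapter selection above concrete, here is an illustrative, self-contained sketch (not the repository's code) of how a `task` name becomes a per-example adapter mask. The task-to-id mapping stands in for `self._adaptation_map`, and its ordering is hypothetical:

```python
from typing import Optional

import torch

# Hypothetical stand-in for self._adaptation_map (task name -> LoRA adapter id).
ADAPTATION_MAP = {
    "retrieval.query": 0,
    "retrieval.passage": 1,
    "separation": 2,
    "classification": 3,
    "text-matching": 4,
}

def build_adapter_mask(task: Optional[str], num_examples: int) -> Optional[torch.Tensor]:
    """Mirror the forward() logic: one adapter id per example, or None to disable LoRA."""
    if task is None:
        return None
    if task not in ADAPTATION_MAP:
        raise ValueError(
            f"Unsupported task '{task}'. "
            f"Supported tasks are: {', '.join(ADAPTATION_MAP)}."
        )
    return torch.full((num_examples,), ADAPTATION_MAP[task], dtype=torch.int32)

print(build_adapter_mask("retrieval.query", num_examples=4))
# tensor([0, 0, 0, 0], dtype=torch.int32)
```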
model.safetensors ADDED

```diff
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:17ca06efd886a065d0081912b04c9e27ef5086a9dd09659cce32aa9c84587f23
+size 1144685320
```
modules.json CHANGED

```diff
@@ -4,7 +4,7 @@
     "name": "0",
     "path": "",
     "type": "custom_st.Transformer",
-    "kwargs": ["
+    "kwargs": ["task"]
   },
   {
     "idx": 1,
```
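For context: the `"kwargs": ["task"]` entry is what lets sentence-transformers forward a `task=...` keyword from `encode()` down to `custom_st.Transformer.forward()`. A minimal usage sketch under that assumption (requires a sentence-transformers version with module-kwargs support; the passage text is invented example data):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

# `task` is routed to custom_st.Transformer.forward via the module's "kwargs" list;
# `prompt_name` additionally selects the matching instruction prompt.
query_emb = model.encode(
    ["What is the weather like in Berlin today?"],
    task="retrieval.query",
    prompt_name="retrieval.query",
)
passage_emb = model.encode(
    ["Berlin will be cloudy today with light rain in the afternoon."],  # invented example passage
    task="retrieval.passage",
    prompt_name="retrieval.passage",
)
print(query_emb @ passage_emb.T)
```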