jupyterjazz committed f3b0d18 (parent: 1ecacfa): Update README.md

README.md CHANGED
```diff
@@ -123,11 +123,12 @@ library_name: transformers
 The easiest way to starting using `jina-embeddings-v3` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).
 
 
-## Intended Usage & Model
+## Intended Usage & Model Info
 
 
 `jina-embeddings-v3` is a **multilingual multi-task text embedding model** designed for a variety of NLP applications.
-Based on the [XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
+Based on the [XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
+this model supports [Rotary Position Embeddings (RoPE)](https://arxiv.org/abs/2104.09864) to handle long sequences up to **8192 tokens**.
 Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.
 
 ### Key Features:
@@ -143,11 +144,14 @@ Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to g
 ### Model Lineage:
 
 `jina-embeddings-v3` builds upon the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, which was originally trained on 100 languages.
-We extended its capabilities with an extra pretraining phase on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset,
+We extended its capabilities with an extra pretraining phase on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset,
+then contrastively fine-tuned it on 30 languages for enhanced performance in both monolingual and cross-lingual setups.
 
 ### Supported Languages:
 While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages to maximize performance:
-**Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
+**Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek,
+Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian,
+Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
 
 
 ## Data & Parameters
```
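The diff's new line says the model uses Rotary Position Embeddings (RoPE) to reach 8192-token sequences. As a rough illustration of the idea — not the model's actual implementation, which lives in the linked `jinaai/xlm-roberta-flash-implementation` code — here is a toy numpy sketch of applying rotary embeddings to a sequence of vectors; all names and dimensions are hypothetical:

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Toy rotary position embedding: rotate pairs of features by
    position-dependent angles. x has shape (seq_len, dim), dim even."""
    dim = x.shape[-1]
    half = dim // 2
    # One rotation frequency per feature pair, decaying geometrically.
    freqs = base ** (-np.arange(half) / half)            # (half,)
    angles = positions[:, None] * freqs[None, :]         # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Toy 8-dim vectors standing in for the model's hidden states.
x = np.random.default_rng(0).normal(size=(5, 8))
y = rope(x, np.arange(5, dtype=float))
```

Because RoPE is a pure rotation, it leaves vector norms unchanged and acts as the identity at position 0, which is why it extrapolates more gracefully to long sequences than learned absolute position embeddings.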