Commit d918e00 by eugene-yang (parent: 954fc3e): git update readme

README.md CHANGED

@@ -1,5 +1,5 @@
---
language:
- en
- zh
- fa

@@ -35,9 +35,9 @@ license: mit
Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation.
`plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng` is trained with a KL-divergence loss against the `mt5xxl` MonoT5 reranker
[`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k)
run as a teacher over English MS MARCO training queries and passages.
The teacher scores can be found in
[`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores/blob/main/t53b-monot5-msmarco-engeng.jsonl.gz).
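
Concretely, the distillation objective pushes the student's score distribution over each query's sampled passages toward the teacher's. The snippet below is a minimal sketch of such a KL-divergence loss, assuming PyTorch; the function name, tensor shapes, and temperature are illustrative, not the actual PLAID-X training code.

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_scores: torch.Tensor,
                    teacher_scores: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the sampled passages of each query.

    Both tensors have shape [batch, n_passages] (e.g. n_passages = 6,
    matching the mixing strategies described below).
    """
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')

# Example with 2 queries and 6 sampled passages each:
student = torch.randn(2, 6)   # student scores (e.g. ColBERT-X MaxSim)
teacher = torch.randn(2, 6)   # MonoT5 teacher scores from the JSONL file
loss = distill_kl_loss(student, teacher)
```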

### Training Parameters

@@ -49,18 +49,18 @@ The teacher scores can be found in

### Mixing Strategies

- `mix-passages`: languages are randomly assigned to the 6 sampled passages for a given query during training.
- `mix-entries`: all passages in a given query-passage set are randomly assigned to the same language.
- `round-robin-entires`: for each query, the query-passage set is repeated `n` times to iterate through all languages (see the sketch after this list).
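
The three strategies differ only in how document languages are assigned to a training entry's passages. The sketch below is hypothetical (the real logic lives in the PLAID-X training code); `LANGS` lists only the languages visible in this diff's front matter, and the function names are illustrative.

```python
import random

# Languages from the front matter shown above; the full list is
# truncated in this diff.
LANGS = ['en', 'zh', 'fa']

def mix_passages(n_passages: int = 6) -> list[str]:
    # Each passage in the entry gets an independently sampled language.
    return [random.choice(LANGS) for _ in range(n_passages)]

def mix_entries(n_passages: int = 6) -> list[str]:
    # One language is sampled per entry and shared by all of its passages.
    lang = random.choice(LANGS)
    return [lang] * n_passages

def round_robin_entries(n_passages: int = 6) -> list[list[str]]:
    # The whole entry is repeated once per language.
    return [[lang] * n_passages for lang in LANGS]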
```

## Usage

To properly load ColBERT-X models from the Huggingface Hub, please use the following version of PLAID-X.
```bash
pip install 'PLAID-X>=0.3.1'
```

The following code snippet loads the model through the Huggingface API.
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

# Downloads the ColBERT-X weights from the Hub and wraps them for inference.
ckpt = Checkpoint('hltcoe/plaidx-large-neuclir-mtd-mix-entries-mt5xxl-engeng',
                  colbert_config=ColBERTConfig())
```
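
Once loaded, the checkpoint can encode text directly. The continuation below is a sketch that assumes PLAID-X keeps ColBERT's `queryFromText`/`docFromText` helpers and standard MaxSim scoring; consult the PLAID-X documentation for the exact API.

```python
# Continues the snippet above. Assumption: ColBERT's text-encoding helpers
# are available on the PLAID-X Checkpoint object.
queries = ['what is multilingual information retrieval']
passages = ['MLIR systems rank documents written in many languages for one query.']

q_embs = ckpt.queryFromText(queries)    # [n_queries, query_len, dim]
d_embs = ckpt.docFromText(passages)     # [n_passages, doc_len, dim]

# Late-interaction (MaxSim) relevance: each query token takes its best match
# over passage tokens; the per-token maxima are summed into one score.
score = (q_embs[0] @ d_embs[0].T).max(dim=-1).values.sum()
```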

For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).

## BibTeX entry and Citation Info

Please cite the following two papers if you use the model.

@@ -93,5 +93,6 @@
```bibtex
  title = {Distillation for Multilingual Information Retrieval},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper)},
  year = {2024},
  url = {https://arxiv.org/abs/2405.00977}
}
```