Commit 1cb1b74 (parent be4fdf4) by zhichao yang: Update README.md
# whaleloops/phrase-bert

This is the official repository for the EMNLP 2021 long paper [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304). We provide [code](https://github.com/sf-wa-326/phrase-bert-topic-model) for training and evaluating Phrase-BERT, in addition to the datasets used in the paper.
## Usage (Sentence-Transformers)

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

phrase_list = ['play an active role', 'participate actively', 'active lifestyle']

model = SentenceTransformer('whaleloops/phrase-bert')
phrase_embs = model.encode(phrase_list)
[p1, p2, p3] = phrase_embs
```
As in sentence-BERT, the default output is a list of numpy arrays:

```python
for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")
```
An example of computing the dot product of phrase embeddings:

```python
import numpy as np

print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
```
An example of computing the cosine similarity of phrase embeddings:

```python
import torch
from torch import nn

cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim(torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim(torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim(torch.tensor(p2), torch.tensor(p3))}')
```
The output should look like:

```
The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
```
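Note that the dot products and cosine similarities above tell different stories because the embeddings are not unit-normalized. A minimal numpy sketch of the relationship, using small toy vectors in place of the real (much higher-dimensional) Phrase-BERT embeddings:

```python
import numpy as np

# Toy stand-ins for two phrase embeddings; illustrative only.
p1 = np.array([3.0, 4.0, 0.0])
p2 = np.array([4.0, 3.0, 0.0])

# Cosine similarity is the dot product of the L2-normalized vectors,
# so it stays in [-1, 1] regardless of embedding magnitude.
cos = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
print(f'dot = {np.dot(p1, p2)}, cosine = {cos}')  # dot = 24.0, cosine = 0.96
```

If you plan to rank phrases by similarity, normalizing the embeddings first makes the two measures agree.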
## Evaluation

Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:
* Turney [[Download](https://storage.googleapis.com/phrase-bert/turney/data.txt)]
* BiRD [[Download](https://storage.googleapis.com/phrase-bert/bird/data.txt)]
* PPDB [[Download](https://storage.googleapis.com/phrase-bert/ppdb/examples.json)]
* PPDB-filtered [[Download](https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json)]
* PAWS-short [[Download Train-split](https://storage.googleapis.com/phrase-bert/paws_short/train_examples.json)] [[Download Dev-split](https://storage.googleapis.com/phrase-bert/paws_short/dev_examples.json)] [[Download Test-split](https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json)]
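The PPDB, PPDB-filtered, and PAWS-short downloads are JSON files; their exact schema is not documented here, so a loader can only be sketched generically (the `load_examples` helper below is hypothetical, not part of the released code):

```python
import json

def load_examples(path):
    # Hypothetical helper: parse one of the downloaded examples.json files.
    # The schema of individual records is an assumption; inspect a few
    # records before writing task-specific code.
    with open(path, encoding='utf-8') as f:
        examples = json.load(f)
    print(f'Loaded {len(examples)} records from {path}')
    return examples
```

The Turney and BiRD downloads are plain `data.txt` files instead, so those would be read line by line rather than with `json.load`.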
Change `config/model_path.py` to point to the model path in your directories, then:

* For evaluation on Turney, run `python eval_turney.py`
* For evaluation on BiRD, run `python eval_bird.py`
* For evaluation on PPDB / PPDB-filtered / PAWS-short, run `eval_ppdb_paws.py` with:

```shell
nohup python -u eval_ppdb_paws.py \
    --full_run_mode \
    --task <task-name> \
    --data_dir <input-data-dir> \
    --result_dir <result-storage-dir> \
    >./output.txt 2>&1 &
```
## Train your own Phrase-BERT

If you would like to go beyond using the pre-trained Phrase-BERT model, you may train your own Phrase-BERT using data from the domain you are interested in. Please refer to `phrase-bert/phrase_bert_finetune.py`.

The datasets we used to fine-tune Phrase-BERT are here: [training data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_train.csv) and [validation data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_valid.csv).

To reproduce the trained Phrase-BERT, please run:

```shell
export INPUT_DATA_PATH=<directory-of-phrasebert-finetuning-data>
export TRAIN_DATA_FILE=<training-data-filename.csv>
export VALID_DATA_FILE=<validation-data-filename.csv>
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=<directory-of-saved-model>

python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH
```
118 |
+
|
119 |
+
## Citation:
|
120 |
+
Please cite us if you find this useful:
|
121 |
+
````
|
122 |
+
@inproceedings{phrasebertwang2021,
|
123 |
+
author={Shufan Wang and Laure Thompson and Mohit Iyyer},
|
124 |
+
Booktitle = {Empirical Methods in Natural Language Processing},
|
125 |
+
Year = "2021",
|
126 |
+
Title={Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
|
127 |
+
}
|
128 |
+
````
|