---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

# whaleloops/phrase-bert

This is the official repository for the EMNLP 2021 long paper [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304). We provide [code](https://github.com/sf-wa-326/phrase-bert-topic-model) for training and evaluating Phrase-BERT in addition to the datasets used in the paper.



## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
phrase_list = [ 'play an active role', 'participate actively', 'active lifestyle']

model = SentenceTransformer('whaleloops/phrase-bert')
phrase_embs = model.encode( phrase_list )
[p1, p2, p3] = phrase_embs
```

As in sentence-BERT, the default output is a list of numpy arrays:
```python
for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")
```
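
If you prefer working with tensors, `encode` can also return a PyTorch tensor directly via the `convert_to_tensor` flag; this is a general sentence-transformers option rather than anything specific to Phrase-BERT, so treat the sketch below as a convenience, not part of the official pipeline:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('whaleloops/phrase-bert')
phrase_list = ['play an active role', 'participate actively', 'active lifestyle']

# encode directly to a single PyTorch tensor of shape (num_phrases, hidden_dim)
phrase_embs = model.encode(phrase_list, convert_to_tensor=True)
print(phrase_embs.shape)
```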

An example of computing the dot product of phrase embeddings:
```python
import numpy as np

print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
```

An example of computing cosine similarity of phrase embeddings:
```python
import torch
from torch import nn

cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim(torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim(torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim(torch.tensor(p2), torch.tensor(p3))}')
```

The output should look like:
```
The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
```
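
If you want all pairwise similarities at once instead of pair by pair, recent sentence-transformers releases ship a `util.cos_sim` helper; the sketch below uses that general utility (it is not part of the Phrase-BERT code itself):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('whaleloops/phrase-bert')
phrase_list = ['play an active role', 'participate actively', 'active lifestyle']

# encode once and compute the full 3x3 cosine-similarity matrix in one call
phrase_embs = model.encode(phrase_list, convert_to_tensor=True)
sim_matrix = util.cos_sim(phrase_embs, phrase_embs)
print(sim_matrix)
```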



## Evaluation
Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:

* Turney [[Download](https://storage.googleapis.com/phrase-bert/turney/data.txt) ]
* BiRD [[Download](https://storage.googleapis.com/phrase-bert/bird/data.txt)]
* PPDB [[Download](https://storage.googleapis.com/phrase-bert/ppdb/examples.json)]
* PPDB-filtered [[Download](https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json)]
* PAWS-short [[Download Train-split](https://storage.googleapis.com/phrase-bert/paws_short/train_examples.json) ] [[Download Dev-split](https://storage.googleapis.com/phrase-bert/paws_short/dev_examples.json) ] [[Download Test-split](https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json) ]
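
For convenience, the evaluation files above can also be fetched programmatically. The sketch below simply downloads the URLs listed above; the local `data/<task>/` layout is an assumption for illustration, so adjust the paths to match whatever your `config/model_path.py` expects:

```python
import os
import urllib.request

# dataset name -> download URL (the URLs listed above)
DATASETS = {
    'turney': 'https://storage.googleapis.com/phrase-bert/turney/data.txt',
    'bird': 'https://storage.googleapis.com/phrase-bert/bird/data.txt',
    'ppdb': 'https://storage.googleapis.com/phrase-bert/ppdb/examples.json',
    'ppdb_filtered': 'https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json',
    'paws_short_test': 'https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json',
}

for name, url in DATASETS.items():
    out_dir = os.path.join('data', name)  # assumed layout: data/<task>/<file>
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, out_path)
    print(f'downloaded {name} -> {out_path}')
```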


Update `config/model_path.py` with the model path according to your directories, then:
* For evaluation on Turney, run `python eval_turney.py`
* For evaluation on BiRD, run `python eval_bird.py`
* For evaluation on PPDB / PPDB-filtered / PAWS-short, run `eval_ppdb_paws.py` with:

    ```
    nohup python -u eval_ppdb_paws.py \
        --full_run_mode \
        --task <task-name> \
        --data_dir <input-data-dir> \
        --result_dir <result-storage-dir> \
        >./output.txt 2>&1 &
    ```

## Train your own Phrase-BERT
If you would like to go beyond using the pre-trained Phrase-BERT model, you may train your own Phrase-BERT on data from the domain you are interested in; please refer to `phrase-bert/phrase_bert_finetune.py`.

The datasets we used to fine-tune Phrase-BERT are here: [training data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_train.csv) and [validation data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_valid.csv).
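
If you want a quick look at the fine-tuning data before training, the sketch below just prints the first few rows of the downloaded CSV; it makes no assumption about the column layout of the triples file, and the local path is only a placeholder for wherever you saved the download:

```python
import csv

# path to the downloaded training CSV (adjust to your local copy)
train_path = 'pooled_context_para_triples_p=0.8_train.csv'

with open(train_path, newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)
        if i >= 4:  # show only the first five rows
            break
```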

To reproduce the trained Phrase-BERT, please run:

    export INPUT_DATA_PATH=<directory-of-phrasebert-finetuning-data>
    export TRAIN_DATA_FILE=<training-data-filename.csv>
    export VALID_DATA_FILE=<validation-data-filename.csv>
    export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens 
    export OUTPUT_MODEL_PATH=<directory-of-saved-model>


    python -u phrase_bert_finetune.py \
        --input_data_path $INPUT_DATA_PATH \
        --train_data_file $TRAIN_DATA_FILE \
        --valid_data_file $VALID_DATA_FILE \
        --input_model_path $INPUT_MODEL_PATH \
        --output_model_path $OUTPUT_MODEL_PATH
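
Once training finishes, the saved checkpoint can be loaded the same way as the pre-trained model; a minimal sketch, where `<directory-of-saved-model>` is the `OUTPUT_MODEL_PATH` used above:

```python
from sentence_transformers import SentenceTransformer

# load the fine-tuned checkpoint written to OUTPUT_MODEL_PATH above
model = SentenceTransformer('<directory-of-saved-model>')
embs = model.encode(['play an active role', 'participate actively'])
print(embs.shape)
```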

## Citation:
Please cite us if you find this useful:
```
@inproceedings{phrasebertwang2021,
    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2021},
    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
}
```