---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

# whaleloops/phrase-bert

This is the official repository for the EMNLP 2021 long paper [Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration](https://arxiv.org/abs/2109.06304). We provide [code](https://github.com/sf-wa-326/phrase-bert-topic-model) for training and evaluating Phrase-BERT, in addition to the datasets used in the paper.

## Usage (Sentence-Transformers)

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

phrase_list = ['play an active role', 'participate actively', 'active lifestyle']

model = SentenceTransformer('whaleloops/phrase-bert')
phrase_embs = model.encode(phrase_list)
[p1, p2, p3] = phrase_embs
```

As in Sentence-BERT, the default output is a list of numpy arrays:

```python
for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")
```

An example of computing the dot product of phrase embeddings:

```python
import numpy as np

print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
```

An example of computing cosine similarity of phrase embeddings:

```python
import torch
from torch import nn

cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim(torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim(torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim(torch.tensor(p2), torch.tensor(p3))}')
```

The output should look like:

```
The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
```
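
If you prefer a single call, the full pairwise cosine-similarity matrix can also be computed with the `util.cos_sim` helper that ships with sentence-transformers. A minimal sketch using the same phrases as above (`util.cos_sim` accepts the numpy embeddings directly and returns a torch tensor):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('whaleloops/phrase-bert')
phrase_list = ['play an active role', 'participate actively', 'active lifestyle']
phrase_embs = model.encode(phrase_list)

# Entry [i][j] of the resulting (3, 3) matrix is the cosine
# similarity between phrase i and phrase j.
sim_matrix = util.cos_sim(phrase_embs, phrase_embs)
print(sim_matrix)
```
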
## Evaluation

Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:

* Turney [[Download](https://storage.googleapis.com/phrase-bert/turney/data.txt)]
* BiRD [[Download](https://storage.googleapis.com/phrase-bert/bird/data.txt)]
* PPDB [[Download](https://storage.googleapis.com/phrase-bert/ppdb/examples.json)]
* PPDB-filtered [[Download](https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json)]
* PAWS-short [[Download Train-split](https://storage.googleapis.com/phrase-bert/paws_short/train_examples.json)] [[Download Dev-split](https://storage.googleapis.com/phrase-bert/paws_short/dev_examples.json)] [[Download Test-split](https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json)]
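
For convenience, the evaluation files can also be fetched programmatically. A small helper sketch using only the URLs above (the local filenames are arbitrary, and the PAWS-short train/dev splits can be added the same way):

```python
import urllib.request

# URLs taken from the list above; local filenames are arbitrary.
datasets = {
    'turney_data.txt': 'https://storage.googleapis.com/phrase-bert/turney/data.txt',
    'bird_data.txt': 'https://storage.googleapis.com/phrase-bert/bird/data.txt',
    'ppdb_examples.json': 'https://storage.googleapis.com/phrase-bert/ppdb/examples.json',
    'ppdb_filtered_examples.json': 'https://storage.googleapis.com/phrase-bert/ppdb_exact/examples.json',
    'paws_short_test_examples.json': 'https://storage.googleapis.com/phrase-bert/paws_short/test_examples.json',
}
for filename, url in datasets.items():
    urllib.request.urlretrieve(url, filename)
```
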
Update `config/model_path.py` with the model path according to your directories, then:

* For evaluation on Turney, run `python eval_turney.py`
* For evaluation on BiRD, run `python eval_bird.py`
* For evaluation on PPDB / PPDB-filtered / PAWS-short, run `eval_ppdb_paws.py` as follows:

```bash
nohup python -u eval_ppdb_paws.py \
    --full_run_mode \
    --task <task-name> \
    --data_dir <input-data-dir> \
    --result_dir <result-storage-dir> \
    >./output.txt 2>&1 &
```

## Train your own Phrase-BERT

If you would like to go beyond using the pre-trained Phrase-BERT model, you can train your own Phrase-BERT on data from the domain you are interested in; please refer to `phrase-bert/phrase_bert_finetune.py`.

The datasets we used to fine-tune Phrase-BERT are here: [training data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_train.csv) and [validation data csv file](https://storage.googleapis.com/phrase-bert/phrase-bert-ft-data/pooled_context_para_triples_p%3D0.8_valid.csv).

To reproduce the trained Phrase-BERT, run:

```bash
export INPUT_DATA_PATH=<directory-of-phrasebert-finetuning-data>
export TRAIN_DATA_FILE=<training-data-filename.csv>
export VALID_DATA_FILE=<validation-data-filename.csv>
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=<directory-of-saved-model>

python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH
```
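
For orientation, the sketch below shows what triplet-style fine-tuning with sentence-transformers can look like. It is not the official training script (use `phrase_bert_finetune.py` for that), and it assumes each CSV row holds an (anchor, positive, negative) phrase triple; the column layout and hyperparameters here are illustrative guesses:

```python
import csv
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: each row of the training CSV is an
# (anchor, positive, negative) phrase triple.
train_examples = []
with open('pooled_context_para_triples_p=0.8_train.csv') as f:
    for row in csv.reader(f):
        anchor, positive, negative = row[:3]
        train_examples.append(InputExample(texts=[anchor, positive, negative]))

# Start from the same base checkpoint used above.
model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    output_path='my-phrase-bert',
)
```
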
## Citation

Please cite us if you find this useful:

```bibtex
@inproceedings{phrasebertwang2021,
    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2021},
    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
}
```