# kwang2049/TSDAE-askubuntu2nli_stsb

This is a model from the paper ["TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning"](https://arxiv.org/abs/2104.06979). It adapts the knowledge from the NLI and STSb data to the target domain AskUbuntu. The training procedure of this model was:
 1. Initialized with [bert-base-uncased](https://huggingface.co/bert-base-uncased);
 2. Unsupervised training on AskUbuntu with the TSDAE objective;
 3. Supervised training on the NLI data with cross-entropy loss;
 4. Supervised training on the STSb data with MSE loss.
 
 The pooling method is CLS-pooling.
 
 ## Usage
 The most convenient way to use this model is through [SentenceTransformers](https://github.com/UKPLab/sentence-transformers). Install it via:
 ```bash
 pip install sentence-transformers
 ```
 And then load the model and use it to encode sentences:
 ```python
 from sentence_transformers import SentenceTransformer, models
 dataset = 'askubuntu'
 model_name_or_path = f'kwang2049/TSDAE-{dataset}2nli_stsb'
 model = SentenceTransformer(model_name_or_path)
 model[1] = models.Pooling(model[0].get_word_embedding_dimension(), pooling_mode='cls')  # Note this model uses CLS-pooling
 sentence_embeddings = model.encode(['This is the first sentence.', 'This is the second one.'])
 ```
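
 The resulting embeddings can then be compared, e.g. with cosine similarity. A minimal sketch using `sentence_transformers.util.cos_sim` (in older versions of the library this helper is called `util.pytorch_cos_sim`):

 ```python
 from sentence_transformers import util

 # Cosine similarity between the two sentence embeddings, a value in [-1, 1].
 similarity = util.cos_sim(sentence_embeddings[0], sentence_embeddings[1])
 print(float(similarity))
 ```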

 ## Evaluation
 To evaluate the model against the datasets used in the paper, please install our evaluation toolkit [USEB](https://github.com/UKPLab/useb):
 ```bash
 pip install useb  # Or git clone and pip install .
 python -m useb.downloading all  # Download both training and evaluation data
 ```
 And then do the evaluation:
```python
from sentence_transformers import SentenceTransformer, models
import torch
from useb import run_on

dataset = 'askubuntu'
model_name_or_path = f'kwang2049/TSDAE-{dataset}2nli_stsb'
model = SentenceTransformer(model_name_or_path)
model[1] = models.Pooling(model[0].get_word_embedding_dimension(), pooling_mode='cls')  # Note this model uses CLS-pooling

@torch.no_grad()
def semb_fn(sentences) -> torch.Tensor:
    return torch.Tensor(model.encode(sentences, show_progress_bar=False))

result = run_on(
    dataset,
    semb_fn=semb_fn,
    eval_type='test',
    data_eval_path='data-eval'
)
```
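
 The returned `result` is a dictionary of evaluation scores; for AskUbuntu, the main metric reported in the paper is MAP. See the [USEB](https://github.com/UKPLab/useb) repository for details on the metrics and data format.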
 
 ## Training
 Please refer to [the page of TSDAE training](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/TSDAE) in SentenceTransformers.
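
 For reference, below is a minimal sketch of the unsupervised TSDAE step (step 2 in the training procedure above), following the SentenceTransformers TSDAE example. Here `train_sentences` is a placeholder for unlabeled in-domain sentences, and the hyperparameters are illustrative rather than the exact ones used for this model; the supervised NLI/STSb steps are not shown.

 ```python
 from torch.utils.data import DataLoader
 from sentence_transformers import SentenceTransformer, models, losses, datasets

 # Placeholder: unlabeled in-domain sentences (use real AskUbuntu text).
 train_sentences = [
     'How do I mount a USB drive?',
     'Ubuntu does not boot after the latest update.',
 ]

 # BERT encoder with CLS-pooling, matching this model's architecture.
 word_embedding_model = models.Transformer('bert-base-uncased')
 pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
 model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

 # The dataset wrapper creates noisy (word-deleted) inputs on the fly.
 train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
 train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

 # Denoising auto-encoder loss; the decoder weights are tied to the encoder.
 train_loss = losses.DenoisingAutoEncoderLoss(
     model, decoder_name_or_path='bert-base-uncased', tie_encoder_decoder=True
 )

 model.fit(
     train_objectives=[(train_dataloader, train_loss)],
     epochs=1,
     weight_decay=0,
     scheduler='constantlr',
     optimizer_params={'lr': 3e-5},
     show_progress_bar=True,
 )
 ```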
 
 ## Cite & Authors
 If you use the code for evaluation, feel free to cite our publication [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979):
```bibtex 
@article{wang-2021-TSDAE,
    title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoderfor Unsupervised Sentence Embedding Learning",
    author = "Wang, Kexin and Reimers, Nils and  Gurevych, Iryna", 
    journal= "arXiv preprint arXiv:2104.06979",
    month = "4",
    year = "2021",
    url = "https://arxiv.org/abs/2104.06979",
}
```