# kwang2049/TSDAE-scidocs2nli_stsb
This is a model from the paper ["TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning"](https://arxiv.org/abs/2104.06979). This model adapts the knowledge from the NLI and STSb data to the specific domain scidocs. Training procedure of this model:
1. Initialized with [bert-base-uncased](https://huggingface.co/bert-base-uncased);
2. Unsupervised training on scidocs with the TSDAE objective;
3. Supervised training on the NLI data with cross-entropy loss;
4. Supervised training on the STSb data with MSE loss.

The pooling method is CLS-pooling.
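CLS-pooling takes the encoder's hidden state at the `[CLS]` position (index 0) as the sentence embedding, instead of averaging all token embeddings. A minimal sketch of the operation on a dummy batch (the tensor shapes are illustrative, not taken from this model card):

```python
import torch

# Dummy encoder output of shape (batch, seq_len, hidden_dim);
# BERT-base uses hidden_dim = 768
token_embeddings = torch.randn(2, 5, 768)

# CLS-pooling: keep only the embedding at position 0 (the [CLS] token)
cls_embeddings = token_embeddings[:, 0, :]
print(cls_embeddings.shape)  # torch.Size([2, 768])
```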

## Usage
A convenient way to use this model is through [SentenceTransformers](https://github.com/UKPLab/sentence-transformers). Please install it via:
```bash
pip install sentence-transformers
```
Then load the model and use it to encode sentences:
```python
from sentence_transformers import SentenceTransformer, models

dataset = 'scidocs'
model_name_or_path = f'kwang2049/TSDAE-{dataset}2nli_stsb'
model = SentenceTransformer(model_name_or_path)
model[1] = models.Pooling(model[0].get_word_embedding_dimension(), pooling_mode='cls')  # Note: this model uses CLS-pooling
sentence_embeddings = model.encode(['This is the first sentence.', 'This is the second one.'])
```
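The embeddings returned by `model.encode` are plain vectors, typically compared with cosine similarity. A hedged sketch of that comparison (dummy vectors are used here so the snippet runs without downloading the model):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, a and b would be rows of the array returned by model.encode(...)
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(round(cos_sim(a, b), 4))  # 0.5
```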
## Evaluation
To evaluate the model against the datasets used in the paper, please install our evaluation toolkit [USEB](https://github.com/UKPLab/useb):
```bash
pip install useb  # Or git clone and pip install .
python -m useb.downloading all  # Download both training and evaluation data
```
Then run the evaluation:
```python
from sentence_transformers import SentenceTransformer, models
import torch
from useb import run_on

dataset = 'scidocs'
model_name_or_path = f'kwang2049/TSDAE-{dataset}2nli_stsb'
model = SentenceTransformer(model_name_or_path)
model[1] = models.Pooling(model[0].get_word_embedding_dimension(), pooling_mode='cls')  # Note: this model uses CLS-pooling

@torch.no_grad()
def semb_fn(sentences) -> torch.Tensor:
    return torch.Tensor(model.encode(sentences, show_progress_bar=False))

result = run_on(
    dataset,
    semb_fn=semb_fn,
    eval_type='test',
    data_eval_path='data-eval'
)
```

## Training
Please refer to [the page of TSDAE training](https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/TSDAE) in SentenceTransformers.

## Cite & Authors
If you use the code for evaluation, feel free to cite our publication [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979):
```bibtex
@article{wang-2021-TSDAE,
    title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning",
    author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna",
    journal = "arXiv preprint arXiv:2104.06979",
    month = "4",
    year = "2021",
    url = "https://arxiv.org/abs/2104.06979",
}
```