AgaMiko committed on
Commit
ba1318a
1 Parent(s): 2ca37ef

Upload README.md

Files changed (1):
1. README.md +63 -3
README.md CHANGED
---
license: cc-by-4.0
---

# SHerbert - Polish SentenceBERT
SentenceBERT is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Training was based on the original paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084), with a slight modification of how the training data was used. The goal of the model is to generate different embeddings based on the semantic and topic similarity of the given text.

> Semantic textual similarity analyzes how similar two pieces of text are.
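
The comparison metric is plain cosine similarity. As a minimal, self-contained illustration with toy vectors standing in for model outputs (`torch.nn.functional.cosine_similarity` is one convenient implementation):

```python
import torch
import torch.nn.functional as F

# Toy 3-dimensional "sentence embeddings"; real ones come from the model.
a = torch.tensor([[0.1, 0.9, 0.3]])
b = torch.tensor([[0.2, 0.8, 0.4]])

# cosine(a, b) = dot(a, b) / (||a|| * ||b||), ranging from -1 to 1.
print(F.cosine_similarity(a, b))  # close to 1.0 for near-parallel vectors
```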

Read more about how the model was prepared in our [blog post](https://voicelab.ai/blog/).

The base model is Polish HerBERT, a BERT-based language model. For more details, please refer to "HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish".

# Corpus
The model was trained solely on [Wikipedia](https://dumps.wikimedia.org/).


# Tokenizer

As in the original HerBERT implementation, the training dataset was tokenized into subwords using a character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the `tokenizers` library.

We kindly encourage you to use the fast version of the tokenizer, namely HerbertTokenizerFast.
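
A minimal sketch of loading the fast tokenizer, assuming the `voicelab/sbert-base` checkpoint used below ships HerBERT-compatible tokenizer files (`AutoTokenizer` should resolve to the same class):

```python
from transformers import HerbertTokenizerFast

# Assumes the checkpoint provides HerBERT tokenizer files.
tokenizer = HerbertTokenizerFast.from_pretrained("voicelab/sbert-base")

# Inspect the subword split produced by the CharBPE vocabulary.
print(tokenizer.tokenize("Uczenie maszynowe"))
```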

# Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import pairwise

sbert = AutoModel.from_pretrained("voicelab/sbert-base")
tokenizer = AutoTokenizer.from_pretrained("voicelab/sbert-base")

s0 = "Uczenie maszynowe jest konsekwencją rozwoju idei sztucznej inteligencji i metod jej wdrażania praktycznego."
s1 = "Głębokie uczenie maszynowe jest sktukiem wdrażania praktycznego metod sztucznej inteligencji oraz jej rozwoju."
s2 = "Kasparow zarzucił firmie IBM oszustwo, kiedy odmówiła mu dostępu do historii wcześniejszych gier Deep Blue. "

tokens = tokenizer([s0, s1, s2],
                   padding=True,
                   truncation=True,
                   return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    x = sbert(tokens["input_ids"],
              tokens["attention_mask"]).pooler_output

# similarity between sentences s0 and s1
# (2-D slices, since sklearn's cosine_similarity expects 2-D arrays)
print(pairwise.cosine_similarity(x[0:1], x[1:2]))  # Result: 0.7952354

# similarity between sentences s0 and s2
print(pairwise.cosine_similarity(x[0:1], x[2:3]))  # Result: 0.42359722
```
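
The snippet above takes `pooler_output` as the sentence embedding. Sentence-BERT setups frequently mean-pool the token embeddings instead; the sketch below (reusing `sbert` and `tokens` from the snippet above) shows that alternative, as an illustration rather than a statement of how this model was trained:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1)
    return summed / counts

with torch.no_grad():
    out = sbert(tokens["input_ids"], tokens["attention_mask"])
embeddings = mean_pool(out.last_hidden_state, tokens["attention_mask"])
print(embeddings.shape)  # (3, hidden_size)
```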


# License

CC BY 4.0

# Citation

If you use this model, please cite the following paper:


# Authors

The model was trained by the NLP Research Team at Voicelab.ai.

You can contact us [here](https://voicelab.ai/contact/).