auhide committed
Commit: 92e884b
1 Parent(s): a2a7fc6

Create README.md

Files changed (1): README.md +102 -0
README.md ADDED
---
inference: false
license: cc-by-4.0
datasets:
- wikiann
language:
- bg
metrics:
- accuracy
---

# 🇧🇬 KeyBERT-BG - Bulgarian Keyword Extraction
KeyBERT-BG is a model trained for keyword extraction in Bulgarian.

## Usage
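The snippets below assume that `transformers` and a backend such as `torch` are already installed; if they are not, a typical setup would be:
```sh
pip install transformers torch
```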
Import the libraries:
```python
import re
from typing import Dict
from pprint import pprint

from transformers import AutoTokenizer, AutoModelForTokenClassification
```

First, you'll have to define the following function, since the text preprocessing is custom and the standard `pipeline` approach won't suffice:
```python
def get_keywords(
    text: str,
    model_id: str = "auhide/keybert-bg",
    max_len: int = 300,
    id2group: Dict[int, str] = {
        # Indicates that this is not a keyword.
        0: "O",
        # Beginning of a keyword.
        1: "B-KWD",
        # Additional keyword tokens (might also indicate the end of a keyword sequence).
        # You can merge these with the beginning keyword `B-KWD`.
        2: "I-KWD",
    }
):
    # Initialize the tokenizer and model.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    keybert = AutoModelForTokenClassification.from_pretrained(model_id)

    # Preprocess the text.
    # Surround punctuation with whitespace and collapse repeated whitespace
    # into single spaces.
    text = re.sub(r"([,\.?!;:\'\"\(\)\[\]„”])", r" \1 ", text)
    text = re.sub(r"\s+", r" ", text)
    words = text.split()

    # Tokenize the processed `text` (this includes padding or truncation).
    tokens_data = tokenizer(
        text.strip(),
        padding="max_length",
        max_length=max_len,
        truncation=True,
        return_tensors="pt"
    )
    input_ids = tokens_data.input_ids
    attention_mask = tokens_data.attention_mask

    # Predict the keywords.
    out = keybert(input_ids, attention_mask=attention_mask).logits
    # Softmax the last dimension so that the probabilities add up to 1.0.
    out = out.softmax(-1)
    # Based on the probabilities, pick the most probable label for each token.
    out_argmax = out.argmax(-1)
    prediction = out_argmax.squeeze(0).tolist()
    probabilities = out.squeeze(0)

    return [
        {
            # Since the list of words does not have a [CLS] token, the index `i`
            # is one step ahead, which means that to access the corresponding
            # word we should use the index `i - 1`.
            "entity": words[i - 1],
            "entity_group": id2group[idx],
            "score": float(probabilities[i, idx])
        }
        for i, idx in enumerate(prediction)
        if idx == 1 or idx == 2
    ]
```
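
As the comments in `id2group` note, `I-KWD` tokens can be merged with the `B-KWD` token that precedes them to recover multi-word keywords. The model card does not ship a merging routine, so the following is only a minimal sketch of one way to do it, assuming that consecutive entries in the list returned by `get_keywords` belong to the same phrase:
```python
def merge_keywords(entities):
    # Fold I-KWD continuation entries into the B-KWD entry before them,
    # producing a plain list of keyword strings.
    keywords = []
    for entity in entities:
        if entity["entity_group"] == "B-KWD" or not keywords:
            # A B-KWD label (or an orphaned I-KWD at the start) opens a new keyword.
            keywords.append(entity["entity"])
        else:
            # An I-KWD label extends the most recent keyword.
            keywords[-1] += " " + entity["entity"]
    return keywords
```
For example, `merge_keywords(get_keywords(text))` would return a flat list of keyword strings instead of per-token dictionaries.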

Choose a text and run the model on it. For example, I've chosen [this](https://www.24chasa.bg/bulgaria/article/14466321) article.
Then, you can call `get_keywords` on it and extract its keywords:
```python
# Read the text from a file, since it is a whole article and the text is long.
with open("input_text.txt", "r", encoding="utf-8") as f:
    text = f.read()

keywords = get_keywords(text)
print("Keywords:")
pprint(keywords)
```
```sh
Keywords:
[{'entity': 'Пловдив', 'entity_group': 'B-KWD', 'score': 0.7669068574905396},
 {'entity': 'Шофьорът', 'entity_group': 'B-KWD', 'score': 0.9119699597358704},
 {'entity': 'катастрофа', 'entity_group': 'B-KWD', 'score': 0.8441269993782043}]
```
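
Each entry carries a `score`, so you can keep only the more confident keywords if needed. This is only a minimal sketch, with a hypothetical cut-off of `0.8`:
```python
# Hypothetical confidence threshold - tune it for your own texts.
CONFIDENCE_THRESHOLD = 0.8

confident_keywords = [
    kwd for kwd in keywords
    if kwd["score"] >= CONFIDENCE_THRESHOLD
]
pprint(confident_keywords)
```
On the example output above, this would keep 'Шофьорът' and 'катастрофа' and drop 'Пловдив' (score ≈ 0.77).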