---
language: en
datasets:
- tner/bc5cdr
- commanderstrife/jnlpba
- bc2gm_corpus
- drAbreu/bc4chemd_ner
- linnaeus
- chintagunta85/ncbi_disease
---

# SciBERT finetuned on JNLPBA for NER downstream task
## Language Model
[SciBERT](https://arxiv.org/pdf/1903.10676.pdf) is a pretrained language model based on BERT, trained by the
[Allen Institute for AI](https://allenai.org/) on papers from the
[Semantic Scholar](https://www.semanticscholar.org/) corpus.
The corpus contains 1.14M papers (3.1B tokens). SciBERT has its own vocabulary (scivocab), built to best match
the training corpus.
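
As a quick illustration of the scivocab vocabulary (not part of the original training setup), the sketch below compares how a biomedical term is tokenized by a general-domain BERT tokenizer and by SciBERT; the exact splits depend on the tokenizer versions installed.

````python
from transformers import AutoTokenizer

# General-domain BERT vs. SciBERT (scivocab). Both checkpoints are public on the Hugging Face Hub.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
scibert_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")

term = "glucocorticoid receptor"

# scivocab often splits biomedical terms into fewer word pieces than the general-domain vocabulary.
print("BERT:   ", bert_tokenizer.tokenize(term))
print("SciBERT:", scibert_tokenizer.tokenize(term))
````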
 
## Downstream task
[`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased#) has been finetuned for the Named Entity
Recognition (NER) downstream task. The code used to train the NER model can be found [here](https://github.com/fran-martinez/bio_ner_bert).

### Data
The corpus used to fine-tune the NER is [BioNLP / JNLPBA shared task](http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004).

- The training data consists of 2,000 PubMed abstracts with term/word annotations, corresponding to 18,546 samples (sentences).
- The evaluation data consists of 404 PubMed abstracts with term/word annotations, corresponding to 3,856 samples (sentences).

The classes (at word level) and their distribution (number of examples per class) for the training and evaluation datasets are shown below:
 
| Class Label         | # training examples| # evaluation examples|
|:--------------|--------------:|----------------:|
|O              |   382,963     |     81,647      |
|B-protein      |    30,269     |      5,067      |
|I-protein      |    24,848     |      4,774      |
|B-cell_type    |     6,718     |      1,921      |
|I-cell_type    |     8,748     |      2,991      |
|B-DNA          |     9,533     |      1,056      |
|I-DNA          |    15,774     |      1,789      |
|B-cell_line    |     3,830     |        500      |
|I-cell_line    |     7,387     |       9,89      |
|B-RNA          |       951     |        118      |
|I-RNA          |     1,530     |        187      |
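
If you want to reproduce these counts, a minimal sketch along the lines below should work; the dataset id and column/split names are assumptions, since several JNLPBA mirrors exist on the Hugging Face Hub (e.g. `jnlpba` or `commanderstrife/jnlpba`).

````python
from collections import Counter
from datasets import load_dataset

# Hypothetical dataset id; adjust to the JNLPBA mirror you actually use.
dataset = load_dataset("jnlpba")

def label_counts(split):
    # For Sequence(ClassLabel) columns, the label names live in features["ner_tags"].feature.names
    label_names = dataset[split].features["ner_tags"].feature.names  # O, B-protein, I-protein, ...
    counts = Counter()
    for tags in dataset[split]["ner_tags"]:
        counts.update(label_names[tag] for tag in tags)
    return counts

print(label_counts("train"))
print(label_counts("validation"))
````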

### Model
An exhaustive hyperparameter search was performed.
The hyperparameters that provided the best results are:

- Max length sequence: 128
- Number of epochs: 6
- Batch size: 32
- Dropout: 0.3
- Optimizer: Adam

The learning rate was 5e-5 with a linearly decreasing schedule. A warmup was applied at the beginning of training,
with a warmup ratio of 0.1 of the total training steps.
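
The training script linked above contains the authoritative details; purely as orientation, a configuration with these hyperparameters in current `transformers` could look roughly like the sketch below (the `Trainer` API and argument names are my assumption, not necessarily what the original script used).

````python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "allenai/scibert_scivocab_cased"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=128)  # max sequence length
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=11,            # the 11 IOB classes listed above
    hidden_dropout_prob=0.3,  # dropout reported above
)

training_args = TrainingArguments(
    output_dir="scibert-jnlpba-ner",
    num_train_epochs=6,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    lr_scheduler_type="linear",  # decreasing linear schedule
    warmup_ratio=0.1,            # 10% of the total training steps as warmup
)

# trainer = Trainer(model=model, args=training_args, train_dataset=..., eval_dataset=...)
# trainer.train()
````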

The model from the epoch with the best F1-score was selected, in this case, the model from epoch 5.


### Evaluation
The following table shows the evaluation metrics calculated at span/entity level:

| entity        | precision | recall | f1-score |
|:--------------|----------:|-------:|---------:|
| cell_line     |    0.5205 | 0.7100 |   0.6007 |
| cell_type     |    0.7736 | 0.7422 |   0.7576 |
| protein       |    0.6953 | 0.8459 |   0.7633 |
| DNA           |    0.6997 | 0.7894 |   0.7419 |
| RNA           |    0.6985 | 0.8051 |   0.7480 |
| **micro avg** |    0.6984 | 0.8076 |   0.7490 |
| **macro avg** |    0.7032 | 0.8076 |   0.7498 |

The macro F1-score is 0.7498, compared to the value reported by the Allen Institute for AI in their
[paper](https://arxiv.org/pdf/1903.10676.pdf), which is 0.7728. This drop in performance could be due to
several reasons; one hypothesis is that the authors used an additional conditional random field,
while this model uses a regular classification layer with softmax activation on top of the SciBERT model.

At word level, this model achieves a precision of 0.7742, a recall of 0.8536 and an F1-score of 0.8093.
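
For reference, span/entity-level metrics like the ones above are commonly computed with `seqeval`, which groups IOB tags back into entities before scoring; the evaluation code actually used is in the linked repository, so the snippet below is only an illustrative toy example.

````python
# pip install seqeval
from seqeval.metrics import classification_report

# Toy IOB tag sequences; in practice these come from the model predictions
# and the gold annotations of the evaluation set.
y_true = [["O", "B-protein", "I-protein", "O", "B-cell_type", "I-cell_type"]]
y_pred = [["O", "B-protein", "I-protein", "O", "B-cell_type", "O"]]

# An entity counts as correct only if both its boundaries and its type match exactly.
print(classification_report(y_true, y_pred, digits=4))
````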

### Model usage in inference
Use the pipeline:
````python
from transformers import pipeline

text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."

nlp_ner = pipeline("ner",
                   model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
                   tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba')

nlp_ner(text)

"""
Output:
---------------------------
[
{'word': 'glucocorticoid', 
'score': 0.9894881248474121, 
'entity': 'B-protein'}, 
 
{'word': 'receptor', 
'score': 0.989505410194397, 
'entity': 'I-protein'}, 

{'word': 'normal', 
'score': 0.7680378556251526, 
'entity': 'B-cell_type'}, 

{'word': 'cs', 
'score': 0.5176806449890137, 
'entity': 'I-cell_type'}, 

{'word': 'lymphocytes', 
'score': 0.9898491501808167, 
'entity': 'I-cell_type'}
]
"""
````
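
Note that the raw pipeline output is per word-piece token. With recent `transformers` versions you can ask the pipeline to merge consecutive word pieces into entity spans (the available strategy names depend on your installed version):

````python
nlp_ner = pipeline("ner",
                   model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
                   tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
                   aggregation_strategy="simple")  # group word pieces into entity spans
````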
Or load the model and tokenizer as follows:
````python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Example
text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."

# Load model
tokenizer = AutoTokenizer.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
model = AutoModelForTokenClassification.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")

# Get input for BERT
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)

# Predict
with torch.no_grad():
  outputs = model(input_ids)

# Take the logits (first element of the model output) and, for each token,
# pick the class with the highest score. Then drop the [CLS] and [SEP] tokens (first and last).
predictions = outputs[0].argmax(dim=-1)[0][1:-1]

# Map label class indexes to string labels.
for token, pred in zip(tokenizer.tokenize(text), predictions):
  print(token, '->', model.config.id2label[pred.numpy().item()])

"""
Output:
---------------------------
mouse -> O
thymus -> O
was -> O
used -> O
as -> O
a -> O
source -> O
of -> O
glucocorticoid -> B-protein
receptor -> I-protein
from -> O
normal -> B-cell_type
cs -> I-cell_type
lymphocytes -> I-cell_type
. -> O
"""
````