---
title: relation_extraction
datasets:
- none
tags:
- evaluate
- metric
description: >-
  This metric is used for evaluating the F1 accuracy of input references and
  predictions.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
license: apache-2.0
---

# Metric Card for relation_extraction
This metric evaluates the quality of relation extraction output by computing the micro and macro F1 scores (along with precision and recall) of the predicted relations against the references.


## Metric Description
Depending on the mode specified, this metric computes and returns several scores, including precision, recall, and micro/macro F1, by evaluating the model's predicted relations against the provided reference relations.

## How to Use
```python
import evaluate
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phip igments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
```
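With this input, `scores` should match the dictionary shown under Output Example below.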

### Inputs
- **predictions** (`list` of `list` of `dictionary`): A list of lists of dictionaries; each inner list contains the relations predicted by the model for one example.
- **references** (`list` of `list` of `dictionary`): A list of lists of dictionaries; each inner list contains the ground-truth (reference) relations for the corresponding example.
- **mode** (`str`, Optional): Evaluation mode, either `strict` or `boundaries`. Defaults to `strict`. `strict` mode requires both the entity types and the entity spans of a relation to match, while `boundaries` mode only considers the entity spans.
- **detailed_scores** (`bool`, Optional): Defaults to `False`. If `True`, returns scores for each relation type separately; if `False`, returns only the overall scores.
- **relation_types** (`list`, Optional): Defaults to `[]`. A list of relation types to consider during evaluation. If not provided, the relation types are taken from the reference data.
 
### Output Values

**output** (`dictionary`): A dictionary of scoring metrics such as precision, recall, and F1 score. With `detailed_scores=True`, it maps each relation type, plus the aggregate key `ALL`, to its own score dictionary; with `detailed_scores=False`, only the aggregate scores are returned.
- **ALL** (`dictionary`): scores aggregated over all relation types
  - **tp** : true positive count
  - **fp** : false positive count
  - **fn** : false negative count
  - **p** : precision
  - **r** : recall
  - **f1** : micro f1 score
  - **Macro_f1** : macro f1 score
  - **Macro_p** : macro precision
  - **Macro_r** : macro recall
- **{relation type}** (`dictionary`): scores for that relation type (present when `detailed_scores=True`)
  - **tp** : true positive count
  - **fp** : false positive count
  - **fn** : false negative count
  - **p** : precision
  - **r** : recall
  - **f1** : micro f1 score
 
Output Example:
```python
{'tp': 1, 'fp': 1, 'fn': 1, 'p': 50.0, 'r': 50.0, 'f1': 50.0, 'Macro_f1': 50.0, 'Macro_p': 50.0, 'Macro_r': 50.0}
```

Note: `Macro_f1`, `Macro_p`, `Macro_r`, `p`, `r`, and `f1` are percentages between 0 and 100. The values of `tp`, `fp`, and `fn` depend on the number of input relations.
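For reference, the micro scores (`p`, `r`, `f1`) follow the standard precision/recall/F1 definitions over the pooled `tp`/`fp`/`fn` counts, and the macro scores are unweighted averages of the per-type values. A minimal sketch of the micro calculation (illustrative only, not the metric's actual implementation) that reproduces the Output Example above:

```python
def micro_scores(tp, fp, fn):
    # Precision, recall, and F1 in percent, matching the metric's 0-100 scale.
    p = 100 * tp / (tp + fp) if tp + fp > 0 else 0
    r = 100 * tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0
    return {"p": p, "r": r, "f1": f1}

print(micro_scores(tp=1, fp=1, fn=1))
# {'p': 50.0, 'r': 50.0, 'f1': 50.0}
```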

### Examples
Example 1: a single prediction list and reference list.
```python 
import evaluate
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
print(scores)
>>> {'tp': 1, 'fp': 1, 'fn': 2, 'p': 50.0, 'r': 33.333333333333336, 'f1': 40.0, 'Macro_f1': 25.0, 'Macro_p': 25.0, 'Macro_r': 25.0}
```
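In `strict` mode the first prediction is a false positive because its `head_type` (`product`) does not match the reference (`brand`); that unmatched reference and the `belongs_to` reference account for the two false negatives, and only the `tinadaviespigments` relation is a true positive.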

Example 2: multiple prediction and reference lists. Return the scores for every relation type (`detailed_scores=True`).
```python
import evaluate
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}  
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=[])
print(scores)
>>> {'sell': {'tp': 3, 'fp': 1, 'fn': 0, 'p': 75.0, 'r': 100.0, 'f1': 85.71428571428571}, 'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 3, 'fp': 1, 'fn': 1, 'p': 75.0, 'r': 75.0, 'f1': 75.0, 'Macro_f1': 42.857142857142854, 'Macro_p': 37.5, 'Macro_r': 50.0}}
```
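The macro scores are the unweighted means of the per-type values, e.g. `Macro_f1` = (85.71 + 0) / 2 ≈ 42.86, whereas the micro scores under `ALL` are computed from the pooled counts (tp=3, fp=1, fn=1).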

Example 3: multiple prediction and reference lists, returning per-type scores but considering only relations of type `belongs_to`.
```python
import evaluate
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}  
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=["belongs_to"])
print(scores)  
>>> {'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0, 'Macro_f1': 0.0, 'Macro_p': 0.0, 'Macro_r': 0.0}}
```
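Because `relation_types=["belongs_to"]` is passed, the `sell` relations are ignored entirely; the single `belongs_to` reference has no matching prediction, so it counts as a false negative and all scores are 0.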

## Limitations and Bias
There are two modes in this metric, `strict` and `boundaries`, and it accepts an optional list of `relation_types` to restrict the evaluation. Choose these parameters carefully, as they can significantly affect the resulting F1 scores.
The entity fields (`head`, `tail`, `head_type`, `tail_type`) of a prediction must match those of a reference exactly, disregarding case and spaces. A prediction that matches no reference is counted as a false positive (`fp`), and a reference with no matching prediction is counted as a false negative (`fn`).
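In other words, entity strings are compared after a simple normalization, conceptually along the lines of the sketch below (illustrative only; the metric's actual matching code may differ):

```python
def normalize(entity: str) -> str:
    # Lowercase and strip spaces before comparison, per the matching rule above.
    return entity.replace(" ", "").lower()

normalize("Phip Igments") == normalize("phipigments")  # True
```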

## Citation
```bibtex
@misc{taille2020incorrect,
    author = {Taill{\'e}, Bruno and Guigue, Vincent and Scoutheeten, Geoffrey and Gallinari, Patrick},
    title = {Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!},
    year = {2020},
    url = {https://arxiv.org/abs/2009.10684}
}
```
## Further References
This evaluation metric is adapted from https://github.com/btaille/sincere/blob/master/code/utils/evaluation.py.