---
title: relation_extraction
datasets:
- none
tags:
- evaluate
- metric
description: >-
  This metric is used for evaluating the F1 accuracy of input references and
  predictions.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
license: apache-2.0
---
# Metric Card for relation_extraction

This metric evaluates the quality of relation extraction output by computing micro and macro F1 scores of the predicted relations against the reference relations.
## Metric Description

This metric computes precision, recall, and F1 scores for a relation extraction model based on the specified evaluation mode. It compares the model's predicted relations against the provided reference relations, both overall and per relation type.
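As background for how the reported numbers relate to each other, the following is a minimal sketch of the standard micro and macro F1 formulas from per-type true positive / false positive / false negative counts, reported as percentages like this metric's output. It is an illustration of the formulas, not the metric's internal implementation; the counts used at the end match Example 2 further down and reproduce its aggregate numbers.

```python
# Sketch of standard micro / macro F1 from per-type tp/fp/fn counts.
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def micro_macro_f1(counts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """counts maps relation type -> {"tp": ..., "fp": ..., "fn": ...}."""
    # Micro: pool the counts over all relation types, then compute P/R/F1 once.
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    micro_p, micro_r, micro_f1 = prf(tp, fp, fn)

    # Macro: compute P/R/F1 per relation type, then average the per-type scores.
    per_type = [prf(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    macro_p = sum(s[0] for s in per_type) / len(per_type)
    macro_r = sum(s[1] for s in per_type) / len(per_type)
    macro_f1 = sum(s[2] for s in per_type) / len(per_type)

    return {
        "p": micro_p * 100, "r": micro_r * 100, "f1": micro_f1 * 100,
        "Macro_p": macro_p * 100, "Macro_r": macro_r * 100, "Macro_f1": macro_f1 * 100,
    }


# The per-type counts from Example 2 below reproduce its aggregate numbers.
print(micro_macro_f1({
    "sell": {"tp": 3, "fp": 1, "fn": 0},
    "belongs_to": {"tp": 0, "fp": 0, "fn": 1},
}))
```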
## How to Use

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phip igments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
```
### Inputs
- **predictions** (`list` of `list` of `dict`): Predicted relations, one inner list of relation dictionaries per example.
- **references** (`list` of `list` of `dict`): Ground-truth (reference) relations to compare the predictions against, in the same format as `predictions`.
- **mode** (`str`, optional): Evaluation mode, either `strict` or `boundaries`. Defaults to `strict`. `strict` mode requires both the entity types and the entity spans of a relation to match, while `boundaries` mode only considers the entity spans (see the sketch after this list).
- **detailed_scores** (`bool`, optional): Defaults to `False`. If `True`, returns scores for each relation type separately; if `False`, returns only the overall scores.
- **relation_types** (`list`, optional): Defaults to `[]`. A list of relation types to consider during evaluation. If not provided, the relation types are collected from the reference data.
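To make the difference between the two modes concrete, here is a minimal sketch using the same loading call as above. The prediction's `head_type` disagrees with the reference, so per the mode descriptions it only counts as a match under `boundaries` mode.

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")

references = [[
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
]]
predictions = [[
    # Same head/tail spans and relation type, but the head_type is wrong.
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
]]

# strict: entity types must also match, so this counts as a false positive and a false negative.
print(metric.compute(predictions=predictions, references=references, mode="strict"))

# boundaries: entity types are ignored, so the same relation counts as a true positive.
print(metric.compute(predictions=predictions, references=references, mode="boundaries"))
```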
### Output Values

**output** (`dict` of `dict`): When `detailed_scores=True`, a dictionary mapping each relation type (plus the aggregate key `ALL`) to its scoring metrics such as precision, recall, and F1 score; when `detailed_scores=False`, only the aggregate scores are returned as a single flat dictionary.

- **ALL** (`dict`): aggregate scores over all relation types
  - **tp** : true positive count
  - **fp** : false positive count
  - **fn** : false negative count
  - **p** : precision
  - **r** : recall
  - **f1** : micro F1 score
  - **Macro_f1** : macro F1 score
  - **Macro_p** : macro precision
  - **Macro_r** : macro recall
- **{selected relation type}** (`dict`): scores for that relation type (returned when `detailed_scores=True`)
  - **tp** : true positive count
  - **fp** : false positive count
  - **fn** : false negative count
  - **p** : precision
  - **r** : recall
  - **f1** : micro F1 score
Output Example:

```python
{'tp': 1, 'fp': 1, 'fn': 1, 'p': 50.0, 'r': 50.0, 'f1': 50.0, 'Macro_f1': 50.0, 'Macro_p': 50.0, 'Macro_r': 50.0}
```

Note: `p`, `r`, `f1`, `Macro_p`, `Macro_r`, and `Macro_f1` are reported as percentages between 0 and 100. The values of `tp`, `fp`, and `fn` depend on the number of input relations.
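For reference, the snippet below shows how to read the aggregate and per-type scores out of a detailed result; the dictionary literal mirrors the Example 2 output further down.

```python
# Shape of a detailed result (detailed_scores=True); values copied from Example 2 below.
scores = {
    "sell": {"tp": 3, "fp": 1, "fn": 0, "p": 75.0, "r": 100.0, "f1": 85.71428571428571},
    "belongs_to": {"tp": 0, "fp": 0, "fn": 1, "p": 0, "r": 0, "f1": 0},
    "ALL": {"tp": 3, "fp": 1, "fn": 1, "p": 75.0, "r": 75.0, "f1": 75.0,
            "Macro_f1": 42.857142857142854, "Macro_p": 37.5, "Macro_r": 50.0},
}

# Aggregate scores live under the "ALL" key; per-type scores under the relation type name.
print(f"micro F1: {scores['ALL']['f1']:.2f}")        # 75.00
print(f"macro F1: {scores['ALL']['Macro_f1']:.2f}")  # 42.86
print(f"sell F1:  {scores['sell']['f1']:.2f}")       # 85.71
```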
### Examples

Example 1: Only one prediction and reference list.

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
print(scores)
>>> {'tp': 1, 'fp': 1, 'fn': 2, 'p': 50.0, 'r': 33.333333333333336, 'f1': 40.0, 'Macro_f1': 25.0, 'Macro_p': 25.0, 'Macro_r': 25.0}
```
Example 2: Two or more predictions and references. Returns scores for every relation type (`detailed_scores=True`).

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=[])
print(scores)
>>> {'sell': {'tp': 3, 'fp': 1, 'fn': 0, 'p': 75.0, 'r': 100.0, 'f1': 85.71428571428571}, 'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 3, 'fp': 1, 'fn': 1, 'p': 75.0, 'r': 75.0, 'f1': 75.0, 'Macro_f1': 42.857142857142854, 'Macro_p': 37.5, 'Macro_r': 50.0}}
```
Example 3: Two or more predictions and references. Returns detailed scores, but only the `belongs_to` relation type is evaluated (`relation_types=["belongs_to"]`).

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=["belongs_to"])
print(scores)
>>> {'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0, 'Macro_f1': 0.0, 'Macro_p': 0.0, 'Macro_r': 0.0}}
```
## Limitations and Bias

This metric has two modes, `strict` and `boundaries`, and lets you restrict the evaluation to selected `relation_types`. Choose these parameters carefully, as they can significantly affect the resulting F1 score.

The entity fields (`head`, `tail`, `head_type`, `tail_type`) of a predicted relation must match the reference exactly, disregarding case and spaces. A prediction that does not match any reference is counted as a false positive (`fp`), and an unmatched reference is counted as a false negative (`fn`).
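As an illustration of the matching rule above (a sketch of the documented behaviour, not a separate normalization option), a prediction that differs only by case or spacing should still count as a true positive:

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")

references = [[
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
]]
predictions = [[
    # Differs only in case and spacing, which the matching disregards,
    # so this is still counted as a true positive.
    {"head": "Phi Pigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
]]

print(metric.compute(predictions=predictions, references=references, mode="strict"))
```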
## Citation

```bibtex
@misc{taille2020incorrectcomparisons,
  author = {Bruno Taillé and Vincent Guigue and Geoffrey Scoutheeten and Patrick Gallinari},
  title  = {Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!},
  year   = {2020},
  url    = {https://arxiv.org/abs/2009.10684}
}
```
## Further References

This evaluation metric is adapted from
*https://github.com/btaille/sincere/blob/master/code/utils/evaluation.py*