---
title: relation_extraction
datasets:
  - none
tags:
  - evaluate
  - metric
description: >-
  This metric is used for evaluating the F1 accuracy of input references and
  predictions.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
license: apache-2.0
---

# Metric Card for relation_extraction

This metric evaluates the quality of relation extraction output by computing micro and macro F1 scores over the predicted relations.

## Metric Description

Depending on the specified mode, this metric computes precision, recall, and F1 scores (micro and macro) for the prediction model by evaluating its predicted relations against the provided reference data.

## How to Use

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
```

### Inputs

- **predictions** (list of lists of dictionaries): The relations predicted by the model; each inner list holds the relation dictionaries for one example.
- **references** (list of lists of dictionaries): The ground-truth (reference) relations to compare the predictions against, in the same format as `predictions`.
- **mode** (str, optional): Evaluation mode, either `strict` or `boundaries`. Defaults to `strict`. `strict` mode takes both the entity types and the entity spans of a relation into account, while `boundaries` mode only considers the entity spans (see the sketch after this list).
- **detailed_scores** (bool, optional): Defaults to `False`. If `True`, scores are returned for each relation type separately; if `False`, only the overall scores are returned.
- **relation_types** (list, optional): Defaults to `[]`. A list of relation types to consider during evaluation. If not provided, the relation types are collected from the reference data.
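To make the difference between the two modes concrete, here is a minimal sketch (not the metric's actual code) of what each mode compares for a single relation:

```python
# Hypothetical sketch of what "strict" and "boundaries" matching compare.
def as_strict_key(rel):
    # strict: entity spans, entity types, and the relation type must all match
    return (rel["head"], rel["head_type"], rel["type"], rel["tail"], rel["tail_type"])

def as_boundaries_key(rel):
    # boundaries: only the entity spans and the relation type must match
    return (rel["head"], rel["type"], rel["tail"])

reference  = {"head": "phipigments", "head_type": "brand", "type": "sell",
              "tail": "國際認證之色乳", "tail_type": "product"}
prediction = {"head": "phipigments", "head_type": "product", "type": "sell",
              "tail": "國際認證之色乳", "tail_type": "product"}

print(as_strict_key(prediction) == as_strict_key(reference))          # False: head_type differs
print(as_boundaries_key(prediction) == as_boundaries_key(reference))  # True
```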

### Output Values

**output** (dictionary): With `detailed_scores=True`, a dictionary mapping each relation type (plus `ALL`) to its scoring metrics such as precision, recall, and F1 score; with `detailed_scores=False`, a single dictionary containing only the overall scores.

- `ALL` (dictionary): scores aggregated over all relation types (see the access sketch after this list)
  - `tp`: true positive count
  - `fp`: false positive count
  - `fn`: false negative count
  - `p`: precision
  - `r`: recall
  - `f1`: micro F1 score
  - `Macro_f1`: macro F1 score
  - `Macro_p`: macro precision
  - `Macro_r`: macro recall
- `{selected relation type}` (dictionary): scores for that relation type
  - `tp`: true positive count
  - `fp`: false positive count
  - `fn`: false negative count
  - `p`: precision
  - `r`: recall
  - `f1`: micro F1 score
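As a sketch of how you might read individual numbers out of this structure (the example dictionaries below are abbreviated from the outputs shown in this card):

```python
# With detailed_scores=False the overall scores are a flat dictionary:
scores = {"tp": 1, "fp": 1, "fn": 1, "p": 50.0, "r": 50.0, "f1": 50.0,
          "Macro_f1": 50.0, "Macro_p": 50.0, "Macro_r": 50.0}
micro_f1 = scores["f1"]

# With detailed_scores=True, per-type scores sit under their relation-type keys
# and the aggregate under "ALL" (as in Example 2 below):
detailed = {"sell": {"f1": 85.7}, "belongs_to": {"f1": 0}, "ALL": {"f1": 75.0}}
sell_f1 = detailed["sell"]["f1"]
overall_f1 = detailed["ALL"]["f1"]
print(micro_f1, sell_f1, overall_f1)
```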

Output Example:

```python
{'tp': 1, 'fp': 1, 'fn': 1, 'p': 50.0, 'r': 50.0, 'f1': 50.0, 'Macro_f1': 50.0, 'Macro_p': 50.0, 'Macro_r': 50.0}
```
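These percentages follow from the counts via the standard precision/recall/F1 definitions, which you can verify by hand:

```python
# Recomputing the example output above from its counts.
tp, fp, fn = 1, 1, 1

p = tp / (tp + fp) * 100   # precision -> 50.0
r = tp / (tp + fn) * 100   # recall    -> 50.0
f1 = 2 * p * r / (p + r)   # micro F1  -> 50.0
print(p, r, f1)
```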

Note: `Macro_f1`, `Macro_p`, `Macro_r`, `p`, `r`, and `f1` are percentages between 0 and 100. The values of `tp`, `fp`, and `fn` depend on the number of input relations.

### Examples

Example 1: A single prediction list and reference list. In `strict` mode the first prediction's `head_type` (`product`) differs from the reference (`brand`), so it counts as a false positive and the corresponding reference as a false negative; the unmatched `belongs_to` reference adds a second false negative.

```python
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
print(scores)
>>> {'tp': 1, 'fp': 1, 'fn': 2, 'p': 50.0, 'r': 33.333333333333336, 'f1': 40.0, 'Macro_f1': 25.0, 'Macro_p': 25.0, 'Macro_r': 25.0}
```

Example 2: Multiple prediction and reference lists, evaluated in `boundaries` mode with `detailed_scores=True`, so scores are reported for every relation type as well as overall.

```python
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=[])
print(scores)
>>> {'sell': {'tp': 3, 'fp': 1, 'fn': 0, 'p': 75.0, 'r': 100.0, 'f1': 85.71428571428571}, 'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 3, 'fp': 1, 'fn': 1, 'p': 75.0, 'r': 75.0, 'f1': 75.0, 'Macro_f1': 42.857142857142854, 'Macro_p': 37.5, 'Macro_r': 50.0}}
```

Example 3: Same data as Example 2, but with `relation_types=["belongs_to"]`, so only the `belongs_to` relation type is evaluated.

```python
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
  [
    {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
  ]
]
predictions = [
  [
    {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
  ],
  [
    {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
    {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
  ]
]
scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=["belongs_to"])
print(scores)
>>> {'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0, 'Macro_f1': 0.0, 'Macro_p': 0.0, 'Macro_r': 0.0}}
```

## Limitations and Bias

This metric has two modes, `strict` and `boundaries`, and lets you restrict evaluation to specific `relation_types`. Choose these parameters carefully, as they can significantly affect the F1 score. The entity fields (`head`, `tail`, `head_type`, `tail_type`) of a prediction must match the reference exactly, disregarding case and spaces; a prediction that does not match any reference is counted as a false positive, and an unmatched reference as a false negative.
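As a rough illustration (not the metric's actual implementation), the case- and space-insensitive matching described above behaves as if entity strings were normalized before comparison:

```python
def normalize(entity: str) -> str:
    # Case and spaces are disregarded, so "Phi Pigments" and "phipigments"
    # are treated as the same entity span.
    return entity.replace(" ", "").lower()

print(normalize("Phi Pigments") == normalize("phipigments"))  # True
print(normalize("SABONTAIWAN") == normalize("SNTAIWAN"))      # False: other differences still count
```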

## Citation

```bibtex
@misc{taille2020sincere,
  author = {Bruno Taillé and Vincent Guigue and Geoffrey Scoutheeten and Patrick Gallinari},
  title  = {Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!},
  year   = {2020},
  url    = {https://arxiv.org/abs/2009.10684}
}
```

## Further References

This evaluation metric is adapted from https://github.com/btaille/sincere/blob/master/code/utils/evaluation.py.