---
title: relation_extraction
datasets:
- none
tags:
- evaluate
- metric
description: >-
  This metric is used for evaluating the F1 score of input references and predictions.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
license: apache-2.0
---

# Metric Card for relation_extraction

This metric evaluates the quality of relation extraction output by calculating the micro and macro F1 scores of the extracted relations.

## Metric Description

This metric computes and returns scoring metrics for the prediction model based on the specified mode, including precision, recall, and F1 score. It evaluates the model's predictions against the provided reference data.

## How to Use

```python
import evaluate

metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
```

### Inputs

- **predictions** (`list` of `list` of `dict`): the relations predicted by the model, one list of relation dictionaries per example.
- **references** (`list` of `list` of `dict`): the ground-truth or reference relations to compare the predictions against, in the same format.
- **mode** (`str`, optional): evaluation mode, either `strict` or `boundaries`. Defaults to `strict`. `strict` mode takes both the entity types and their relationship into account, while `boundaries` mode only considers the entity spans of the relationship (see the sketch after Example 2).
- **detailed_scores** (`bool`, optional): defaults to `False`. If `True`, returns the scores for each relation type separately; if `False`, returns the overall scores.
- **relation_types** (`list`, optional): defaults to `[]`. A list of relation types to consider while evaluating. If not provided, the relation types are constructed from the ground-truth or reference data.

### Output Values

**output** (`dict` of `dict`): a dictionary mapping each relation type to its scoring metrics, such as precision, recall, and F1 score.

- **ALL** (`dict`): scores over all relation types
  - **tp**: true positive count
  - **fp**: false positive count
  - **fn**: false negative count
  - **p**: precision
  - **r**: recall
  - **f1**: micro F1 score
  - **Macro_f1**: macro F1 score
  - **Macro_p**: macro precision
  - **Macro_r**: macro recall
- **{selected relation type}** (`dict`): scores for the selected relation type
  - **tp**: true positive count
  - **fp**: false positive count
  - **fn**: false negative count
  - **p**: precision
  - **r**: recall
  - **f1**: micro F1 score

Output example:

```python
{'tp': 1, 'fp': 1, 'fn': 1, 'p': 50.0, 'r': 50.0, 'f1': 50.0, 'Macro_f1': 50.0, 'Macro_p': 50.0, 'Macro_r': 50.0}
```

Note: `Macro_f1`, `Macro_p`, `Macro_r`, `p`, `r`, and `f1` are percentages between 0 and 100, while the values of `tp`, `fp`, and `fn` depend on the number of input relations.
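For intuition, the micro scores (`p`, `r`, `f1`) pool the counts over all relation types before scoring, while the macro scores average the per-type metrics. Below is a minimal sketch of these standard definitions (an illustration only, not this metric's actual implementation):

```python
def prf(tp, fp, fn):
    # Precision, recall, and F1 as percentages, with zero-division guards.
    p = 100 * tp / (tp + fp) if tp + fp > 0 else 0
    r = 100 * tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0
    return p, r, f1

def aggregate(per_type):
    # per_type maps each relation type to its (tp, fp, fn) counts,
    # e.g. {"sell": (3, 1, 0), "belongs_to": (0, 0, 1)}.
    # Micro: pool the counts over all relation types, then score once.
    tp, fp, fn = (sum(c[i] for c in per_type.values()) for i in range(3))
    micro = prf(tp, fp, fn)
    # Macro: score each relation type separately, then average the scores.
    per_type_scores = [prf(*c) for c in per_type.values()]
    macro = tuple(sum(s[i] for s in per_type_scores) / len(per_type_scores) for i in range(3))
    return micro, macro  # ((p, r, f1), (Macro_p, Macro_r, Macro_f1))
```

With the per-type counts from Example 2 below (`sell`: tp=3, fp=1, fn=0; `belongs_to`: tp=0, fp=0, fn=1), this sketch reproduces the documented `ALL` scores.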
### Examples

Example 1: a single prediction and reference pair.

```python
metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
print(scores)
>>> {'tp': 1, 'fp': 1, 'fn': 2, 'p': 50.0, 'r': 33.333333333333336, 'f1': 40.0, 'Macro_f1': 25.0, 'Macro_p': 25.0, 'Macro_r': 25.0}
```

Example 2: two or more predictions and references, returning the scores of every relation type (`detailed_scores=True`).

```python
metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=[])
print(scores)
>>> {'sell': {'tp': 3, 'fp': 1, 'fn': 0, 'p': 75.0, 'r': 100.0, 'f1': 85.71428571428571}, 'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 3, 'fp': 1, 'fn': 1, 'p': 75.0, 'r': 75.0, 'f1': 75.0, 'Macro_f1': 42.857142857142854, 'Macro_p': 37.5, 'Macro_r': 50.0}}
```
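Because Example 2 runs in `boundaries` mode, the first prediction still counts as a true positive even though its `head_type` ("product" instead of "brand") is wrong. Conceptually, the two modes compare different tuples; a minimal sketch of the distinction (a hypothetical helper, not the metric's internals):

```python
def comparison_key(rel: dict, mode: str = "strict"):
    # strict: entity spans, entity types, and the relation type must all match.
    if mode == "strict":
        return (rel["head"], rel["head_type"], rel["tail"], rel["tail_type"], rel["type"])
    # boundaries: entity types are ignored; only spans and relation type count.
    return (rel["head"], rel["tail"], rel["type"])
```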
Example 3: two or more predictions and references, considering only the relation type `"belongs_to"`.

```python
metric = evaluate.load("Ikala-allen/relation_extraction")

references = [
    [
        {"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
    ]
]

predictions = [
    [
        {"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
        {"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
    ],
    [
        {'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
        {'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}
    ]
]

scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=["belongs_to"])
print(scores)
>>> {'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0, 'Macro_f1': 0.0, 'Macro_p': 0.0, 'Macro_r': 0.0}}
```

## Limitations and Bias

This metric has two modes, `strict` and `boundaries`, and offers multiple `relation_types` to choose from. Ensure you choose appropriate evaluation parameters, as they can significantly affect the F1 score.

The entity fields (`head`, `tail`, `head_type`, `tail_type`) in a prediction must match the reference exactly, disregarding case and spaces (see the sketch at the end of this card). If a prediction does not match any reference, it is counted as a false positive (`fp`); an unmatched reference is counted as a false negative (`fn`).

## Citation

```bibtex
@misc{taille2020stop,
  author = {Taillé, Bruno and Guigue, Vincent and Scoutheeten, Geoffrey and Gallinari, Patrick},
  title  = {Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!},
  year   = {2020},
  url    = {https://arxiv.org/abs/2009.10684}
}
```

## Further References

This evaluation metric is adapted from https://github.com/btaille/sincere/blob/master/code/utils/evaluation.py.
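As a footnote to the matching rule described under Limitations and Bias, a hypothetical sketch of what "exact match, disregarding case and spaces" means in practice (`normalize` is an illustrative helper, not part of this metric's API):

```python
def normalize(value: str) -> str:
    # Entity fields are compared exactly, but case and spaces are ignored.
    return value.replace(" ", "").lower()

# e.g. these two spellings would be treated as the same entity span:
assert normalize("Tina Davies Pigments") == normalize("tinadaviespigments")
```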