Zongxia committed on
Commit 3b4e6ce • 1 Parent(s): fe3a7bc

Update README.md

Files changed (1)
  1. README.md +22 -10
README.md CHANGED
@@ -11,8 +11,7 @@ pipeline_tag: text-classification
 ---
 # QA-Evaluation-Metrics
 
- [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
-
+ [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
 
 QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), a matching method that goes beyond token-level matching, is more efficient than LLM-based matching, and still retains the competitive evaluation performance of transformer LLM models.
 
@@ -33,23 +32,29 @@ The python package currently provides four QA evaluation metrics.
 ```python
 from qa_metrics.em import em_match
 
- reference_answer = ["Charles , Prince of Wales"]
- candidate_answer = "Prince Charles"
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
+ candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
 match_result = em_match(reference_answer, candidate_answer)
 print("Exact Match: ", match_result)
+ '''
+ Exact Match: False
+ '''
 ```
 
 #### Transformer Match
- Our fine-tuned BERT model is on 🤗 [Huggingface](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). Our Package also supports downloading and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are also supported now! 🔥🔥🔥
+ Our fine-tuned BERT model is in this repository. Our package also supports downloading and matching directly. distilroberta, distilbert, and roberta are also supported now! 🔥🔥🔥
 
 ```python
 from qa_metrics.transformerMatcher import TransformerMatcher
 
- question = "who will take the throne after the queen dies"
+ question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
 tm = TransformerMatcher("bert")
 scores = tm.get_scores(reference_answer, candidate_answer, question)
 match_result = tm.transformer_match(reference_answer, candidate_answer, question)
- print("Score: %s; CF Match: %s" % (scores, match_result))
+ print("Score: %s; TM Match: %s" % (scores, match_result))
+ '''
+ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.88954514}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.9381995}}; TM Match: True
+ '''
 ```
 
  #### F1 Score
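The Transformer Match calls above also work with the other backbones the updated text lists (distilroberta, distilbert, roberta). Below is a minimal, self-contained sketch of that; the "distilroberta" model string passed to the constructor is an assumption based on that list, since this commit only shows "bert" being used.

```python
# Minimal sketch: Transformer Match with an alternative backbone.
# Assumption: TransformerMatcher accepts "distilroberta" (one of the models the
# README lists as supported) the same way it accepts "bert", with weights
# downloaded automatically on first use.
from qa_metrics.transformerMatcher import TransformerMatcher

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"

tm = TransformerMatcher("distilroberta")  # assumed model string
scores = tm.get_scores(reference_answer, candidate_answer, question)
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
print("Score: %s; TM Match: %s" % (scores, match_result))
```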
@@ -61,17 +66,24 @@ print("F1 stats: ", f1_stats)
 
 match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
 print("F1 Match: ", match_result)
+ '''
+ F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
+ F1 Match: False
+ '''
 ```
 
 #### PANDA
 ```python
 from qa_metrics.cfm import CFMatcher
 
- question = "who will take the throne after the queen dies"
+ question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
 cfm = CFMatcher()
 scores = cfm.get_scores(reference_answer, candidate_answer, question)
 match_result = cfm.cf_match(reference_answer, candidate_answer, question)
- print("Score: %s; bert Match: %s" % (scores, match_result))
+ print("Score: %s; PD Match: %s" % (scores, match_result))
+ '''
+ Score: {'the frog prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7131625951317375}, 'the princess and the frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.854451712151719}}; PD Match: True
+ '''
 ```
 
  If you find this repo avialable, please cite our paper:
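As the printed outputs above show, both TransformerMatcher.get_scores and CFMatcher.get_scores return a nested {reference: {candidate: score}} dictionary. The helper below, which is illustrative and not part of the qa-metrics package, reduces that structure to the single best-scoring pair.

```python
# Illustrative helper (not part of qa-metrics): pick the highest-scoring
# (reference, candidate, score) triple from the nested dict returned by
# TransformerMatcher.get_scores or CFMatcher.get_scores.
from qa_metrics.cfm import CFMatcher

def best_pair(scores):
    best = None
    for reference, candidates in scores.items():
        for candidate, score in candidates.items():
            if best is None or score > best[2]:
                best = (reference, candidate, score)
    return best

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"

cfm = CFMatcher()
scores = cfm.get_scores(reference_answer, candidate_answer, question)
print(best_pair(scores))  # per the output above, the 'the princess and the frog' entry scores highest
```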
@@ -88,7 +100,7 @@ If you find this repo avialable, please cite our paper:
 
 
 ## Updates
- - [01/24/24] 🔥 The full paper is uploaded and can be accessed [here]([https://arxiv.org/abs/2310.14566](https://arxiv.org/abs/2401.13170)). The dataset is expanded and leaderboard is updated.
+ - [01/24/24] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2402.11161). The dataset is expanded and leaderboard is updated.
 - Our Training Dataset is adapted and augmented from [Bulian et al](https://github.com/google-research-datasets/answer-equivalence-dataset). Our [dataset repo](https://github.com/zli12321/Answer_Equivalence_Dataset.git) includes the augmented training set and QA evaluation testing sets discussed in our paper.
 - Now our model supports [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), a smaller and more robust matching model than Bert!
 
 
11
  ---
12
  # QA-Evaluation-Metrics
13
 
14
+ [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
 
15
 
16
  QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), a matching method going beyond token-level matching and is more efficient than LLM matchings but still retains competitive evaluation performance of transformer LLM models.
17
 
 
32
  ```python
33
  from qa_metrics.em import em_match
34
 
35
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
36
+ candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
37
  match_result = em_match(reference_answer, candidate_answer)
38
  print("Exact Match: ", match_result)
39
+ '''
40
+ Exact Match: False
41
+ '''
42
  ```
43
 
44
  #### Transformer Match
45
+ Our fine-tuned BERT model is this repository. Our Package also supports downloading and matching directly. distilroberta, distilbert, and roberta are also supported now! πŸ”₯πŸ”₯πŸ”₯
46
 
47
  ```python
48
  from qa_metrics.transformerMatcher import TransformerMatcher
49
 
50
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
51
  tm = TransformerMatcher("bert")
52
  scores = tm.get_scores(reference_answer, candidate_answer, question)
53
  match_result = tm.transformer_match(reference_answer, candidate_answer, question)
54
+ print("Score: %s; TM Match: %s" % (scores, match_result))
55
+ '''
56
+ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.88954514}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.9381995}}; TM Match: True
57
+ '''
58
  ```
59
 
60
  #### F1 Score
 
66
 
67
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
68
  print("F1 Match: ", match_result)
69
+ '''
70
+ F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
71
+ F1 Match: False
72
+ '''
73
  ```
74
 
75
  #### PANDA
76
  ```python
77
  from qa_metrics.cfm import CFMatcher
78
 
79
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
80
  cfm = CFMatcher()
81
  scores = cfm.get_scores(reference_answer, candidate_answer, question)
82
  match_result = cfm.cf_match(reference_answer, candidate_answer, question)
83
+ print("Score: %s; PD Match: %s" % (scores, match_result))
84
+ '''
85
+ Score: {'the frog prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7131625951317375}, 'the princess and the frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.854451712151719}}; PD Match: True
86
+ '''
87
  ```
88
 
89
  If you find this repo avialable, please cite our paper:
 
100
 
101
 
102
  ## Updates
103
+ - [01/24/24] πŸ”₯ The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2402.11161). The dataset is expanded and leaderboard is updated.
104
  - Our Training Dataset is adapted and augmented from [Bulian et al](https://github.com/google-research-datasets/answer-equivalence-dataset). Our [dataset repo](https://github.com/zli12321/Answer_Equivalence_Dataset.git) includes the augmented training set and QA evaluation testing sets discussed in our paper.
105
  - Now our model supports [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), a smaller and more robust matching model than Bert!
106
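Taken together, the examples in this update can be strung into a small batch-evaluation loop. The sketch below uses only the calls visible in the diff; the `from qa_metrics.f1 import f1_match` import path is an assumption, since the F1 import line sits outside the changed hunks.

```python
# Batch sketch: run the EM, F1, and PANDA checks from the README over a few QA
# triples. Assumption: f1_match is importable from qa_metrics.f1 (its import is
# not shown in this diff); em_match and CFMatcher are used exactly as above.
from qa_metrics.em import em_match
from qa_metrics.cfm import CFMatcher
from qa_metrics.f1 import f1_match  # assumed import path

examples = [
    {
        "question": "Which movie is loosely based off the Brother Grimm's Iron Henry?",
        "references": ["The Frog Prince", "The Princess and the Frog"],
        "candidate": "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\"",
    },
]

cfm = CFMatcher()
for ex in examples:
    em = em_match(ex["references"], ex["candidate"])
    f1 = f1_match(ex["references"], ex["candidate"], threshold=0.5)
    panda = cfm.cf_match(ex["references"], ex["candidate"], ex["question"])
    print("%s\n  EM: %s | F1 match: %s | PANDA match: %s" % (ex["question"], em, f1, panda))
```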