zli12321 committed on
Commit 6bc2a15 • 1 Parent(s): 602e1ed

Update README.md

Files changed (1)
  1. README.md +168 -142
README.md CHANGED
@@ -9,54 +9,59 @@ metrics:
  - bertscore
  pipeline_tag: text-classification
  ---
- # QA-Evaluation-Metrics

  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
- [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)

- QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models. It provides various basic and efficient metrics to assess the performance of QA models.

- ### Updates
- - Updated to version 0.2.17
- - Supports prompting OpenAI GPT-series and Claude-series models. (Assumes OpenAI version > 1.0.)
- - Supports prompting various open-source models such as LLaMA-2-70B-chat, LLaVA-1.5, etc. by calling the API from [deepinfra](https://deepinfra.com/models).
- - Added a trained tiny-bert for QA evaluation. The model size is 18 MB.
- - Pass a Huggingface repository name to download the model directly for TransformerMatcher.

- ## Installation
- * Python version >= 3.6
- * openai version >= 1.0

- To install the package, run the following command:

  ```bash
  pip install qa-metrics
  ```

- ## Usage/Logistics

- The Python package currently provides six QA evaluation methods.
- - Given a set of gold answers, a candidate answer to be evaluated, and a question (if applicable), the evaluation returns True if the candidate answer matches any one of the gold answers, and False otherwise.
- - Different evaluation methods apply different levels of strictness when judging the correctness of a candidate answer. Some correlate with human judgments better than others.
- - Normalized Exact Match and Question/Answer Type Evaluation are the most efficient methods. They are suitable for short-form QA datasets such as NQ-OPEN, HotpotQA, TriviaQA, SQuAD, etc.
- - Question/Answer Type Evaluation and Transformer Neural Evaluation are cost-free and suitable for both short-form and longer-form QA datasets. They correlate with human judgments better than exact match and F1 score when the gold and candidate answers become long.
- - Black-box LLM evaluations are closest to human evaluations, but they are not cost-free.

- ## Normalized Exact Match
- #### `em_match`

- Returns a boolean indicating whether there are any exact normalized matches between gold and candidate answers.

- **Parameters**

- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.

  **Returns**

- - `boolean`: True if the candidate answer matches any of the reference answers, False otherwise.

  ```python
  from qa_metrics.em import em_match
@@ -64,202 +69,223 @@ from qa_metrics.em import em_match
  reference_answer = ["The Frog Prince", "The Princess and the Frog"]
  candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
  match_result = em_match(reference_answer, candidate_answer)
- print("Exact Match: ", match_result)
- '''
- Exact Match: False
- '''
  ```

- ## F1 Score
- #### `f1_score_with_precision_recall`
-
- Calculates the F1 score, precision, and recall between a reference and a candidate answer.

  **Parameters**
-
- - `reference_answer` (str): A gold (correct) answer to the question.
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.

  **Returns**

- - `dictionary`: A dictionary containing the F1 score, precision, and recall between a gold and candidate answer.

  ```python
- from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
- print("F1 stats: ", f1_stats)
- '''
- F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
- '''
-
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
- print("F1 Match: ", match_result)
- '''
- F1 Match: False
- '''
  ```

- ## Efficient and Robust Question/Answer Type Evaluation
- #### 1. `get_highest_score`
-
- Returns the gold answer and candidate answer pair that has the highest matching score. This function is useful for finding the closest match to a given candidate response from a list of reference answers.

  **Parameters**
-
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
- - `question` (str): The question for which the answers are being evaluated.

  **Returns**

- - `dictionary`: A dictionary containing the gold answer and candidate answer that have the highest matching score.
-
- #### 2. `get_scores`

- Returns the matching scores of all gold answer and candidate answer pairs.

  **Parameters**
-
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
- - `question` (str): The question for which the answers are being evaluated.

  **Returns**

- - `dictionary`: A dictionary containing the gold answers and the candidate answer's matching scores.
-
- #### 3. `evaluate`

- Returns True if the candidate answer is a match of any of the gold answers.

  **Parameters**
-
- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
- - `question` (str): The question for which the answers are being evaluated.

  **Returns**

- - `boolean`: True if the candidate answer matches any of the reference answers, False otherwise.

  ```python
  from qa_metrics.pedant import PEDANT

- question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
  pedant = PEDANT()
  scores = pedant.get_scores(reference_answer, candidate_answer, question)
- max_pair, highest_scores = pedant.get_highest_score(reference_answer, candidate_answer, question)
  match_result = pedant.evaluate(reference_answer, candidate_answer, question)
- print("Max Pair: %s; Highest Score: %s" % (max_pair, highest_scores))
- print("Score: %s; PANDA Match: %s" % (scores, match_result))
- '''
- Max Pair: ('the princess and the frog', 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'); Highest Score: 0.854451712151719
- Score: {'the frog prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7131625951317375}, 'the princess and the frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.854451712151719}}; PANDA Match: True
- '''
- ```
-
- ```python
- print(pedant.get_score(reference_answer[1], candidate_answer, question))
- '''
- 0.7122460127464126
- '''
  ```

- ## Transformer Neural Evaluation
- Our fine-tuned BERT model is on 🤗 [Huggingface](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). Our package also supports downloading and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are also supported now! 🔥🔥🔥

- #### `transformer_match`

- Returns True if the candidate answer is a match of any of the gold answers.

  **Parameters**

- - `reference_answer` (list of str): A list of gold (correct) answers to the question.
- - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
- - `question` (str): The question for which the answers are being evaluated.

  **Returns**

- - `boolean`: True if the candidate answer matches any of the reference answers, False otherwise.

  ```python
  from qa_metrics.transformerMatcher import TransformerMatcher

- question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
- # Supported models: roberta-large, roberta, bert, distilbert, distilroberta
- tm = TransformerMatcher("zli12321/answer_equivalence_roberta")
- scores = tm.get_scores(reference_answer, candidate_answer, question)
  match_result = tm.transformer_match(reference_answer, candidate_answer, question)
- print("Score: %s; TM Match: %s" % (scores, match_result))
- '''
- Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.6934309}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7400551}}; TM Match: True
- '''
  ```

- ## Prompting LLM For Evaluation

- Note: The prompting function can be used for any prompting purposes.

- ###### OpenAI
  ```python
  from qa_metrics.prompt_llm import CloseLLM
  model = CloseLLM()
  model.set_openai_api_key(YOUR_OPENAI_KEY)
- prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
- model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
-
- '''
- 'correct'
- '''
  ```

- ###### Anthropic
  ```python
  model = CloseLLM()
- model.set_anthropic_api_key(YOUR_Anthropic_KEY)
- model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
-
- '''
- 'correct'
- '''
  ```

- ###### deepinfra (See below for descriptions of more models)
  ```python
  from qa_metrics.prompt_open_llm import OpenLLM
  model = OpenLLM()
  model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
- model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
-
- '''
- 'correct'
- '''
  ```

- If you find this repo useful, please cite our paper:

  ```bibtex
- @misc{li2024panda,
- title={PANDA (Pedantic ANswer-correctness Determination and Adjudication): Improving Automatic Evaluation for Question Answering and Text Generation},
  author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
  year={2024},
  eprint={2402.11161},
  archivePrefix={arXiv},
- primaryClass={cs.CL}
  }
  ```

- ## Updates
- - [01/24/24] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2402.11161). The dataset is expanded and the leaderboard is updated.
- - Our training dataset is adapted and augmented from [Bulian et al.](https://github.com/google-research-datasets/answer-equivalence-dataset). Our [dataset repo](https://github.com/zli12321/Answer_Equivalence_Dataset.git) includes the augmented training set and the QA evaluation test sets discussed in our paper.
- - Our model now supports [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta) and [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), smaller and more robust matching models than BERT!
-
- ## License
-
- This project is licensed under the [MIT License](LICENSE.md) - see the LICENSE file for details.

- ## Contact

- For any additional questions or comments, please contact [[email protected]].

  - bertscore
  pipeline_tag: text-classification
  ---
+ # QA-Evaluation-Metrics 📊

  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
+ [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing)

+ > Check out the main [Repo](https://github.com/zli12321/qa_metrics)

+ > A fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models.

+ ## 🎉 Latest Updates

+ - **Version 0.2.19 Released!**
+ - Paper accepted to EMNLP 2024 Findings! 🎓
+ - Enhanced PEDANTS with multi-pipeline support and improved edge case handling
+ - Added support for OpenAI GPT-series and Claude-series models (OpenAI version > 1.0)
+ - Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via [deepinfra](https://deepinfra.com/models)
+ - Introduced a trained tiny-bert for QA evaluation (18 MB model size)
+ - Added direct Huggingface model download support for TransformerMatcher

+ ## 🚀 Quick Start

+ ### Prerequisites
+ - Python >= 3.6
+ - openai >= 1.0

+ ### Installation
  ```bash
  pip install qa-metrics
  ```

+ ## 💡 Features

+ Our package offers six QA evaluation methods with varying strengths:

+ | Method | Best For | Cost | Correlation with Human Judgment |
+ |--------|----------|------|--------------------------------|
+ | Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
+ | PEDANTS | Both short & medium-form QA | Free | Very High |
+ | [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short & long-form QA | Free | High |
+ | [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High |
+ | Black-box LLM Evaluation | All QA types | Paid | Highest |

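The interfaces for each method are documented in the section below. As a quick orientation, here is a minimal sketch (not taken from the package documentation; the toy question and answer strings are illustrative only) that runs the two cheapest string-based metrics from the table:

```python
from qa_metrics.em import em_match
from qa_metrics.f1 import f1_score_with_precision_recall

gold_answers = ["Paris"]                           # acceptable gold answers
candidate = "The capital of France is Paris."      # model output to be judged

# Normalized exact match: boolean, the strictest of the free metrics.
print(em_match(gold_answers, candidate))

# Token-level F1 against a single gold answer: dict with 'f1', 'precision', 'recall'.
print(f1_score_with_precision_recall(gold_answers[0], candidate))
```
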
 
+ ## 📖 Documentation

+ ### 1. Normalized Exact Match

+ #### Method: `em_match`
+ **Parameters**
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

  **Returns**
+ - `boolean`: True if there are any exact normalized matches between gold and candidate answers

  ```python
  from qa_metrics.em import em_match

  reference_answer = ["The Frog Prince", "The Princess and the Frog"]
  candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
  match_result = em_match(reference_answer, candidate_answer)
  ```
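
For this example, `match_result` is `False`: even after normalization, the long candidate sentence is not an exact match of either gold answer.
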
+ ### 2. F1 Score

+ #### Method: `f1_score_with_precision_recall`
  **Parameters**
+ - `reference_answer` (str): A gold (correct) answer to the question
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

  **Returns**
+ - `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer

+ #### Method: `f1_match`
+ **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `threshold` (float): F1 score threshold for considering a match (default: 0.5)
+
+ **Returns**
+ - `boolean`: True if the F1 score exceeds the threshold for any gold answer

  ```python
+ from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
  ```
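
On the running example, `f1_stats` is roughly `{'f1': 0.25, 'precision': 0.67, 'recall': 0.15}` and `f1_match` returns `False` at the default 0.5 threshold, so token overlap alone does not credit the verbose candidate.
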
+ ### 3. PEDANTS

+ #### Method: `get_score`
  **Parameters**
+ - `reference_answer` (str): A gold answer
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

  **Returns**
+ - `float`: The similarity score between the two strings (0 to 1)

+ #### Method: `get_highest_score`
+ **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

+ **Returns**
+ - `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

+ #### Method: `get_scores`
  **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

  **Returns**
+ - `dictionary`: Contains matching scores for all gold answer and candidate answer pairs

+ #### Method: `evaluate`
+ **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

+ **Returns**
+ - `boolean`: True if the candidate answer matches any gold answer

+ #### Method: `get_question_type`
  **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `question` (str): The question being evaluated

  **Returns**
+ - `list`: The type of the question (what, who, when, how, why, which, where)

+ #### Method: `get_judgement_type`
+ **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

+ **Returns**
+ - `list`: A list of revised rules applicable for judging answer correctness

  ```python
  from qa_metrics.pedant import PEDANT

  question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"  # question string used by the calls below
  pedant = PEDANT()
  scores = pedant.get_scores(reference_answer, candidate_answer, question)
  match_result = pedant.evaluate(reference_answer, candidate_answer, question)
  ```
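
The block above only exercises `get_scores` and `evaluate`. Below is a minimal sketch of the remaining PEDANT helpers documented in this section; the argument order follows the parameter lists above, and the tuple unpacking of `get_highest_score` follows the example removed in this commit:

```python
# Pairwise score for one gold answer vs. the candidate (float in [0, 1]).
single_score = pedant.get_score(reference_answer[1], candidate_answer, question)

# Gold/candidate pair with the highest matching score.
max_pair, highest_score = pedant.get_highest_score(reference_answer, candidate_answer, question)

# Question type (what/who/when/how/why/which/where) and the judgement rules PEDANT would apply.
q_type = pedant.get_question_type(reference_answer, question)
rules = pedant.get_judgement_type(reference_answer, candidate_answer, question)

print(single_score, max_pair, highest_score, q_type, rules)
```
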
+ ### 4. Transformer Neural Evaluation

+ #### Method: `get_score`
+ **Parameters**
+ - `reference_answer` (str): A gold answer
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

+ **Returns**
+ - `float`: The similarity score between the two strings (0 to 1)

+ #### Method: `get_highest_score`
  **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated
+
+ **Returns**
+ - `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

+ #### Method: `get_scores`
+ **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

  **Returns**
+ - `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
+
+ #### Method: `transformer_match`
+ **Parameters**
+ - `reference_answer` (list of str): List of gold answers
+ - `candidate_answer` (str): Candidate answer to evaluate
+ - `question` (str): The question being evaluated

+ **Returns**
+ - `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer

  ```python
  from qa_metrics.transformerMatcher import TransformerMatcher

+ # Supported models: zli12321/answer_equivalence_bert, zli12321/answer_equivalence_distilbert,
+ # zli12321/answer_equivalence_roberta, zli12321/answer_equivalence_distilroberta
+ tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
  match_result = tm.transformer_match(reference_answer, candidate_answer, question)
  ```
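
The same variables can also be passed to the TransformerMatcher scoring helpers documented above; a brief sketch (return formats as documented, and the actual scores depend on the chosen checkpoint):

```python
# Per-pair similarity scores for every gold answer against the candidate.
scores = tm.get_scores(reference_answer, candidate_answer, question)

# Best-matching gold/candidate pair according to the transformer.
best_pair = tm.get_highest_score(reference_answer, candidate_answer, question)

# Score for a single gold answer vs. the candidate (float in [0, 1]).
single = tm.get_score(reference_answer[0], candidate_answer, question)

print(scores, best_pair, single)
```
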
+ ### 5. LLM Integration

+ #### Method: `prompt_gpt`
+ **Parameters**
+ - `prompt` (str): The input prompt text
+ - `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
+ - `temperature` (float): Controls randomness (0-1)
+ - `max_tokens` (int): Maximum tokens in response

  ```python
  from qa_metrics.prompt_llm import CloseLLM
+
  model = CloseLLM()
  model.set_openai_api_key(YOUR_OPENAI_KEY)
+ prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
+ result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
  ```
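
With this prompt, the example removed in this commit recorded the completion `'correct'`; `temperature` and `max_tokens` can also be passed explicitly (the earlier version used `temperature=0.1, max_tokens=10`).
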
+ #### Method: `prompt_claude`
+ **Parameters**
+ - `prompt` (str): The input prompt text
+ - `model_engine` (str): Claude model to use
+ - `anthropic_version` (str): API version
+ - `max_tokens_to_sample` (int): Maximum tokens in response
+ - `temperature` (float): Controls randomness (0-1)
+
  ```python
  model = CloseLLM()
+ model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
+ result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
  ```
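
The optional arguments can be supplied in the same call, as in the removed example: `model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)`.
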
+ #### Method: `prompt`
+ **Parameters**
+ - `message` (str): The input message text
+ - `model_engine` (str): Model to use
+ - `temperature` (float): Controls randomness (0-1)
+ - `max_tokens` (int): Maximum tokens in response
+
  ```python
  from qa_metrics.prompt_open_llm import OpenLLM
+
  model = OpenLLM()
  model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
+ result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
  ```
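
Models hosted on [deepinfra](https://deepinfra.com/models) are selected by name via `model_engine`; the removed example passed `temperature=0.1, max_tokens=10` and likewise returned `'correct'` for the prompt above.
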
+ ## 🤗 Model Hub
+
+ Our fine-tuned models are available on Huggingface:
+ - [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert)
+ - [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta)
+ - [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert)
+ - [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta)
+ - [Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert)
+ - [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large)
+
+ ## 📚 Resources
+
+ - [Full Paper](https://arxiv.org/abs/2402.11161)
+ - [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git)
+ - [Supported Models on Deepinfra](https://deepinfra.com/models)
+
+ ## 📄 Citation
+
  ```bibtex
+ @misc{li2024pedantspreciseevaluationsdiverse,
+ title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
  author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
  year={2024},
  eprint={2402.11161},
  archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2402.11161},
  }
  ```

+ ## 📝 License

+ This project is licensed under the [MIT License](LICENSE.md).

+ ## 📬 Contact

+ For questions or comments, please contact: [email protected]