zli12321 committed on
Commit: 687a78e
1 Parent(s): edae272

Bert mini L4

README.md CHANGED
@@ -1,3 +1,291 @@
- ---
- license: apache-2.0
- ---

---
inference: false
license: mit
language:
- en
metrics:
- exact_match
- f1
- bertscore
pipeline_tag: text-classification
---
# QA-Evaluation-Metrics 📊

[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing)

> Check out the main [Repo](https://github.com/zli12321/qa_metrics)

> A fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.

## 🎉 Latest Updates

- **Version 0.2.19 Released!**
- Paper accepted to EMNLP 2024 Findings! 🎓
- Enhanced PEDANTS with multi-pipeline support and improved edge-case handling
- Added support for OpenAI GPT-series and Claude-series models (OpenAI version > 1.0)
- Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via [deepinfra](https://deepinfra.com/models)
- Introduced a trained tiny-bert for QA evaluation (18 MB model size)
- Added direct Huggingface model download support for TransformerMatcher

## 🚀 Quick Start

### Prerequisites
- Python >= 3.6
- openai >= 1.0

### Installation
```bash
pip install qa-metrics
```
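
A quick way to confirm the installation is to import the evaluation entry points used throughout this card and print the installed package version. This is an illustrative smoke test, not part of the package's documented API (`importlib.metadata` requires Python >= 3.8):

```python
from importlib.metadata import version

# Print the installed PyPI package version (requires Python >= 3.8).
print("qa-metrics version:", version("qa-metrics"))

# Evaluation entry points documented below.
from qa_metrics.em import em_match
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
from qa_metrics.pedant import PEDANT
```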

## 💡 Features

Our package offers six QA evaluation methods with varying strengths; the table below compares the main ones (token-level F1 is also documented below):

| Method | Best For | Cost | Correlation with Human Judgment |
|--------|----------|------|--------------------------------|
| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
| PEDANTS | Both short & medium-form QA | Free | Very High |
| [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short & long-form QA | Free | High |
| [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High |
| Black-box LLM Evaluation | All QA types | Paid | Highest |

## 📖 Documentation

### 1. Normalized Exact Match

#### Method: `em_match`
**Parameters**
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

**Returns**
- `boolean`: True if there is any exact normalized match between the gold and candidate answers

```python
from qa_metrics.em import em_match

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
```

### 2. F1 Score

#### Method: `f1_score_with_precision_recall`
**Parameters**
- `reference_answer` (str): A gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

**Returns**
- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer

#### Method: `f1_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `threshold` (float): F1 score threshold for considering a match (default: 0.5)

**Returns**
- `boolean`: True if the F1 score exceeds the threshold for any gold answer

```python
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

# reference_answer and candidate_answer carry over from the example above
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
```

### 3. PEDANTS

#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `float`: The similarity score between two strings (0 to 1)

#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs

#### Method: `evaluate`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `boolean`: True if the candidate answer matches any gold answer

#### Method: `get_question_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `question` (str): The question being evaluated

**Returns**
- `list`: The type of the question (what, who, when, how, why, which, where)

#### Method: `get_judgement_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `list`: A list of revised rules applicable to judging answer correctness

```python
from qa_metrics.pedant import PEDANT

# question for the running example above (reference_answer / candidate_answer)
question = "What movie is loosely based off the Brother Grimm's Iron Henry?"

pedant = PEDANT()
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)
```
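
The other PEDANT helpers documented above follow the same argument order as their parameter lists; a brief usage sketch continuing the same example:

```python
# Question-type and judgement-rule metadata for the running example.
question_types = pedant.get_question_type(reference_answer, question)
judgement_rules = pedant.get_judgement_type(reference_answer, candidate_answer, question)

# Highest-scoring (gold, candidate) pair, and a single pairwise score in [0, 1].
best_pair = pedant.get_highest_score(reference_answer, candidate_answer, question)
pair_score = pedant.get_score(reference_answer[0], candidate_answer, question)
```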

### 4. Transformer Neural Evaluation

#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `float`: The similarity score between two strings (0 to 1)

#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs

#### Method: `transformer_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer

```python
from qa_metrics.transformerMatcher import TransformerMatcher

# Supported checkpoints include zli12321/answer_equivalence_bert,
# zli12321/answer_equivalence_distilbert, zli12321/answer_equivalence_roberta,
# zli12321/answer_equivalence_distilroberta, and zli12321/answer_equivalence_tiny_bert (used below)
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
```
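
When a probability is more useful than a hard match decision, the score-oriented methods documented above can be called on the same `TransformerMatcher` instance; a short sketch following the parameter lists:

```python
# Scores for every (gold, candidate) pair and the best-matching pair.
all_scores = tm.get_scores(reference_answer, candidate_answer, question)
best_pair = tm.get_highest_score(reference_answer, candidate_answer, question)

# Single gold answer vs. candidate: similarity score in [0, 1].
pair_score = tm.get_score(reference_answer[0], candidate_answer, question)
```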

### 5. LLM Integration

#### Method: `prompt_gpt`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response

```python
from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
```
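
A typical use of `prompt_gpt` in this package's setting is LLM-as-a-judge answer evaluation. The sketch below reuses `model` and the running example variables from above; the judge prompt wording is purely illustrative and is not a template shipped with qa-metrics:

```python
# Illustrative judge prompt (an assumption, not the package's built-in template).
judge_prompt = (
    "Question: " + question + "\n"
    "Gold answers: " + "; ".join(reference_answer) + "\n"
    "Candidate answer: " + candidate_answer + "\n"
    "Is the candidate answer correct given the gold answers? Answer 'yes' or 'no'."
)

verdict = model.prompt_gpt(prompt=judge_prompt, model_engine='gpt-3.5-turbo',
                           temperature=0.1, max_tokens=10)
print(verdict)
```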

#### Method: `prompt_claude`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): Claude model to use
- `anthropic_version` (str): API version
- `max_tokens_to_sample` (int): Maximum tokens in response
- `temperature` (float): Controls randomness (0-1)

```python
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
```

#### Method: `prompt`
**Parameters**
- `message` (str): The input message text
- `model_engine` (str): Model to use
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response

```python
from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
```

## 🤗 Model Hub

Our fine-tuned models are available on Huggingface (a raw `transformers` loading sketch follows the list):
- [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert)
- [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta)
- [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert)
- [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta)
- [Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert)
- [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large)
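
Each checkpoint above is a standard `transformers` sequence-classification model, so it can also be loaded without qa-metrics. A minimal sketch, assuming the Tiny-BERT repo id linked above; note that the exact way question, reference, and candidate are combined into one input sequence is handled inside `TransformerMatcher`, so the pair encoding here is only a generic BERT-style illustration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "Zongxia/answer_equivalence_tiny_bert"  # any checkpoint from the list above
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

# Generic BERT-style sentence-pair encoding (illustrative input formatting only).
enc = tokenizer("The Frog Prince", "The Princess and the Frog", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(torch.softmax(logits, dim=-1))
```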

## 📚 Resources

- [Full Paper](https://arxiv.org/abs/2402.11161)
- [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git)
- [Supported Models on Deepinfra](https://deepinfra.com/models)

## 📄 Citation

```bibtex
@misc{li2024pedantspreciseevaluationsdiverse,
      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
      year={2024},
      eprint={2402.11161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11161},
}
```

## 📝 License

This project is licensed under the [MIT License](LICENSE.md).

## 📬 Contact

For questions or comments, please contact: [email protected]
config.json ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "/srv/www/active-topic-modeling/ae_tune/models--google--bert_uncased_L-4_H-256_A-4/snapshots/387825ce42dbb39b87911cdf8e383ee3b25184f8",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.37.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
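
This is the config of a BERT-mini checkpoint (google/bert_uncased_L-4_H-256_A-4) fine-tuned for sequence classification, matching the commit message "Bert mini L4". As a rough sanity check, the parameter count implied by these hyperparameters is consistent with the ~44.7 MB float32 weight file added below; the 2-label classification head in the sketch is an assumption, since `num_labels` is not stated in the config:

```python
# Rough parameter count for a 4-layer, 256-hidden BERT classifier (values from config.json).
V, P, T = 30522, 512, 2     # vocab_size, max_position_embeddings, type_vocab_size
H, I, L = 256, 1024, 4      # hidden_size, intermediate_size, num_hidden_layers
num_labels = 2              # assumed binary equivalent / not-equivalent head

embeddings = (V + P + T) * H + 2 * H        # word/position/type embeddings + LayerNorm
per_layer = (
    4 * (H * H + H)     # Q, K, V and attention output projections
    + 2 * H             # attention LayerNorm
    + (H * I + I)       # feed-forward up-projection
    + (I * H + H)       # feed-forward down-projection
    + 2 * H             # output LayerNorm
)
pooler = H * H + H
classifier = H * num_labels + num_labels

total = embeddings + L * per_layer + pooler + classifier
print(total, "parameters ->", round(total * 4 / 1e6, 1), "MB in float32")
# ~11.2M parameters, ~44.7 MB: consistent with model.safetensors (44,692,608 bytes).
```
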
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c585ee3f29794eb0dbf01bada1ba6b3bff9a1bec1dbde62f997b74c5d942b50d
size 44692608
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render.
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
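
This is a standard lower-cased WordPiece `BertTokenizer` configuration. A minimal loading sketch, where "path/to/this/repo" is a placeholder for a local clone or this model's hub id (not specified on this page); the pair encoding is only a generic illustration, since the exact field formatting is handled by qa-metrics' `TransformerMatcher`:

```python
from transformers import AutoTokenizer

# Placeholder path: a local clone of this repository or its Hugging Face hub id.
tokenizer = AutoTokenizer.from_pretrained("path/to/this/repo")

# Generic BERT-style pair encoding: [CLS] ... [SEP] ... [SEP]
enc = tokenizer("Who wrote The Frog Prince?", "The Brothers Grimm")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```
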
vocab.txt ADDED
The diff for this file is too large to render.