zli12321 committed on
Commit: 687a78e
1 Parent(s): edae272

Bert mini L4

README.md CHANGED
@@ -1,3 +1,291 @@
- ---
- license: apache-2.0
- ---

---
inference: false
license: mit
language:
- en
metrics:
- exact_match
- f1
- bertscore
pipeline_tag: text-classification
---
# QA-Evaluation-Metrics 📊

[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing)

> Check out the main [Repo](https://github.com/zli12321/qa_metrics)

> A fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.

## 🎉 Latest Updates

- **Version 0.2.19 Released!**
- Paper accepted to EMNLP 2024 Findings! 🎓
- Enhanced PEDANTS with multi-pipeline support and improved edge-case handling
- Added support for OpenAI GPT-series and Claude-series models (OpenAI version > 1.0)
- Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via [deepinfra](https://deepinfra.com/models)
- Introduced a trained tiny-bert for QA evaluation (18 MB model size)
- Added direct Huggingface model download support for TransformerMatcher

## 🚀 Quick Start

### Prerequisites
- Python >= 3.6
- openai >= 1.0

### Installation
```bash
pip install qa-metrics
```
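
A quick way to confirm the installation is to import the evaluation entry points used throughout this card and print the installed package version. This is an illustrative smoke test, not part of the package's documented API (`importlib.metadata` requires Python >= 3.8):

```python
from importlib.metadata import version

# Print the installed PyPI package version (requires Python >= 3.8).
print("qa-metrics version:", version("qa-metrics"))

# Evaluation entry points documented below.
from qa_metrics.em import em_match
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
from qa_metrics.pedant import PEDANT
```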

## 💡 Features

Our package offers six QA evaluation methods with varying strengths; the table below compares the main ones (token-level F1 is also documented below):

| Method | Best For | Cost | Correlation with Human Judgment |
|--------|----------|------|--------------------------------|
| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
| PEDANTS | Both short & medium-form QA | Free | Very High |
| [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short & long-form QA | Free | High |
| [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High |
| Black-box LLM Evaluation | All QA types | Paid | Highest |

## 📖 Documentation

### 1. Normalized Exact Match

#### Method: `em_match`
**Parameters**
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

**Returns**
- `boolean`: True if there is any exact normalized match between the gold and candidate answers

```python
from qa_metrics.em import em_match

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
```

### 2. F1 Score

#### Method: `f1_score_with_precision_recall`
**Parameters**
- `reference_answer` (str): A gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

**Returns**
- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer

#### Method: `f1_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `threshold` (float): F1 score threshold for considering a match (default: 0.5)

**Returns**
- `boolean`: True if the F1 score exceeds the threshold for any gold answer

```python
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

# reference_answer and candidate_answer carry over from the example above
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
```

### 3. PEDANTS

#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `float`: The similarity score between two strings (0 to 1)

#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs

#### Method: `evaluate`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `boolean`: True if the candidate answer matches any gold answer

#### Method: `get_question_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `question` (str): The question being evaluated

**Returns**
- `list`: The type of the question (what, who, when, how, why, which, where)

#### Method: `get_judgement_type`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `list`: A list of revised rules applicable to judging answer correctness

```python
from qa_metrics.pedant import PEDANT

# question for the running example above (reference_answer / candidate_answer)
question = "What movie is loosely based off the Brother Grimm's Iron Henry?"

pedant = PEDANT()
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)
```
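
The other PEDANT helpers documented above follow the same argument order as their parameter lists; a brief usage sketch continuing the same example:

```python
# Question-type and judgement-rule metadata for the running example.
question_types = pedant.get_question_type(reference_answer, question)
judgement_rules = pedant.get_judgement_type(reference_answer, candidate_answer, question)

# Highest-scoring (gold, candidate) pair, and a single pairwise score in [0, 1].
best_pair = pedant.get_highest_score(reference_answer, candidate_answer, question)
pair_score = pedant.get_score(reference_answer[0], candidate_answer, question)
```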

### 4. Transformer Neural Evaluation

#### Method: `get_score`
**Parameters**
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `float`: The similarity score between two strings (0 to 1)

#### Method: `get_highest_score`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score

#### Method: `get_scores`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs

#### Method: `transformer_match`
**Parameters**
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

**Returns**
- `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer

```python
from qa_metrics.transformerMatcher import TransformerMatcher

# Supported checkpoints include zli12321/answer_equivalence_bert,
# zli12321/answer_equivalence_distilbert, zli12321/answer_equivalence_roberta,
# zli12321/answer_equivalence_distilroberta, and zli12321/answer_equivalence_tiny_bert (used below)
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
```
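
When a probability is more useful than a hard match decision, the score-oriented methods documented above can be called on the same `TransformerMatcher` instance; a short sketch following the parameter lists:

```python
# Scores for every (gold, candidate) pair and the best-matching pair.
all_scores = tm.get_scores(reference_answer, candidate_answer, question)
best_pair = tm.get_highest_score(reference_answer, candidate_answer, question)

# Single gold answer vs. candidate: similarity score in [0, 1].
pair_score = tm.get_score(reference_answer[0], candidate_answer, question)
```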

### 5. LLM Integration

#### Method: `prompt_gpt`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response

```python
from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
```
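
A typical use of `prompt_gpt` in this package's setting is LLM-as-a-judge answer evaluation. The sketch below reuses `model` and the running example variables from above; the judge prompt wording is purely illustrative and is not a template shipped with qa-metrics:

```python
# Illustrative judge prompt (an assumption, not the package's built-in template).
judge_prompt = (
    "Question: " + question + "\n"
    "Gold answers: " + "; ".join(reference_answer) + "\n"
    "Candidate answer: " + candidate_answer + "\n"
    "Is the candidate answer correct given the gold answers? Answer 'yes' or 'no'."
)

verdict = model.prompt_gpt(prompt=judge_prompt, model_engine='gpt-3.5-turbo',
                           temperature=0.1, max_tokens=10)
print(verdict)
```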

#### Method: `prompt_claude`
**Parameters**
- `prompt` (str): The input prompt text
- `model_engine` (str): Claude model to use
- `anthropic_version` (str): API version
- `max_tokens_to_sample` (int): Maximum tokens in response
- `temperature` (float): Controls randomness (0-1)

```python
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
```

#### Method: `prompt`
**Parameters**
- `message` (str): The input message text
- `model_engine` (str): Model to use
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response

```python
from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
```

## 🤗 Model Hub

Our fine-tuned models are available on Huggingface (a raw `transformers` loading sketch follows the list):
- [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert)
- [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta)
- [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert)
- [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta)
- [Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert)
- [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large)
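
Each checkpoint above is a standard `transformers` sequence-classification model, so it can also be loaded without qa-metrics. A minimal sketch, assuming the Tiny-BERT repo id linked above; note that the exact way question, reference, and candidate are combined into one input sequence is handled inside `TransformerMatcher`, so the pair encoding here is only a generic BERT-style illustration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "Zongxia/answer_equivalence_tiny_bert"  # any checkpoint from the list above
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

# Generic BERT-style sentence-pair encoding (illustrative input formatting only).
enc = tokenizer("The Frog Prince", "The Princess and the Frog", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(torch.softmax(logits, dim=-1))
```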

## 📚 Resources

- [Full Paper](https://arxiv.org/abs/2402.11161)
- [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git)
- [Supported Models on Deepinfra](https://deepinfra.com/models)

## 📄 Citation

```bibtex
@misc{li2024pedantspreciseevaluationsdiverse,
      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
      year={2024},
      eprint={2402.11161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11161},
}
```

## 📝 License

This project is licensed under the [MIT License](LICENSE.md).

## 📬 Contact

For questions or comments, please contact: [email protected]
config.json ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "/srv/www/active-topic-modeling/ae_tune/models--google--bert_uncased_L-4_H-256_A-4/snapshots/387825ce42dbb39b87911cdf8e383ee3b25184f8",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.37.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
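
This is the config of a BERT-mini checkpoint (google/bert_uncased_L-4_H-256_A-4) fine-tuned for sequence classification, matching the commit message "Bert mini L4". As a rough sanity check, the parameter count implied by these hyperparameters is consistent with the ~44.7 MB float32 weight file added below; the 2-label classification head in the sketch is an assumption, since `num_labels` is not stated in the config:

```python
# Rough parameter count for a 4-layer, 256-hidden BERT classifier (values from config.json).
V, P, T = 30522, 512, 2     # vocab_size, max_position_embeddings, type_vocab_size
H, I, L = 256, 1024, 4      # hidden_size, intermediate_size, num_hidden_layers
num_labels = 2              # assumed binary equivalent / not-equivalent head

embeddings = (V + P + T) * H + 2 * H        # word/position/type embeddings + LayerNorm
per_layer = (
    4 * (H * H + H)     # Q, K, V and attention output projections
    + 2 * H             # attention LayerNorm
    + (H * I + I)       # feed-forward up-projection
    + (I * H + H)       # feed-forward down-projection
    + 2 * H             # output LayerNorm
)
pooler = H * H + H
classifier = H * num_labels + num_labels

total = embeddings + L * per_layer + pooler + classifier
print(total, "parameters ->", round(total * 4 / 1e6, 1), "MB in float32")
# ~11.2M parameters, ~44.7 MB: consistent with model.safetensors (44,692,608 bytes).
```
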
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c585ee3f29794eb0dbf01bada1ba6b3bff9a1bec1dbde62f997b74c5d942b50d
size 44692608
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render.
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
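
This is a standard lower-cased WordPiece `BertTokenizer` configuration. A minimal loading sketch, where "path/to/this/repo" is a placeholder for a local clone or this model's hub id (not specified on this page); the pair encoding is only a generic illustration, since the exact field formatting is handled by qa-metrics' `TransformerMatcher`:

```python
from transformers import AutoTokenizer

# Placeholder path: a local clone of this repository or its Hugging Face hub id.
tokenizer = AutoTokenizer.from_pretrained("path/to/this/repo")

# Generic BERT-style pair encoding: [CLS] ... [SEP] ... [SEP]
enc = tokenizer("Who wrote The Frog Prince?", "The Brothers Grimm")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```
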
vocab.txt ADDED
The diff for this file is too large to render.