Jlonge4 committed on
Commit
46b8283
1 Parent(s): 71f03c2

Update README.md

Files changed (1)
  1. README.md +78 -47
README.md CHANGED
@@ -11,50 +11,81 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # outputs
-
- This model is a fine-tuned version of [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) on the None dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0005
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 8
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 10
- - training_steps: 60
- - mixed_precision_training: Native AMP
-
- ### Training results
-
-
-
- ### Framework versions
-
- - PEFT 0.11.1
- - Transformers 4.41.2
- - Pytorch 2.3.0+cu121
- - Datasets 2.19.1
- - Tokenizers 0.19.1
+ ## Merged Model Performance
+
+ This repository contains our RAG relevance PEFT adapter, fine-tuned from [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).
+
+ ### RAG Relevance Classification Metrics
+
+ Our merged model achieves the following performance on a binary relevance classification task:
+
+ ```
+               precision    recall  f1-score   support
+
+            0       0.74      0.77      0.75       100
+            1       0.76      0.73      0.74       100
+
+     accuracy                           0.75       200
+    macro avg       0.75      0.75      0.75       200
+ weighted avg       0.75      0.75      0.75       200
+ ```
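+
+ The layout above follows scikit-learn's `classification_report`. As a rough illustration only (not the exact evaluation script), a report like this can be produced from the model's one-word outputs and gold labels, mapping "unrelated"/"relevant" to 0/1:
+
+ ```python
+ from sklearn.metrics import classification_report
+
+ # Illustrative data only: gold relevance labels and the model's one-word outputs.
+ gold = ["relevant", "unrelated", "relevant", "unrelated"]
+ predictions = ["relevant", "unrelated", "unrelated", "unrelated"]
+
+ # Map the one-word labels onto the 0/1 classes used in the report above.
+ label_to_int = {"unrelated": 0, "relevant": 1}
+ y_true = [label_to_int[label] for label in gold]
+ y_pred = [label_to_int[pred.strip().lower()] for pred in predictions]
+
+ print(classification_report(y_true, y_pred, digits=2))
+ ```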
+
+ ### Model Usage
+
+ For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, pipeline
+
+
+ def format_input_classification(query, text):
+     prompt = f"""
+ You are comparing a reference text to a question and trying to determine if the reference text
+ contains information relevant to answering the question. Here is the data:
+     [BEGIN DATA]
+     ************
+     [Question]: {query}
+     ************
+     [Reference text]: {text}
+     ************
+     [END DATA]
+ Compare the Question above to the Reference text. You must determine whether the Reference text
+ contains information that can answer the Question. Please focus on whether the very specific
+ question can be answered by the information in the Reference text.
+ Your response must be a single word, either "relevant" or "unrelated",
+ and should not contain any text or characters aside from that word.
+ "unrelated" means that the reference text does not contain an answer to the Question.
+ "relevant" means the reference text contains an answer to the Question."""
+     return prompt
+
+
+ # Base checkpoint and tokenizer; the adapter in this repo was trained from Phi-3-mini-4k-instruct.
+ base_model = "microsoft/Phi-3-mini-4k-instruct"
+ attn_implementation = "eager"  # or "flash_attention_2" if your environment supports it
+ tokenizer = AutoTokenizer.from_pretrained(base_model)
+
+ text = format_input_classification("What is quantization?",
+                                    "Quantization is a method to reduce the memory footprint")
+ messages = [
+     {"role": "user", "content": text}
+ ]
+
+ pipe = pipeline(
+     "text-generation",
+     model=base_model,
+     model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
+     tokenizer=tokenizer,
+ )
+ ```
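+
+ Note that the pipeline above loads only the base checkpoint. Below is a minimal sketch of one way to apply the PEFT adapter from this repo on top of it; the adapter id is a placeholder, and the snippet continues from the code above, reusing `tokenizer` and `messages`:
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM
+
+ # Placeholder adapter id; replace with this repository's id on the Hub.
+ adapter_id = "your-username/phi3-rag-relevance-adapter"
+
+ # Load the base checkpoint and apply the PEFT adapter on top of it.
+ base = AutoModelForCausalLM.from_pretrained(
+     base_model,
+     torch_dtype=torch.float16,
+     attn_implementation=attn_implementation,
+     device_map="auto",
+ )
+ model = PeftModel.from_pretrained(base, adapter_id)
+ model = model.merge_and_unload()  # optionally fold the adapter weights into the base model
+
+ relevance_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
+
+ # Generate the single-word verdict for the chat built above.
+ result = relevance_pipe(messages, max_new_tokens=5, do_sample=False)
+ print(result[0]["generated_text"][-1]["content"])  # expected: "relevant" or "unrelated"
+ ```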
+
+ ### Comparison with Other Models
+
+ We compared our merged model's performance on the RAG relevance eval [1] against several other state-of-the-art language models:
+
+ | Model             | Precision | Recall |     F1 |
+ |-------------------|----------:|-------:|-------:|
+ | Our Merged Model  |      0.74 |   0.77 |   0.75 |
+ | GPT-4             |      0.70 |   0.88 |   0.78 |
+ | GPT-4 Turbo       |      0.68 |   0.91 |   0.78 |
+ | Gemini Pro        |      0.61 |   1.00 |   0.76 |
+ | GPT-3.5           |      0.42 |   1.00 |   0.59 |
+ | PaLM (Text Bison) |      0.53 |   1.00 |   0.69 |
+
+ [1] Scores for the comparison models are from Arize Phoenix.
+
+ As shown in the table, our merged model achieves a comparable F1 score of 0.75 with the highest precision of the group, outperforming GPT-3.5 and PaLM (Text Bison) on F1.
+
+ We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.
+
+ Citations:
+ [1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance