---
library_name: transformers
tags: []
---
# Merged Model Performance
This repository contains the evaluation results of our merged RAG relevance PEFT adapter model.
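For reference, here is a minimal inference sketch using the `transformers` library. The repository id and the label mapping (1 = relevant, 0 = irrelevant) are assumptions, not confirmed by this card, so adjust them to match the actual configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/merged-rag-relevance-model"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "What is the capital of France?"
document = "Paris is the capital and largest city of France."

# Encode the (query, document) pair as a single sequence; assumes the model
# was fine-tuned on text pairs in this format.
inputs = tokenizer(query, document, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label mapping: 1 = relevant, 0 = irrelevant.
pred = logits.argmax(dim=-1).item()
print("relevant" if pred == 1 else "irrelevant")
```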
## Classification Performance
Our merged model achieves the following performance on a binary classification task:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.74 | 0.77 | 0.75 | 100 |
| 1 | 0.76 | 0.73 | 0.74 | 100 |
| accuracy | | | 0.75 | 200 |
| macro avg | 0.75 | 0.75 | 0.75 | 200 |
| weighted avg | 0.75 | 0.75 | 0.75 | 200 |
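The report above follows scikit-learn's `classification_report` format. As a sketch, the snippet below reproduces it from the confusion counts implied by the table (77/100 correct for class 0, 73/100 for class 1); in practice `y_pred` would come from running the model over the evaluation set:

```python
from sklearn.metrics import classification_report

# Synthetic labels standing in for the 200 evaluation pairs (100 per class),
# chosen to match the confusion counts implied by the table above.
y_true = [0] * 100 + [1] * 100
y_pred = [0] * 77 + [1] * 23 + [0] * 27 + [1] * 73
print(classification_report(y_true, y_pred, digits=2))
```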
## Comparison with Other Models
We compared our merged model's performance on the RAG Eval benchmark against several other state-of-the-art language models:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Our Merged Model | 0.74 | 0.77 | 0.75 |
| GPT-4 | 0.70 | 0.88 | 0.78 |
| GPT-4 Turbo | 0.68 | 0.91 | 0.78 |
| Gemini Pro | 0.61 | 1.00 | 0.76 |
| GPT-3.5 | 0.42 | 1.00 | 0.59 |
| PaLM (Text Bison) | 0.53 | 1.00 | 0.69 |

Scores for the comparison models are taken from arize/phoenix [1].
As the table shows, our merged model reaches an F1 of 0.75, comparable to GPT-4 (0.78) and Gemini Pro (0.76), with the highest precision (0.74) of the models compared, and it outperforms several black-box models such as GPT-3.5 and PaLM (Text Bison) on F1.
We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.
## Citations

[1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance