---
library_name: transformers
tags: []
---
# Merged Model Performance
This repository contains the evaluation results of our merged RAG relevance PEFT adapter model.
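For reference, here is a minimal inference sketch using the `transformers` library. The repository id and the label mapping (1 = relevant, 0 = irrelevant) are assumptions, not confirmed by this card, so adjust them to match the actual configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/merged-rag-relevance-model"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "What is the capital of France?"
document = "Paris is the capital and largest city of France."

# Encode the (query, document) pair as a single sequence; assumes the model
# was fine-tuned on text pairs in this format.
inputs = tokenizer(query, document, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label mapping: 1 = relevant, 0 = irrelevant.
pred = logits.argmax(dim=-1).item()
print("relevant" if pred == 1 else "irrelevant")
```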
## Classification Performance
Our merged model achieves the following performance on a binary classification task:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.74 | 0.77 | 0.75 | 100 |
| 1 | 0.76 | 0.73 | 0.74 | 100 |
| accuracy | | | 0.75 | 200 |
| macro avg | 0.75 | 0.75 | 0.75 | 200 |
| weighted avg | 0.75 | 0.75 | 0.75 | 200 |
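The report above follows scikit-learn's `classification_report` format. As a sketch, the snippet below reproduces it from the confusion counts implied by the table (77/100 correct for class 0, 73/100 for class 1); in practice `y_pred` would come from running the model over the evaluation set:

```python
from sklearn.metrics import classification_report

# Synthetic labels standing in for the 200 evaluation pairs (100 per class),
# chosen to match the confusion counts implied by the table above.
y_true = [0] * 100 + [1] * 100
y_pred = [0] * 77 + [1] * 23 + [0] * 27 + [1] * 73
print(classification_report(y_true, y_pred, digits=2))
```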
## Comparison with Other Models
We compared our merged model's performance on the RAG Eval benchmark against several other state-of-the-art language models:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Our Merged Model | 0.74 | 0.77 | 0.75 |
| GPT-4 | 0.70 | 0.88 | 0.78 |
| GPT-4 Turbo | 0.68 | 0.91 | 0.78 |
| Gemini Pro | 0.61 | 1.00 | 0.76 |
| GPT-3.5 | 0.42 | 1.00 | 0.59 |
| PaLM (Text Bison) | 0.53 | 1.00 | 0.69 |

Scores for the comparison models are taken from arize/phoenix [1].
As the table shows, our merged model reaches an F1 of 0.75, comparable to GPT-4 (0.78) and Gemini Pro (0.76), with the highest precision (0.74) of the models compared, and it outperforms several black-box models such as GPT-3.5 and PaLM (Text Bison) on F1.
We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.
## Citations

[1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance