---
library_name: transformers
tags: []
---
## Merged Model Performance
This repository contains our merged RAG-relevance PEFT adapter model and its evaluation results.
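The merge itself follows the usual `peft` workflow: load the base model, attach the trained adapter, and fold the adapter weights back into the base so the result ships as a single standalone checkpoint. A minimal sketch is below; the base model and adapter IDs are placeholders, not the actual repositories behind this card:

```python
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE_MODEL = "base-model-id"               # placeholder: original base checkpoint
ADAPTER_ID = "rag-relevance-peft-adapter"  # placeholder: trained PEFT adapter

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

# Attach the adapter, then merge its weights into the base model so it can
# be saved and served without the peft runtime.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
merged = model.merge_and_unload()

merged.save_pretrained("merged-rag-relevance-model")
tokenizer.save_pretrained("merged-rag-relevance-model")
```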
### Classification Performance
Our merged model achieves the following performance on a binary classification task:
```
              precision    recall  f1-score   support

           0       0.74      0.77      0.75       100
           1       0.76      0.73      0.74       100

    accuracy                           0.75       200
   macro avg       0.75      0.75      0.75       200
weighted avg       0.75      0.75      0.75       200
```
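For reference, the report above is standard scikit-learn `classification_report` output. The sketch below reproduces it from synthetic labels constructed to match the reported confusion matrix; it is illustrative only, not the actual evaluation script or predictions:

```python
from sklearn.metrics import classification_report

# Synthetic labels for illustration: 200 balanced test examples
# (label 1 = relevant retrieved passage, label 0 = irrelevant).
y_true = [0] * 100 + [1] * 100

# Predictions chosen to match the reported confusion matrix:
# 77/100 correct on class 0, 73/100 correct on class 1.
y_pred = [0] * 77 + [1] * 23 + [1] * 73 + [0] * 27

print(classification_report(y_true, y_pred, digits=2))
```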
### Comparison with Other Models
We compared our merged model's performance on the RAG Eval benchmark against several other state-of-the-art language models:
| Model | Precision | Recall | F1 |
|---------------------- |----------:|-------:|-------:|
| Our Merged Model | 0.74 | 0.77 | 0.75 |
| GPT-4 | 0.70 | 0.88 | 0.78 |
| GPT-4 Turbo | 0.68 | 0.91 | 0.78 |
| Gemini Pro | 0.61 | 1.00 | 0.76 |
| GPT-3.5 | 0.42 | 1.00 | 0.59 |
| PaLM (Text Bison)     | 0.53      | 1.00   | 0.69   |

[1] Baseline scores from arize/phoenix.
As shown in the table, our merged model reaches an F1 of 0.75 with the highest precision (0.74) among the models listed, outperforming GPT-3.5 and PaLM (Text Bison) on F1 while trading recall for precision against the GPT-4-class models.
We will continue to fine-tune the merged model to push its performance further across additional benchmarks and tasks.
Citations:
[1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance