---
library_name: transformers
tags: []
---

## Merged Model Performance

This repository contains the results of our merged RAG relevance PEFT adapter model.

### Classification Performance

Our merged model achieves the following performance on a binary classification task:

```
              precision    recall  f1-score   support

           0       0.74      0.77      0.75       100
           1       0.76      0.73      0.74       100

    accuracy                           0.75       200
   macro avg       0.75      0.75      0.75       200
weighted avg       0.75      0.75      0.75       200
```

### Comparison with Other Models

We compared our merged model's performance on the RAG Eval benchmark against several other state-of-the-art language models [1]:

| Model             | Precision | Recall |   F1 |
|-------------------|----------:|-------:|-----:|
| Our Merged Model  |      0.74 |   0.77 | 0.75 |
| GPT-4             |      0.70 |   0.88 | 0.78 |
| GPT-4 Turbo       |      0.68 |   0.91 | 0.78 |
| Gemini Pro        |      0.61 |   1.00 | 0.76 |
| GPT-3.5           |      0.42 |   1.00 | 0.59 |
| Palm (Text Bison) |      0.53 |   1.00 | 0.69 |

[1] Scores for the comparison models are taken from arize/phoenix.

As shown in the table, our merged model achieves a comparable F1 score of 0.75, outperforming several of the other black-box models. We will continue to fine-tune the merged model to improve its performance across benchmarks and tasks.

Citations:

[1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance
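
### Example Usage

The adapter described above is used for binary RAG relevance classification. The snippet below is a minimal sketch of how such a merged model could be loaded and queried with `transformers`; the model id `your-org/merged-rag-relevance-model`, the use of a sequence-classification head, and the query/document pair input format are assumptions for illustration, not confirmed details of this repository.

```python
# Sketch: score a query/document pair for RAG relevance with the merged model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/merged-rag-relevance-model"  # hypothetical placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "What is the capital of France?"
document = "Paris is the capital and most populous city of France."

# Encode the query/document pair and predict relevant (1) vs. irrelevant (0).
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(dim=-1).item()
print("relevant" if predicted_label == 1 else "irrelevant")
```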
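
The classification report shown earlier follows the standard `scikit-learn` format. As a sketch, given gold relevance labels and model predictions over the 200-example evaluation set, it could be reproduced as follows (the label lists here are illustrative placeholders, not the actual evaluation data):

```python
# Sketch: generate a classification report from gold labels and predictions.
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1, 0]  # gold relevance labels (placeholder values)
y_pred = [0, 1, 1, 1, 0, 0]  # model predictions (placeholder values)

print(classification_report(y_true, y_pred, digits=2))
```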