---
library_name: transformers
tags: []
---

## Merged Model Performance

This repository contains our merged RAG-relevance PEFT adapter model, along with its evaluation results.
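
### Usage

The snippet below is a minimal sketch of how an adapter-merged model like this one can be loaded with `transformers` and prompted to judge RAG relevance. The repository ID, prompt template, and output labels are hypothetical placeholders, not confirmed details of this checkpoint; adjust them to match the actual model.

```python
# Hypothetical usage sketch: repo ID, prompt format, and labels are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/merged-rag-relevance-model"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask the model to label a (question, reference) pair as relevant or unrelated.
prompt = (
    "Determine whether the reference text is relevant to the question. "
    "Answer with 'relevant' or 'unrelated'.\n\n"
    "Question: What is the capital of France?\n"
    "Reference: Paris is the capital and most populous city of France.\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```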

### Classification Performance

Our merged model achieves the following performance on a held-out binary relevance classification set (100 examples per class):

```
              precision    recall  f1-score   support

           0       0.74      0.77      0.75       100
           1       0.76      0.73      0.74       100

    accuracy                           0.75       200
   macro avg       0.75      0.75      0.75       200
weighted avg       0.75      0.75      0.75       200
```
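
The report above is a standard scikit-learn classification report. As a sketch, assuming gold labels in `y_true` and the model's 0/1 predictions in `y_pred` (both illustrative), it can be reproduced like this:

```python
# Sketch: generate a report like the one above with scikit-learn.
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 0]  # gold relevance labels (illustrative)
y_pred = [0, 1, 0, 0, 1, 1]  # model predictions (illustrative)
print(classification_report(y_true, y_pred, digits=2))
```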

### Comparison with Other Models

We compared our merged model's performance on the RAG Eval benchmark against several other state-of-the-art language models:

| Model                 | Precision | Recall | F1     |
|---------------------- |----------:|-------:|-------:|
| Our Merged Model      | 0.74      | 0.77   | 0.75   |
| GPT-4                 | 0.70      | 0.88   | 0.78   |
| GPT-4 Turbo           | 0.68      | 0.91   | 0.78   |
| Gemini Pro            | 0.61      | 1.00   | 0.76   |
| GPT-3.5               | 0.42      | 1.00   | 0.59   |
| PaLM (Text Bison)     | 0.53      | 1.00   | 0.69   |

Scores for the comparison models are taken from Arize Phoenix [1].

As shown in the table, our merged model's F1 score of 0.75 is competitive with GPT-4 (0.78), GPT-4 Turbo (0.78), and Gemini Pro (0.76), while outperforming the black-box GPT-3.5 (0.59) and PaLM (Text Bison) (0.69).

We will continue to fine-tune the merged model to improve its performance across additional benchmarks and tasks.

Citations:
[1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance