---
library_name: transformers
tags: []
---

## Merged Model Performance

This repository contains our RAG relevance PEFT adapter merged into its base model, along with its evaluation results.

### RAG Relevance Classification Metrics

Our merged model achieves the following performance on a binary classification task:

```
              precision    recall  f1-score   support

           0       0.74      0.77      0.75       100
           1       0.76      0.73      0.74       100

    accuracy                           0.75       200
   macro avg       0.75      0.75      0.75       200
weighted avg       0.75      0.75      0.75       200
```
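
The report above follows scikit-learn's `classification_report` layout. Below is a minimal sketch of how such a report can be reproduced once gold labels and model predictions have been collected; the 1 = "relevant" / 0 = "unrelated" mapping is illustrative, not fixed by this card:

```python
from sklearn.metrics import classification_report

# Illustrative scoring helper: y_true and y_pred are lists of 0/1 labels,
# where we assume 1 = "relevant" and 0 = "unrelated".
def score_relevance(y_true, y_pred):
    print(classification_report(y_true, y_pred, digits=2))

# Toy example:
# score_relevance([1, 0, 1, 1], [1, 0, 0, 1])
```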

### Model Usage
For best results, we recommend starting with the following prompting strategy (and encourage you to tweak it as you see fit):

```python
import torch
from transformers import pipeline

def format_input_classification(query, text):
    # Build the relevance-classification prompt for one (question, reference) pair.
    prompt = f"""
      You are comparing a reference text to a question and trying to determine if the reference text
  contains information relevant to answering the question. Here is the data:
      [BEGIN DATA]
      ************
      [Question]: {query}
      ************
      [Reference text]: {text}
      ************
      [END DATA]
  Compare the Question above to the Reference text. You must determine whether the Reference text
  contains information that can answer the Question. Please focus on whether the very specific
  question can be answered by the information in the Reference text.
  Your response must be a single word, either "relevant" or "unrelated",
  and should not contain any text or characters aside from that word.
  "unrelated" means that the reference text does not contain an answer to the Question.
  "relevant" means the reference text contains an answer to the Question."""
    return prompt


text = format_input_classification("What is quantization?",
  "Quantization is a method to reduce the memory footprint")
messages = [
    {"role": "user", "content": text}
]

# base_model, tokenizer, and attn_implementation are assumed to be defined beforehand,
# e.g. the merged model and its tokenizer loaded via AutoModelForCausalLM / AutoTokenizer.
pipe = pipeline(
    "text-generation",
    model=base_model,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
    tokenizer=tokenizer,
)
```
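
One possible way to run the pipeline on the formatted message and map the single-word answer back to a binary label; the `max_new_tokens` value and the parsing logic below are illustrative choices, not fixed by this card:

```python
# Illustrative usage, assuming `pipe` and `messages` from the block above.
outputs = pipe(messages, max_new_tokens=8)

# With chat-style input the pipeline returns the conversation including the
# generated assistant turn; take its content and normalise it.
answer = outputs[0]["generated_text"][-1]["content"].strip().lower()

# Map the single-word response to the 0/1 labels used in the metrics above
# (assumed mapping: 1 = "relevant", 0 = "unrelated").
label = 1 if answer.startswith("relevant") else 0
print(answer, label)
```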

### Comparison with Other Models

We compared our merged model's performance on the RAG Eval benchmark against several other state-of-the-art language models:

| Model                 | Precision | Recall | F1     |
|---------------------- |----------:|-------:|-------:|
| Our Merged Model      | 0.74      | 0.77   | 0.75   |
| GPT-4                 | 0.70      | 0.88   | 0.78   |
| GPT-4 Turbo           | 0.68      | 0.91   | 0.78   |
| Gemini Pro            | 0.61      | 1.00   | 0.76   |
| GPT-3.5               | 0.42      | 1.00   | 0.59   |
| PaLM (Text Bison)     | 0.53      | 1.00   | 0.69   |

[1] Scores from arize/phoenix.

As the table shows, our merged model reaches a competitive F1 of 0.75, with higher precision than any of the listed models and an F1 above that of GPT-3.5 and PaLM (Text Bison).

We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.

### Citations

[1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance