---
library_name: transformers
tags: []
---

## Merged Model Performance

This repository contains the results of our merged RAG relevance PEFT adapter model.

### Classification Performance

Our merged model achieves the following performance on a binary classification task:

```
              precision    recall  f1-score   support

           0       0.74      0.77      0.75       100
           1       0.76      0.73      0.74       100

    accuracy                           0.75       200
   macro avg       0.75      0.75      0.75       200
weighted avg       0.75      0.75      0.75       200
```

### Comparison with Other Models

We compared our merged model's performance on the RAG Eval benchmark against several other state-of-the-art language models [1]:

| Model             | Precision | Recall |   F1 |
|-------------------|----------:|-------:|-----:|
| Our Merged Model  |      0.74 |   0.77 | 0.75 |
| GPT-4             |      0.70 |   0.88 | 0.78 |
| GPT-4 Turbo       |      0.68 |   0.91 | 0.78 |
| Gemini Pro        |      0.61 |   1.00 | 0.76 |
| GPT-3.5           |      0.42 |   1.00 | 0.59 |
| Palm (Text Bison) |      0.53 |   1.00 | 0.69 |

[1] Scores for the comparison models are taken from arize/phoenix.

As shown in the table, our merged model achieves a comparable F1 score of 0.75, outperforming several of the other black-box models. We will continue to fine-tune the merged model to improve its performance across benchmarks and tasks.

Citations:

[1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance
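
### Example Usage

The adapter described above is used for binary RAG relevance classification. The snippet below is a minimal sketch of how such a merged model could be loaded and queried with `transformers`; the model id `your-org/merged-rag-relevance-model`, the use of a sequence-classification head, and the query/document pair input format are assumptions for illustration, not confirmed details of this repository.

```python
# Sketch: score a query/document pair for RAG relevance with the merged model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/merged-rag-relevance-model"  # hypothetical placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

query = "What is the capital of France?"
document = "Paris is the capital and most populous city of France."

# Encode the query/document pair and predict relevant (1) vs. irrelevant (0).
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(dim=-1).item()
print("relevant" if predicted_label == 1 else "irrelevant")
```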
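
The classification report shown earlier follows the standard `scikit-learn` format. As a sketch, given gold relevance labels and model predictions over the 200-example evaluation set, it could be reproduced as follows (the label lists here are illustrative placeholders, not the actual evaluation data):

```python
# Sketch: generate a classification report from gold labels and predictions.
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1, 0]  # gold relevance labels (placeholder values)
y_pred = [0, 1, 1, 1, 0, 0]  # model predictions (placeholder values)

print(classification_report(y_true, y_pred, digits=2))
```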