Jlonge4 committed on
Commit
46b8283
1 Parent(s): 71f03c2

Update README.md

Files changed (1)
  1. README.md +78 -47
README.md CHANGED
@@ -11,50 +11,81 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # outputs
-
- This model is a fine-tuned version of [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) on the None dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0005
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 8
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 10
- - training_steps: 60
- - mixed_precision_training: Native AMP
-
- ### Training results
-
-
-
- ### Framework versions
-
- - PEFT 0.11.1
- - Transformers 4.41.2
- - Pytorch 2.3.0+cu121
- - Datasets 2.19.1
- - Tokenizers 0.19.1
+ ## Merged Model Performance
+
+ This repository contains our RAG relevance PEFT adapter, fine-tuned from [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).
+
+ ### RAG Relevance Classification Metrics
+
+ Our merged model achieves the following performance on a binary relevance classification task:
+
+ ```
+               precision    recall  f1-score   support
+
+            0       0.74      0.77      0.75       100
+            1       0.76      0.73      0.74       100
+
+     accuracy                           0.75       200
+    macro avg       0.75      0.75      0.75       200
+ weighted avg       0.75      0.75      0.75       200
+ ```
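+
+ The layout above follows scikit-learn's `classification_report`. As a rough illustration only (not the exact evaluation script), a report like this can be produced from the model's one-word outputs and gold labels, mapping "unrelated"/"relevant" to 0/1:
+
+ ```python
+ from sklearn.metrics import classification_report
+
+ # Illustrative data only: gold relevance labels and the model's one-word outputs.
+ gold = ["relevant", "unrelated", "relevant", "unrelated"]
+ predictions = ["relevant", "unrelated", "unrelated", "unrelated"]
+
+ # Map the one-word labels onto the 0/1 classes used in the report above.
+ label_to_int = {"unrelated": 0, "relevant": 1}
+ y_true = [label_to_int[label] for label in gold]
+ y_pred = [label_to_int[pred.strip().lower()] for pred in predictions]
+
+ print(classification_report(y_true, y_pred, digits=2))
+ ```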
+
+ ### Model Usage
+
+ For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, pipeline
+
+
+ def format_input_classification(query, text):
+     prompt = f"""
+ You are comparing a reference text to a question and trying to determine if the reference text
+ contains information relevant to answering the question. Here is the data:
+     [BEGIN DATA]
+     ************
+     [Question]: {query}
+     ************
+     [Reference text]: {text}
+     ************
+     [END DATA]
+ Compare the Question above to the Reference text. You must determine whether the Reference text
+ contains information that can answer the Question. Please focus on whether the very specific
+ question can be answered by the information in the Reference text.
+ Your response must be a single word, either "relevant" or "unrelated",
+ and should not contain any text or characters aside from that word.
+ "unrelated" means that the reference text does not contain an answer to the Question.
+ "relevant" means the reference text contains an answer to the Question."""
+     return prompt
+
+
+ # Base checkpoint and tokenizer; the adapter in this repo was trained from Phi-3-mini-4k-instruct.
+ base_model = "microsoft/Phi-3-mini-4k-instruct"
+ attn_implementation = "eager"  # or "flash_attention_2" if your environment supports it
+ tokenizer = AutoTokenizer.from_pretrained(base_model)
+
+ text = format_input_classification("What is quantization?",
+                                    "Quantization is a method to reduce the memory footprint")
+ messages = [
+     {"role": "user", "content": text}
+ ]
+
+ pipe = pipeline(
+     "text-generation",
+     model=base_model,
+     model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
+     tokenizer=tokenizer,
+ )
+ ```
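+
+ Note that the pipeline above loads only the base checkpoint. Below is a minimal sketch of one way to apply the PEFT adapter from this repo on top of it; the adapter id is a placeholder, and the snippet continues from the code above, reusing `tokenizer` and `messages`:
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM
+
+ # Placeholder adapter id; replace with this repository's id on the Hub.
+ adapter_id = "your-username/phi3-rag-relevance-adapter"
+
+ # Load the base checkpoint and apply the PEFT adapter on top of it.
+ base = AutoModelForCausalLM.from_pretrained(
+     base_model,
+     torch_dtype=torch.float16,
+     attn_implementation=attn_implementation,
+     device_map="auto",
+ )
+ model = PeftModel.from_pretrained(base, adapter_id)
+ model = model.merge_and_unload()  # optionally fold the adapter weights into the base model
+
+ relevance_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
+
+ # Generate the single-word verdict for the chat built above.
+ result = relevance_pipe(messages, max_new_tokens=5, do_sample=False)
+ print(result[0]["generated_text"][-1]["content"])  # expected: "relevant" or "unrelated"
+ ```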
+
+ ### Comparison with Other Models
+
+ We compared our merged model's performance on the RAG relevance eval [1] against several other state-of-the-art language models:
+
+ | Model             | Precision | Recall |     F1 |
+ |-------------------|----------:|-------:|-------:|
+ | Our Merged Model  |      0.74 |   0.77 |   0.75 |
+ | GPT-4             |      0.70 |   0.88 |   0.78 |
+ | GPT-4 Turbo       |      0.68 |   0.91 |   0.78 |
+ | Gemini Pro        |      0.61 |   1.00 |   0.76 |
+ | GPT-3.5           |      0.42 |   1.00 |   0.59 |
+ | PaLM (Text Bison) |      0.53 |   1.00 |   0.69 |
+
+ [1] Scores for the comparison models are from Arize Phoenix.
+
+ As shown in the table, our merged model achieves a comparable F1 score of 0.75 with the highest precision of the group, outperforming GPT-3.5 and PaLM (Text Bison) on F1.
+
+ We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.
+
+ Citations:
+ [1] https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/retrieval-rag-relevance