---
base_model: microsoft/Phi-3.5-mini-instruct
library_name: peft
license: mit
tags:
- trl
- sft
- generated_from_trainer
model-index:
- name: outputs
  results: []
---

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/josh-longenecker1-groundedai/phi3.5-hallucination/runs/re0kg3gs)

## Merged Model Performance

This repository contains our hallucination evaluation PEFT adapter model.

### Hallucination Detection Metrics

Our merged model achieves the following performance on a binary classification task for detecting hallucinations in language model outputs:

```
              precision    recall  f1-score   support

           0       0.77      0.91      0.83       100
           1       0.89      0.73      0.80       100

    accuracy                           0.82       200
   macro avg       0.83      0.82      0.82       200
weighted avg       0.83      0.82      0.82       200
```
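
For reference, a report like the one above can be reproduced with scikit-learn once the judge's "yes"/"no" outputs are mapped to binary labels. The snippet below is a minimal, illustrative sketch rather than our exact evaluation script; the variable names and label mapping are assumptions.

```python
# Illustrative sketch: `examples` holds gold labels and the judge's raw
# "yes"/"no" outputs; 1 = hallucination, 0 = grounded response.
from sklearn.metrics import classification_report

examples = [
    {"label": 1, "judge_output": "yes"},
    {"label": 0, "judge_output": "no"},
    # ... one entry per evaluation example
]

y_true = [ex["label"] for ex in examples]
y_pred = [1 if ex["judge_output"].strip().lower().startswith("yes") else 0
          for ex in examples]

print(classification_report(y_true, y_pred, digits=2))
```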

### Model Usage
For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit). The loading code below is a minimal sketch: replace the placeholder adapter ID with this repository's Hub ID.

```python
# Minimal sketch: load the Phi-3.5 base model, attach this PEFT adapter, and
# merge it for inference. Replace "<adapter-repo-id>" with this repository's
# Hub ID; device_map="auto" requires the `accelerate` package.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

base_model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "<adapter-repo-id>").merge_and_unload()

def format_input(reference, query, response):
    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
    A hallucination occurs when the response is coherent but factually incorrect or nonsensical
    outputs that are not grounded in the provided context.
    You are given the following information:
    ####INFO####
    [Knowledge]: {reference}
    [User Input]: {query}
    [Model Response]: {response}
    ####END INFO####
    Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
    """
    return prompt

text = format_input(reference="The apple mac has the best hardware",
                    query="What computer has the best software?",
                    response="Apple mac")

messages = [
    {"role": "user", "content": text}
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
generation_args = {
    "max_new_tokens": 2,          # the judge only needs to emit "yes" or "no"
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": True,
}

output = pipe(messages, **generation_args)
print(f"Hallucination: {output[0]['generated_text'].strip().lower()}")
# Hallucination: yes
```

### Comparison with Other Models

We compared our merged model's performance on the hallucination detection benchmark against several other state-of-the-art language models:

| Model                 | Precision | Recall | F1     |
|---------------------- |----------:|-------:|-------:|
| Our Merged Model      | 0.77      | 0.91   | 0.83   |
| GPT-4                 | 0.93      | 0.72   | 0.82   |
| GPT-4 Turbo           | 0.97      | 0.70   | 0.81   |
| Gemini Pro            | 0.89      | 0.53   | 0.67   |
| GPT-3.5               | 0.89      | 0.65   | 0.75   |
| GPT-3.5-turbo-instruct| 0.89      | 0.80   | 0.84   |
| Palm 2 (Text Bison)   | 1.00      | 0.44   | 0.61   |
| Claude V2             | 0.80      | 0.95   | 0.87   |

Comparison scores for the other models are taken from arize/phoenix.

As shown in the table, our merged model achieves competitive performance, with an F1 score of 0.83, matching or outperforming several state-of-the-art language models on this hallucination detection task.

## Model description

This model is a fine-tuned version of the Phi-3.5-mini-instruct model, specifically adapted for hallucination detection. It has been trained on the HaluEval dataset to identify when language model outputs contain hallucinations: responses that are coherent but factually incorrect or not grounded in the provided context.

## Intended uses & limitations

This model is intended for use in evaluating the outputs of language models to detect potential hallucinations. It can be integrated into pipelines for content validation, fact-checking, or as a component in larger systems aimed at improving the reliability of AI-generated content.
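
As a minimal sketch of that kind of integration, the helper below (a name of our own choosing, `is_hallucination`) wraps the judge in a boolean check; it assumes the `pipe`, `generation_args`, and `format_input` objects defined in the usage example above.

```python
# Assumes `pipe`, `generation_args`, and `format_input` from the usage example above.
def is_hallucination(reference: str, query: str, response: str) -> bool:
    """Return True when the judge labels the response as a hallucination."""
    messages = [{"role": "user", "content": format_input(reference, query, response)}]
    verdict = pipe(messages, **generation_args)[0]["generated_text"]
    return verdict.strip().lower().startswith("yes")

# Example gate for AI-generated answers (literal strings stand in for your app's values).
flagged = is_hallucination(
    reference="The warranty covers manufacturing defects for 12 months.",
    query="How long is the warranty?",
    response="The warranty lasts five years.",
)
if flagged:
    print("Answer flagged for human review")
```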

Limitations:
- The model's performance may vary depending on the domain and complexity of the input.
- It may not catch all types of hallucinations, especially those that are subtle or require extensive domain knowledge.
- The model should be used as part of a broader strategy for ensuring AI output quality, not as a sole arbiter of truth.

## Training and evaluation data

This model was trained using the HaluEval dataset:

```
@misc{HaluEval,
  author  = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
  title   = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
  journal = {arXiv preprint arXiv:2305.11747},
  year    = {2023},
  url     = {https://arxiv.org/abs/2305.11747}
}
```

The HaluEval dataset is specifically designed for evaluating hallucinations in large language models, making it an ideal choice for training our hallucination detection model.
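
As a rough, hedged sketch of how HaluEval QA records can be turned into labeled prompts for this judge: the file name and field names below are assumptions based on the HaluEval release (verify them against your copy of the data), `format_input` is the helper from the usage example above, and this is not the exact preprocessing script used for training.

```python
# Assumed: qa_data.json from the HaluEval release, stored as JSON Lines with
# `knowledge`, `question`, `right_answer`, and `hallucinated_answer` fields.
import json
import random

with open("qa_data.json") as f:
    records = [json.loads(line) for line in f]

examples = []
for rec in records:
    use_hallucinated = random.random() < 0.5      # roughly balance the two classes
    answer = rec["hallucinated_answer"] if use_hallucinated else rec["right_answer"]
    examples.append({
        "text": format_input(rec["knowledge"], rec["question"], answer),
        "label": "yes" if use_hallucinated else "no",
    })
```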

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 20
- training_steps: 100
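
The snippet below is a hedged sketch of how these settings map onto a TRL `SFTTrainer` run. The exact training script is not included in this repository, the LoRA settings shown are illustrative, `model`, `tokenizer`, `train_dataset`, and `eval_dataset` are assumed to be prepared beforehand, and trainer keyword arguments vary slightly across `trl` versions.

```python
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# LoRA settings are illustrative; the card does not list the adapter configuration.
peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05)

training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective train batch size of 4
    lr_scheduler_type="cosine",
    warmup_steps=20,
    max_steps=100,
    seed=42,
    eval_strategy="steps",
    eval_steps=5,
    logging_steps=5,
)

trainer = SFTTrainer(
    model=model,                     # base model; the PEFT config is applied by the trainer
    args=training_args,
    train_dataset=train_dataset,     # formatted prompt/label examples
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```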

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 2.2594        | 0.5263 | 5    | 2.2572          |
| 1.6785        | 1.0526 | 10   | 1.8170          |
| 1.6015        | 1.5789 | 15   | 1.4296          |
| 1.0556        | 2.1053 | 20   | 1.1199          |
| 0.9412        | 2.6316 | 25   | 1.0660          |
| 0.8872        | 3.1579 | 30   | 1.0523          |
| 0.9157        | 3.6842 | 35   | 1.0713          |
| 0.7735        | 4.2105 | 40   | 1.0983          |
| 0.6182        | 4.7368 | 45   | 1.0816          |
| 0.734         | 5.2632 | 50   | 1.1017          |
| 0.4736        | 5.7895 | 55   | 1.2109          |
| 0.3138        | 6.3158 | 60   | 1.2195          |
| 0.5315        | 6.8421 | 65   | 1.3147          |

### Framework versions

- PEFT 0.12.0
- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1