phi3.5-hallucination-judge / README.md

Jlonge4

	---
	base_model: microsoft/Phi-3.5-mini-instruct
	library_name: peft
	license: mit
	tags:
	- trl
	- sft
	- generated_from_trainer
	model-index:
	- name: outputs
	  results: []
	---

	[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/josh-longenecker1-groundedai/phi3.5-hallucination/runs/re0kg3gs)

	## Merged Model Performance

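	This repository contains a PEFT adapter for hallucination evaluation, fine-tuned from microsoft/Phi-3.5-mini-instruct. "Merged model" below refers to this adapter merged into the base model.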

	### Hallucination Detection Metrics

	Our merged model achieves the following performance on a binary classification task for detecting hallucinations in language model outputs, measured on a balanced set of 200 examples (100 per class):

	```
	              precision    recall  f1-score   support

	           0       0.77      0.91      0.83       100
	           1       0.89      0.73      0.80       100

	    accuracy                           0.82       200
	   macro avg       0.83      0.82      0.82       200
	weighted avg       0.83      0.82      0.82       200
	```
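
The per-class figures above are consistent with 91 of 100 class-0 examples and 73 of 100 class-1 examples being classified correctly. Those counts are inferred from the rounded report and are not published with the model, so treat the sketch below as a sanity check on how the reported numbers relate to each other rather than as the original evaluation code.

```python
# Confusion counts inferred from the rounded report (an assumption, not published data).
# Rows are true labels, columns are predicted labels.
confusion = [
    [91, 9],   # true class 0: 91 predicted as 0, 9 predicted as 1
    [27, 73],  # true class 1: 27 predicted as 0, 73 predicted as 1
]


def per_class_metrics(matrix, label):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_positive = matrix[label][label]
    predicted = sum(row[label] for row in matrix)  # column sum: everything predicted as `label`
    actual = sum(matrix[label])                    # row sum: everything truly `label`
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

for label in range(len(confusion)):
    precision, recall, f1 = per_class_metrics(confusion, label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Rounded to two decimals, these values reproduce the per-class rows and the accuracy line of the report.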

	### Model Usage
	For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):

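	```python
	import torch
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

	# Assumptions: the original snippet did not define these three names.
	base_model_id = "microsoft/Phi-3.5-mini-instruct"
	adapter_id = "phi3.5-hallucination-judge"  # replace with this repo's full Hub id or a local path
	attn_implementation = "eager"  # or "flash_attention_2" if it is installed

	tokenizer = AutoTokenizer.from_pretrained(base_model_id)
	base_model = AutoModelForCausalLM.from_pretrained(
	    base_model_id,
	    torch_dtype=torch.float16,
	    attn_implementation=attn_implementation,
	)
	# Apply the PEFT adapter, then merge it into the base weights.
	base_model = PeftModel.from_pretrained(base_model, adapter_id).merge_and_unload()


	def format_input(reference, query, response):
	    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
	A hallucination occurs when the response is coherent but factually incorrect or nonsensical
	outputs that are not grounded in the provided context.
	You are given the following information:
	####INFO####
	[Knowledge]: {reference}
	[User Input]: {query}
	[Model Response]: {response}
	####END INFO####
	Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
	"""
	    return prompt  # the original returned the builtin `input` by mistake


	text = format_input(
	    reference="The apple mac has the best hardware",
	    query="What computer has the best software?",
	    response="Apple mac",
	)

	messages = [{"role": "user", "content": text}]

	pipe = pipeline("text-generation", model=base_model, tokenizer=tokenizer)

	generation_args = {
	    "max_new_tokens": 2,        # the answer is a single word
	    "return_full_text": False,  # return only the newly generated text
	    "temperature": 0.01,
	    "do_sample": True,
	}

	# The pipeline returns a list with one dict per generated sequence.
	output = pipe(messages, **generation_args)
	print(f"Hallucination: {output[0]['generated_text'].strip().lower()}")
	# Hallucination: yes
	```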
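
Since the prompt asks for only "yes" or "no" and generation is capped at two new tokens, the raw string may still carry stray whitespace, casing, or punctuation. A minimal sketch of a normalizer follows, assuming you want a boolean flag downstream; `parse_verdict` is a hypothetical helper name and is not part of this repository.

```python
def parse_verdict(generated_text: str) -> bool:
    """Map the judge's raw "yes"/"no" text to True (hallucination) or False."""
    cleaned = generated_text.strip().lower().strip(".,!\"'")
    if cleaned.startswith("yes"):
        return True
    if cleaned.startswith("no"):
        return False
    # Anything else means the model did not follow the prompt; do not guess a verdict.
    raise ValueError(f"Unexpected judge output: {generated_text!r}")


print(parse_verdict(" Yes"))  # True
print(parse_verdict("no."))   # False
```

Raising on unexpected text, rather than defaulting to one label, keeps malformed generations from silently skewing the counts.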

	### Comparison with Other Models

	We compared our merged model's performance on the hallucination detection benchmark against several other state-of-the-art language models:

	| Model                  | Precision | Recall |     F1 |
	|------------------------|----------:|-------:|-------:|
	| Our Merged Model       |      0.77 |   0.91 |   0.83 |
	| GPT-4                  |      0.93 |   0.72 |   0.82 |
	| GPT-4 Turbo            |      0.97 |   0.70 |   0.81 |
	| Gemini Pro             |      0.89 |   0.53 |   0.67 |
	| GPT-3.5                |      0.89 |   0.65 |   0.75 |
	| GPT-3.5-turbo-instruct |      0.89 |   0.80 |   0.84 |
	| Palm 2 (Text Bison)    |      1.00 |   0.44 |   0.61 |
	| Claude V2              |      0.80 |   0.95 |   0.87 |

	Scores for the comparison models are from arize/phoenix.

	As the table shows, our merged model is competitive on this hallucination detection task: its F1 score of 0.83 is higher than those of GPT-4 (0.82), GPT-4 Turbo (0.81), GPT-3.5 (0.75), Gemini Pro (0.67), and Palm 2 (0.61), and lower than those of GPT-3.5-turbo-instruct (0.84) and Claude V2 (0.87).

	## Model description

	This model is a fine-tuned version of the Phi-3.5-mini-instruct model, specifically adapted for hallucination detection. It has been trained on the HaluEval dataset to identify when language model outputs contain hallucinations: responses that are coherent but factually incorrect or not grounded in the provided context.

	## Intended uses & limitations

	This model is intended for use in evaluating the outputs of language models to detect potential hallucinations. It can be integrated into pipelines for content validation, fact-checking, or as a component in larger systems aimed at improving the reliability of AI-generated content.

	Limitations:
	- The model's performance may vary depending on the domain and complexity of the input.
	- It may not catch all types of hallucinations, especially those that are subtle or require extensive domain knowledge.
	- The model should be used as part of a broader strategy for ensuring AI output quality, not as a sole arbiter of truth.

	## Training and evaluation data

	This model was trained using the HaluEval dataset:

	@misc{HaluEval,
	  author = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
	  title = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
	  year = {2023},
	  journal = {arXiv preprint arXiv:2305.11747},
	  url = {https://arxiv.org/abs/2305.11747}
	}

	The HaluEval dataset is specifically designed for evaluating hallucinations in large language models, making it an ideal choice for training our hallucination detection model.

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 2
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 4
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 20
	- training_steps: 100
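
To make these settings concrete, the sketch below computes the effective batch size and the learning rate at a few steps. It assumes a single training device and the standard linear-warmup, half-cosine decay shape (the shape Transformers' `get_cosine_schedule_with_warmup` produces with its default arguments); neither assumption is stated in the list above.

```python
import math

PEAK_LR = 5e-05        # learning_rate
WARMUP_STEPS = 20      # lr_scheduler_warmup_steps
TOTAL_STEPS = 100      # training_steps
PER_DEVICE_BATCH = 2   # train_batch_size
GRAD_ACCUM_STEPS = 2   # gradient_accumulation_steps
NUM_DEVICES = 1        # assumption: single device, not stated in the list


def learning_rate_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
print(f"effective batch size: {effective_batch}")
for step in (0, 10, 20, 60, 100):
    print(f"step {step:>3}: lr = {learning_rate_at(step):.2e}")
```

Under these assumptions the effective batch size matches the listed `total_train_batch_size` of 4, and the learning rate peaks at step 20 before decaying toward zero at step 100.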

	### Training results

	| Training Loss | Epoch  | Step | Validation Loss |
	|:-------------:|:------:|:----:|:---------------:|
	| 2.2594        | 0.5263 | 5    | 2.2572          |
	| 1.6785        | 1.0526 | 10   | 1.8170          |
	| 1.6015        | 1.5789 | 15   | 1.4296          |
	| 1.0556        | 2.1053 | 20   | 1.1199          |
	| 0.9412        | 2.6316 | 25   | 1.0660          |
	| 0.8872        | 3.1579 | 30   | 1.0523          |
	| 0.9157        | 3.6842 | 35   | 1.0713          |
	| 0.7735        | 4.2105 | 40   | 1.0983          |
	| 0.6182        | 4.7368 | 45   | 1.0816          |
	| 0.734         | 5.2632 | 50   | 1.1017          |
	| 0.4736        | 5.7895 | 55   | 1.2109          |
	| 0.3138        | 6.3158 | 60   | 1.2195          |
	| 0.5315        | 6.8421 | 65   | 1.3147          |
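
One way to read the table: validation loss bottoms out around step 30 and then rises while training loss keeps falling, which is the usual sign of overfitting. The sketch below finds the lowest validation loss from the logged values; which checkpoint was actually released is not stated here, so this is an illustration of how a checkpoint could be chosen, not a description of what was done.

```python
# (step, training_loss, validation_loss) copied from the table above.
history = [
    (5, 2.2594, 2.2572), (10, 1.6785, 1.8170), (15, 1.6015, 1.4296),
    (20, 1.0556, 1.1199), (25, 0.9412, 1.0660), (30, 0.8872, 1.0523),
    (35, 0.9157, 1.0713), (40, 0.7735, 1.0983), (45, 0.6182, 1.0816),
    (50, 0.7340, 1.1017), (55, 0.4736, 1.2109), (60, 0.3138, 1.2195),
    (65, 0.5315, 1.3147),
]

# Pick the logged step with the lowest validation loss.
best_step, _, best_val = min(history, key=lambda row: row[2])
last_step, last_train, last_val = history[-1]

print(f"lowest validation loss: {best_val:.4f} at step {best_step}")
print(f"train/validation gap at step {last_step}: {last_val - last_train:.4f}")
```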

	### Framework versions

	- PEFT 0.12.0
	- Transformers 4.44.2
	- Pytorch 2.4.0+cu121
	- Datasets 2.21.0
	- Tokenizers 0.19.1