	---
	library_name: transformers
	license: mit
	tags: []
	---

	## Merged Model Performance

	This repository contains our hallucination evaluation PEFT adapter model.

	### Hallucination Detection Metrics

	Our merged model achieves the following performance on a binary classification task for detecting hallucinations in language model outputs:

	```
	              precision    recall  f1-score   support

	           0       0.85      0.71      0.77       100
	           1       0.75      0.87      0.81       100

	    accuracy                           0.79       200
	   macro avg       0.80      0.79      0.79       200
	weighted avg       0.80      0.79      0.79       200
	```
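
The per-class figures in this report are consistent with a single confusion matrix. As a sanity check, the sketch below reconstructs one from the reported recall and support and recomputes precision, F1, and accuracy in plain Python. The raw counts are inferred from rounded values, and reading class 1 as the hallucination label is an assumption; neither is published data.

```python
# Counts inferred from the report as recall x support for each class.
# Assumption: derived from rounded figures, not from published raw counts.
support = {0: 100, 1: 100}
recall = {0: 0.71, 1: 0.87}

tn = round(recall[0] * support[0])  # class 0 examples predicted as class 0
fp = support[0] - tn                # class 0 examples predicted as class 1
tp = round(recall[1] * support[1])  # class 1 examples predicted as class 1
fn = support[1] - tp                # class 1 examples predicted as class 0


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    rec = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * rec / (precision + rec)
    return precision, rec, f1


# Class 1 is assumed to be the hallucination label.
p1, r1, f1_1 = prf(tp, fp, fn)
# For class 0 the roles of the error counts swap.
p0, r0, f1_0 = prf(tn, fn, fp)
accuracy = (tp + tn) / (support[0] + support[1])

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Under these assumptions the recomputed values round to the same numbers as the report, which suggests 29 false positives and 13 false negatives out of 200 examples.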

	### Model Usage
	For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):

	```python
	import torch
	from transformers import pipeline

	def format_input(reference, query, response):
	    """Build the evaluation prompt from the reference text, user query, and model response."""
	    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
	A hallucination occurs when the response is coherent but factually incorrect or nonsensical
	outputs that are not grounded in the provided context.
	You are given the following information:
	####INFO####
	[Knowledge]: {reference}
	[User Input]: {query}
	[Model Response]: {response}
	####END INFO####
	Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
	"""
	    return prompt

	text = format_input(
	    reference="Walrus are the largest mammal",
	    query="""Based on the following
	<context>Walrus are the largest mammal</context>
	answer the question
	<query> What is the best PC?</query>""",
	    response="The best PC is the mac",
	)

	messages = [
	    {"role": "user", "content": text}
	]

	# `base_model`, `tokenizer`, and `attn_implementation` must be defined beforehand
	# (for example, attn_implementation = "eager").
	pipe = pipeline(
	    "text-generation",
	    model=base_model,
	    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
	    tokenizer=tokenizer,
	)

	# A near-zero temperature keeps the short yes/no answer close to deterministic.
	generation_args = {
	    "max_new_tokens": 2,
	    "return_full_text": False,
	    "temperature": 0.01,
	    "do_sample": True,
	}

	output = pipe(messages, **generation_args)
	print(f"Hallucination: {output[0]['generated_text'].strip().lower()}")
	# Hallucination: yes
	```
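
Because the model is asked for a bare "yes" or "no", downstream code usually wants a boolean. A minimal sketch of such a parser follows. `parse_verdict` is a hypothetical helper, not part of this repository, and treating anything other than a leading "yes" or "no" as undetermined is an assumption.

```python
from typing import Optional


def parse_verdict(generated_text: str) -> Optional[bool]:
    """Map the raw answer to True (hallucination), False (no hallucination), or None."""
    words = generated_text.strip().lower().split()
    # Look only at the first word, ignoring surrounding punctuation and quotes.
    first = words[0].strip(".,!\"'") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None  # unexpected output; the caller decides how to handle it


print(parse_verdict(" Yes"))
print(parse_verdict("no."))
print(parse_verdict("maybe"))
```

Returning `None` instead of guessing keeps malformed generations visible, so they can be counted or retried rather than silently scored.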

	### Comparison with Other Models

	We compared our merged model's performance on the hallucination detection benchmark against several other state-of-the-art language models:

	| Model                  | Precision | Recall |   F1 |
	|------------------------|----------:|-------:|-----:|
	| Our Merged Model       |      0.75 |   0.87 | 0.81 |
	| GPT-4                  |      0.93 |   0.72 | 0.82 |
	| GPT-4 Turbo            |      0.97 |   0.70 | 0.81 |
	| Gemini Pro             |      0.89 |   0.53 | 0.67 |
	| GPT-3.5                |      0.89 |   0.65 | 0.75 |
	| GPT-3.5-turbo-instruct |      0.89 |   0.80 | 0.84 |
	| Palm 2 (Text Bison)    |      1.00 |   0.44 | 0.61 |
	| Claude V2              |      0.80 |   0.95 | 0.87 |

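	As shown in the table, our merged model reaches an F1 score of 0.81, matching GPT-4 Turbo and outperforming GPT-3.5 (0.75), Gemini Pro (0.67), and Palm 2 (0.61), while trailing GPT-4 (0.82), GPT-3.5-turbo-instruct (0.84), and Claude V2 (0.87). Its recall of 0.87 is second only to Claude V2 (0.95), at the cost of the lowest precision in the comparison (0.75).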
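
To make the ranking explicit, the table's values can be sorted by F1. The snippet below only copies the figures from the table and is meant as an illustration, not as a reproduction of the benchmark.

```python
# (precision, recall, F1) copied from the comparison table.
scores = {
    "Our Merged Model": (0.75, 0.87, 0.81),
    "GPT-4": (0.93, 0.72, 0.82),
    "GPT-4 Turbo": (0.97, 0.70, 0.81),
    "Gemini Pro": (0.89, 0.53, 0.67),
    "GPT-3.5": (0.89, 0.65, 0.75),
    "GPT-3.5-turbo-instruct": (0.89, 0.80, 0.84),
    "Palm 2 (Text Bison)": (1.00, 0.44, 0.61),
    "Claude V2": (0.80, 0.95, 0.87),
}

# Sort by F1, highest first; ties keep their table order.
ranked = sorted(scores.items(), key=lambda item: item[1][2], reverse=True)

our_f1 = scores["Our Merged Model"][2]
better = [name for name, (_, _, f1) in scores.items() if f1 > our_f1]

for name, (precision, recall, f1) in ranked:
    print(f"{name:<24} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print(f"Models with a higher F1 than ours: {len(better)}")
```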

	We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.

	Citations:
	Scores from arize/phoenix

	### Training Data

	@misc{HaluEval,
	  author  = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
	  title   = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
	  year    = {2023},
	  journal = {arXiv preprint arXiv:2305.11747},
	  url     = {https://arxiv.org/abs/2305.11747}
	}

	### Framework versions

	- PEFT 0.11.1
	- Transformers 4.41.2
	- Pytorch 2.3.0+cu121
	- Datasets 2.19.2
	- Tokenizers 0.19.1

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0001
	- train_batch_size: 2
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 8
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 10
	- training_steps: 150
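
To illustrate how these settings interact, the sketch below derives the effective batch size and evaluates a linear warmup and decay schedule at a few steps. It assumes a single device and the usual shape of a linear schedule with warmup (ramp from 0 to the peak rate over the warmup steps, then linear decay to 0 at the final step). `lr_at` is a hypothetical helper written for this example, not code from the training run.

```python
learning_rate = 1e-4
train_batch_size = 2
gradient_accumulation_steps = 4
warmup_steps = 10
training_steps = 150

# Assumption: one device, so the effective batch is per-device batch x accumulation.
total_train_batch_size = train_batch_size * gradient_accumulation_steps
examples_seen = total_train_batch_size * training_steps


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step under linear warmup, then linear decay."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, training_steps - step)
    return learning_rate * remaining / (training_steps - warmup_steps)


print(f"effective batch size: {total_train_batch_size}")
print(f"examples seen over training: {examples_seen}")
for step in (0, 5, 10, 80, 150):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

With only 150 optimizer steps, the 10 warmup steps cover roughly the first 7% of training, and the rate is back to half its peak by step 80.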
