	---
	license: llama3.1
	base_model:
	- meta-llama/Meta-Llama-3.1-8B-Instruct
	tags:
	- Text Generation
	- llama3.1
	- text-generation-inference
	- Inference Endpoints
	- Transformers
	- Fusion
	language:
	- en
	---
	# Llama-3.1-8B-Fusion-5050

	## Overview
	`Llama-3.1-8B-Fusion-5050` is a mixed model that blends the weights of two Llama 3.1 8B models: [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) and [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated).
	The weights are blended in a 5:5 ratio: 50% from SuperNova-Lite and 50% from the abliterated Meta-Llama-3.1-8B-Instruct model.
	Although this is a simple mix, the model is usable and has not produced gibberish so far.
	This is an experiment. I tested the [9:1](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-9010), [8:2](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-8020), [7:3](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-7030), [6:4](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-6040) and [5:5](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-5050) ratios separately to see how much the ratio affects the model.
	Evaluation results for all of these models are reported in the Evaluations section.

	## Model Details
	- Base Models:
	  - [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) (50%)
	  - [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated) (50%)
	- Model Size: 8B parameters
	- Architecture: Llama 3.1
	- Mixing Ratio: 5:5 (SuperNova-Lite:Meta-Llama-3.1-8B-Instruct-abliterated)
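
The 5:5 mixing ratio above can be read as an element-wise weighted average of the two checkpoints' parameters. The sketch below assumes plain linear interpolation; this card does not state the exact merge procedure, and `blend_state_dicts` is a hypothetical helper name, not the script used to build this model. Scalars stand in for tensors so the example runs without downloading any weights.

```python
from typing import Any, Dict


def blend_state_dicts(sd_a: Dict[str, Any], sd_b: Dict[str, Any], weight_a: float = 0.5) -> Dict[str, Any]:
    """Blend two parameter dictionaries by linear interpolation (hypothetical helper).

    Values may be floats or tensors: anything that supports `*` and `+`.
    """
    if sd_a.keys() != sd_b.keys():
        raise ValueError("Both models must expose the same parameter names.")
    weight_b = 1.0 - weight_a
    return {name: weight_a * sd_a[name] + weight_b * sd_b[name] for name in sd_a}


# Toy parameters: scalars stand in for the real weight tensors (assumption for illustration).
toy_a = {"layer.weight": 2.0, "layer.bias": 0.0}
toy_b = {"layer.weight": 4.0, "layer.bias": 1.0}

blended = blend_state_dicts(toy_a, toy_b, weight_a=0.5)
print(blended)
```

With real checkpoints, the same function could in principle be applied to the two models' `state_dict()` outputs, preferably after casting the tensors to `float32` to limit rounding error before converting back to `bfloat16`.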

	## Key Features
	- SuperNova-Lite Contributions (50%): Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
	- Meta-Llama-3.1-8B-Instruct-abliterated Contributions (50%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.

	## Usage
	You can use this mixed model in your applications by loading it with Hugging Face's `transformers` library:

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
	import time

	mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-5050"

	# Check if CUDA is available
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Load model and tokenizer
	mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
	tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

	# Ensure the tokenizer has pad_token_id set
	tokenizer.pad_token_id = tokenizer.eos_token_id

	# Input loop
	print("Start inputting text for inference (type 'exit' to quit)")
	while True:
	    prompt = input("Enter your prompt: ")
	    if prompt.lower() == "exit":
	        print("Exiting inference loop.")
	        break

	    # Inference phase: generate text using the mixed model
	    chat = [
	        {"role": "system", "content": "You are a helpful assistant."},
	        {"role": "user", "content": prompt},
	    ]

	    # Prepare input data
	    input_ids = tokenizer.apply_chat_template(
	        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
	    ).to(device)

	    # Use TextStreamer for streaming output
	    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

	    # Record the start time
	    start_time = time.time()

	    # Generate text and stream the output as tokens are produced
	    outputs = mixed_model.generate(
	        input_ids,
	        max_new_tokens=8192,
	        do_sample=True,
	        temperature=0.6,
	        top_p=0.9,
	        streamer=streamer,  # Enable streaming output
	    )

	    # Record the end time
	    end_time = time.time()

	    # Calculate the number of generated tokens (output length minus prompt length)
	    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

	    # Calculate the total time taken
	    total_time = end_time - start_time

	    # Calculate tokens generated per second
	    tokens_per_second = generated_tokens / total_time

	    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")

	```
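
The throughput figure printed at the end of the loop above is just generated tokens divided by elapsed wall-clock time. A minimal sketch of that arithmetic as a reusable function follows; the name `throughput` and the zero-time guard are assumptions added for illustration and are not part of the original script.

```python
def throughput(generated_tokens: int, elapsed_seconds: float) -> float:
    """Return tokens generated per second (hypothetical helper).

    Returns 0.0 when no measurable time elapsed, to avoid dividing by zero.
    """
    if elapsed_seconds <= 0:
        return 0.0
    return generated_tokens / elapsed_seconds


# Example: 256 new tokens produced in 8 seconds of wall-clock time.
print(f"{throughput(256, 8.0):.2f} tokens per second")
```

Because the timer wraps the whole `generate` call, the measured time includes prompt processing as well as decoding, so short generations will tend to report lower throughput.
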
	## Evaluations

	All scores below were re-evaluated; each value is the average score for that benchmark.

	| Benchmark | SuperNova-Lite | Meta-Llama-3.1-8B-Instruct-abliterated | Llama-3.1-8B-Fusion-9010 | Llama-3.1-8B-Fusion-8020 | Llama-3.1-8B-Fusion-7030 | Llama-3.1-8B-Fusion-6040 | Llama-3.1-8B-Fusion-5050 |
	|-------------|----------------|----------------------------------------|--------------------------|--------------------------|--------------------------|--------------------------|--------------------------|
	| IF_Eval | 82.09 | 76.29 | 82.44 | 82.93 | 83.10 | 82.94 | 82.03 |
	| MMLU Pro | 35.87 | 33.10 | 35.65 | 35.32 | 34.91 | 34.50 | 33.96 |
	| TruthfulQA | 64.35 | 53.25 | 62.67 | 61.04 | 59.09 | 57.80 | 56.75 |
	| BBH | 49.48 | 44.87 | 48.86 | 48.47 | 48.30 | 48.19 | 47.93 |
	| GPQA | 31.98 | 29.50 | 32.25 | 32.38 | 32.61 | 31.14 | 30.60 |

	The script used for evaluation is in this repository at [`eval.sh`](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-5050/blob/main/eval.sh).
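
To compare the columns of the evaluation table at a glance, one option is an unweighted mean across the five benchmarks. The sketch below copies the scores from the table; the mean is an illustrative summary only, not a metric reported by this card or produced by `eval.sh`.

```python
from statistics import mean

# Scores copied from the evaluation table, in row order:
# IF_Eval, MMLU Pro, TruthfulQA, BBH, GPQA.
scores = {
    "SuperNova-Lite": [82.09, 35.87, 64.35, 49.48, 31.98],
    "Meta-Llama-3.1-8B-Instruct-abliterated": [76.29, 33.10, 53.25, 44.87, 29.50],
    "Llama-3.1-8B-Fusion-9010": [82.44, 35.65, 62.67, 48.86, 32.25],
    "Llama-3.1-8B-Fusion-8020": [82.93, 35.32, 61.04, 48.47, 32.38],
    "Llama-3.1-8B-Fusion-7030": [83.10, 34.91, 59.09, 48.30, 32.61],
    "Llama-3.1-8B-Fusion-6040": [82.94, 34.50, 57.80, 48.19, 31.14],
    "Llama-3.1-8B-Fusion-5050": [82.03, 33.96, 56.75, 47.93, 30.60],
}

# Unweighted mean over the five benchmarks (illustrative summary, an assumption of this sketch).
averages = {model: mean(values) for model, values in scores.items()}

# Print models from highest to lowest mean score.
for model, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:<40} {avg:.2f}")
```

An unweighted mean treats every benchmark as equally important, so it can hide per-benchmark differences such as the TruthfulQA gap between the two base models.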