license: llama3.1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- Text Generation
- llama3.1
- text-generation-inference
- Inference Endpoints
- Transformers
- Fusion
language:
- en
Llama-3.1-8B-Fusion-8020
Overview
Llama-3.1-8B-Fusion-8020 is a mixed model that combines two Llama-based models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. The weights are blended in an 8:2 ratio, with 80% of the weights from SuperNova-Lite and 20% from the abliterated Meta-Llama-3.1-8B-Instruct model.
Although it is a simple mix, the model is usable and has not produced gibberish.
This is an experiment. Later, I will test the 9:1 (https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-9010), 7:3, 6:4, and 5:5 ratios separately to see how much the ratio affects the model.
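The card does not state the exact merging procedure, so the following is only a minimal sketch of one plausible reading: a per-parameter linear interpolation between two models that share the same parameter names. The helper names `ratio_to_fraction` and `blend_state_dicts` are hypothetical and are not part of any library; the toy dictionaries stand in for real `state_dict()` contents.

```python
from typing import Dict, Mapping, TypeVar

T = TypeVar("T")  # any value supporting float * T and T + T (floats, torch tensors, ...)


def ratio_to_fraction(ratio: str) -> float:
    """Convert a ratio string such as "7:3" into the first model's share (0.7)."""
    first, second = (float(part) for part in ratio.split(":"))
    return first / (first + second)


def blend_state_dicts(
    primary: Mapping[str, T], secondary: Mapping[str, T], primary_fraction: float
) -> Dict[str, T]:
    """Linearly interpolate two parameter dictionaries (assumed merging scheme)."""
    if not 0.0 <= primary_fraction <= 1.0:
        raise ValueError("primary_fraction must lie in [0, 1]")
    if primary.keys() != secondary.keys():
        raise ValueError("both models must expose the same parameter names")
    return {
        name: primary_fraction * primary[name]
        + (1.0 - primary_fraction) * secondary[name]
        for name in primary
    }


# Toy stand-ins for two models' parameters (illustrative values only).
toy_primary = {"layer.weight": 1.0, "layer.bias": 0.0}
toy_secondary = {"layer.weight": 0.0, "layer.bias": 1.0}

# Blend at each of the ratios the experiment plans to compare.
blends = {
    ratio: blend_state_dicts(toy_primary, toy_secondary, ratio_to_fraction(ratio))
    for ratio in ("7:3", "6:4", "5:5")
}
```

Because the function only relies on scalar multiplication and addition, the same code should also accept the tensors returned by a PyTorch `state_dict()`, provided both models have identical parameter names and shapes.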
Model Details
- Base Models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
- Model Size: 8B parameters
- Architecture: Llama 3.1
- Mixing Ratio: 8:2 (SuperNova-Lite:Meta-Llama-3.1-8B-Instruct-abliterated)
Key Features
- SuperNova-Lite Contributions (80%): Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
- Meta-Llama-3.1-8B-Instruct-abliterated Contributions (20%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.
Usage
You can use this mixed model in your applications by loading it with Hugging Face's transformers library:
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-8020"

# Use the GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: generate text using the mixed model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream the output as tokens are produced
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Count only the newly generated tokens (exclude the prompt)
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")
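The token-count and throughput arithmetic at the end of the loop can be factored into a small helper if you want to reuse it outside the interactive script. This is only a sketch: `generation_stats` is a hypothetical name, and the zero-time guard is an addition that the script above does not have.

```python
from typing import Sequence, Tuple


def generation_stats(
    prompt_length: int, output_ids: Sequence[int], elapsed_seconds: float
) -> Tuple[int, float]:
    """Return (newly generated token count, tokens per second).

    output_ids is the full sequence returned by generate(), which starts
    with the prompt tokens, so the prompt length is subtracted first.
    """
    generated = len(output_ids) - prompt_length
    if generated < 0:
        raise ValueError("output is shorter than the prompt")
    # Guard against a zero or negative timer reading instead of dividing by zero.
    rate = generated / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return generated, rate


# Toy example: a 4-token prompt followed by 3 generated tokens in 1.5 seconds.
count, rate = generation_stats(4, [11, 12, 13, 14, 21, 22, 23], 1.5)
```

In the script above, the equivalent call would be `generation_stats(input_ids.shape[-1], outputs[0], end_time - start_time)`.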
Evaluations
We will be submitting this model to the OpenLLM Leaderboard for a more conclusive benchmark - but here are our internal benchmarks using the main branch of lm evaluation harness: