metadata

license: llama3.1
base_model:
  - meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
  - Text Generation
  - llama3.1
  - text-generation-inference
  - Inference Endpoints
  - Transformers
  - Fusion
language:
  - en

Llama-3.1-8B-Fusion-8020

Overview

Llama-3.1-8B-Fusion-8020 is a mixed model that combines two Llama-based models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. The weights are blended in an 8:2 ratio, with 80% of the weights from SuperNova-Lite and 20% from the abliterated Meta-Llama-3.1-8B-Instruct model. Although this is a simple mix, the model is usable, and I have not seen it produce gibberish. This is an experiment; later, I will test the 9:1, 7:3, 6:4, and 5:5 ratios separately to see how much the ratio affects the model.
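To make the ratio concrete, here is a minimal sketch of a weighted blend, assuming the mix is a plain per-parameter linear interpolation. The exact procedure used to build this model is not described here, so treat this as an illustration rather than the actual recipe. The function name blend_state_dicts and its arguments are hypothetical, and the toy values are plain floats standing in for weight tensors.

```python
def blend_state_dicts(primary, secondary, primary_weight=0.8):
    """Linearly interpolate two parameter mappings that share the same keys.

    Hypothetical helper: values may be floats or any objects that support
    scalar multiplication and addition (for example torch tensors).
    """
    if not 0.0 <= primary_weight <= 1.0:
        raise ValueError("primary_weight must be between 0 and 1")
    if primary.keys() != secondary.keys():
        raise ValueError("both models must expose the same parameter names")

    secondary_weight = 1.0 - primary_weight
    return {
        name: primary_weight * primary[name] + secondary_weight * secondary[name]
        for name in primary
    }


# Toy demonstration: scalars stand in for the real weight tensors.
toy_primary = {"layer.weight": 1.0, "layer.bias": 0.5}    # plays the 80% model
toy_secondary = {"layer.weight": 0.0, "layer.bias": 1.5}  # plays the 20% model
blended = blend_state_dicts(toy_primary, toy_secondary, primary_weight=0.8)
```

With real checkpoints, the same function could in principle be applied to the two models' state_dict() outputs and the result loaded back with load_state_dict(). Changing primary_weight to 0.9, 0.7, 0.6, or 0.5 would correspond to the other ratios mentioned above.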

Model Details

Base Models:
- arcee-ai/Llama-3.1-SuperNova-Lite (80%)
- mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated (20%)
Model Size: 8B parameters
Architecture: Llama 3.1
Mixing Ratio: 8:2 (SuperNova-Lite:Meta-Llama-3.1-8B-Instruct-abliterated)

Key Features

SuperNova-Lite Contributions (80%): Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
Meta-Llama-3.1-8B-Instruct-abliterated Contributions (20%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.

Usage

You can use this mixed model in your applications by loading it with Hugging Face's transformers library:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-8020"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: build the chat messages for the mixed model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream the output as it is produced
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Calculate the number of generated tokens
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")

Evaluations

We will be submitting this model to the Open LLM Leaderboard for a more conclusive benchmark, but here are our internal benchmarks using the main branch of lm-evaluation-harness: