File size: 4,143 Bytes
aad0fd7
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e8e04ed
aad0fd7
2615679
aad0fd7
 
 
e16e93c
 
aad0fd7
 
 
 
 
e16e93c
 
aad0fd7
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
---
license: llama3.1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- Text Generation
- llama3.1
- text-generation-inference
- Inference Endpoints
- Transformers
- Fusion
language:
- en
---
# Llama-3.1-8B-Fusion-8020

## Overview
`Llama-3.1-8B-Fusion-8020` is a mixed model that combines the strengths of two powerful Llama-based models: [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) and [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated). The weights are blended in a 8:2 ratio, with 80% of the weights from SuperNova-Lite and 20% from the abliterated Meta-Llama-3.1-8B-Instruct model.
**Although it's a simple mix, the model is usable, and no gibberish has appeared**.
This is an experiment. Later, I will test the [9:1](https://huggingface.co/huihui-ai/Llama-3.1-8B-Fusion-9010), 7:3, 6:4, and 5:5 ratios separately to see how much impact they have on the model.

## Model Details
- **Base Models:**
  - [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) (80%)
  - [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated) (20%)
- **Model Size:** 8B parameters
- **Architecture:** Llama 3.1
- **Mixing Ratio:** 9:1 (SuperNova-Lite:Meta-Llama-3.1-8B-Instruct-abliterated)

## Key Features
- **SuperNova-Lite Contributions (80%):** Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
- **Meta-Llama-3.1-8B-Instruct-abliterated Contributions (20%):** This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.

## Usage
You can use this mixed model in your applications by loading it with Hugging Face's `transformers` library:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-8020"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: Generate text using the modified model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream output character by character
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Calculate the number of generated tokens
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")

```
## Evaluations
We will be submitting this model to the OpenLLM Leaderboard for a more conclusive benchmark - but here are our internal benchmarks using the main branch of lm evaluation harness: