---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
base_model: unsloth/llama-3-8b-Instruct-bnb-4bit
---

# Uploaded model

- **Developed by:** AmaanUsmani
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3-8b-Instruct-bnb-4bit

This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
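
The exact training recipe is not included in this card. As a rough orientation only, the sketch below shows how a LoRA fine-tune of the same 4-bit base model is typically set up with Unsloth and TRL's `SFTTrainer`; the dataset file, LoRA rank, and training arguments here are placeholders, not the values used for this model.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit instruct base model this card was fine-tuned from.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach LoRA adapters (rank, alpha, and target modules are illustrative defaults).
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset: a single "text" column holding fully formatted chat prompts.
dataset = load_dataset("json", data_files = "chat_data.jsonl", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()
```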

## How to run inference

Please note: the code below for downloading the model and running inference is not yet optimized; this will be improved in a future update.

```python
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes scikit-learn scipy auto-gptq optimum joblib threadpoolctl
```

```python
from unsloth import FastLanguageModel
from transformers import TextStreamer

max_seq_length = 2048  # Choose any! Unsloth auto-supports RoPE scaling internally.
dtype = None           # None for auto detection. Float16 for Tesla T4/V100, Bfloat16 for Ampere+.
load_in_4bit = True    # Use 4-bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "AmaanUsmani/Llama3-8b-DynamicChat-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

instructions_string = """You're a conversational agent designed to engage users in dynamic interactions. Your goal is to facilitate more meaningful exchanges by enhancing the model's understanding of user input. You should aim to create an environment where users feel heard, understood, and engaged in ongoing dialogue. As long as the user's question doesn't include any personal details or context related to the user, do not ask questions back. If the user's question involves more context, first provide general information or advice and then ask a follow-up question regarding the additional context needed.
Please respond to the following comment.
"""

# Build a Llama 3 chat prompt from the system instructions and the user's comment.
prompt_template = lambda comment: f'''<|begin_of_text|><|start_header_id|>system<|end_header_id|>{instructions_string}<|eot_id|><|start_header_id|>user<|end_header_id|>\n{comment}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n'''

comment = "I want to learn how to swim"
prompt = prompt_template(comment)

model.eval()
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)  # streams tokens to stdout as they are generated
outputs = model.generate(**inputs, streamer=text_streamer, max_new_tokens=500)

# Keep only the assistant's reply from the decoded output.
response = tokenizer.decode(outputs[0], skip_special_tokens=True).split("assistant\n")[-1].strip()
print(response)
```
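
Since the hand-written prompt above follows the standard Llama 3 chat format, the same prompt can usually be produced with the tokenizer's built-in chat template instead. The sketch below assumes the tokenizer ships with the default Llama 3 chat template and reuses `model`, `tokenizer`, `instructions_string`, and `comment` from the snippet above.

```python
# Alternative prompt construction via the tokenizer's chat template
# (assumes the default Llama 3 chat template is attached to this tokenizer).
messages = [
    {"role": "system", "content": instructions_string},
    {"role": "user", "content": comment},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,  # append the assistant header so generation starts there
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = input_ids, max_new_tokens = 500)
# Decode only the newly generated tokens, i.e. the assistant's reply.
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens = True).strip()
print(response)
```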