metadata
library_name: transformers
tags:
- 5bit
- llama
- llama-2
- facebook
- meta
- 7b
- quantized
- ExLlamaV2
- exl2
- 5.0-bpw
license: llama2
pipeline_tag: text-generation
Model Card for alokabhishek/Llama-2-7b-chat-hf-5.0-bpw-exl2
This repo contains a 5-bit (5.0 bpw) quantized version of Meta's meta-llama/Llama-2-7b-chat-hf, produced with ExLlamaV2.
Model Details
- Model creator: Meta
- Original model: Llama-2-7b-chat-hf
About quantization using ExLlamaV2
- ExLlamaV2 GitHub repo: https://github.com/turboderp/exllamav2
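For reference, exl2 quantizations like this one are produced with ExLlamaV2's convert.py script. A sketch of a typical conversion command follows; the paths are placeholders and the exact flags may differ between ExLlamaV2 versions:

# Quantize the original fp16 model to 5.0 bits per weight (paths are placeholders)
!python exllamav2/convert.py \
    -i /path/to/Llama-2-7b-chat-hf \
    -o /path/to/working_dir \
    -cf /path/to/Llama-2-7b-chat-hf-5.0-bpw-exl2 \
    -b 5.0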
How to Get Started with the Model
Use the code below to get started with the model.
How to run from Python code
First, install the ExLlamaV2 package:
# Install ExLlamaV2
!git clone https://github.com/turboderp/exllamav2
!pip install -e exllamav2
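Alternatively, ExLlamaV2 is also published on PyPI; installing the prebuilt package may be simpler if a wheel matching your CUDA/PyTorch setup is available (a sketch, not the method this card was tested with):

# Alternative: install the prebuilt package from PyPI
!pip install exllamav2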
Import
from huggingface_hub import login, HfApi, create_repo
from torch import bfloat16
import locale
import torch
import os
Set up variables
# Define the model ID for the desired model
model_id = "alokabhishek/Llama-2-7b-chat-hf-5.0-bpw-exl2"
BPW = 5.0
# define variables
model_name = model_id.split("/")[-1]
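The download step below clones the repo over HTTPS using a Hugging Face username and access token; a minimal sketch of defining those credentials (both values are placeholders you supply yourself):

# Placeholders -- replace with your own Hugging Face account details
username = "your-hf-username"
HF_TOKEN = "your-hf-access-token"  # a read-scoped token from huggingface.co/settings/tokens

# Optional: authenticate huggingface_hub calls with the same token
login(token=HF_TOKEN)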
Download the quantized model
!git-lfs install
# Download the model to a local directory
!git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id} {model_name}
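As an alternative to git clone, the same files can be fetched with huggingface_hub's snapshot_download (a sketch; local_dir mirrors the model_name directory used above):

from huggingface_hub import snapshot_download

# Download the quantized weights into a local folder named after the model
snapshot_download(repo_id=model_id, local_dir=model_name, token=HF_TOKEN)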
Run inference on the quantized model using ExLlamaV2's test_inference.py script
# Run model
!python exllamav2/test_inference.py -m {model_name}/ -p "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
Alternatively, run inference directly from Python:

import sys, os

# Only needed when this script lives inside the cloned exllamav2 repo,
# so that the local package is importable without installation
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from exllamav2 import (
ExLlamaV2,
ExLlamaV2Config,
ExLlamaV2Cache,
ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
import time
# Initialize model and cache
model_directory = "/model_path/Llama-2-7b-chat-hf-5.0-bpw-exl2/"
print("Loading model: " + model_directory)
config = ExLlamaV2Config(model_directory)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
# Generate some text
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.01
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
max_new_tokens = 512
generator.warmup()
time_begin = time.time()
output = generator.generate_simple(prompt, settings, max_new_tokens, seed=1234)
time_end = time.time()
time_total = time_end - time_begin
print(output)
print(f"Response generated in {time_total:.2f} seconds")
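Note that Llama-2-chat models were trained with a specific prompt template; for chat-style prompts, wrapping the input in that template generally produces better responses. A minimal sketch, assuming the standard Llama 2 [INST] format:

# Wrap the raw prompt in the Llama 2 chat template before generating
system_prompt = "You are a helpful assistant."
chat_prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{prompt} [/INST]"

output = generator.generate_simple(chat_prompt, settings, max_new_tokens, seed=1234)
print(output)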
Uses
Direct Use
[More Information Needed]
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]