Tags: Text Generation, Transformers, Safetensors, English, llama, causal-lm, text-generation-inference, 4-bit precision, gptq

stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors not compatible with "standard" settings

#1
by vmajor - opened

When attempting to load stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors using the standard Python code that I use to test all other GPTQ models, I get this:

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        size mismatch for model.layers.0.self_attn.k_proj.qzeros: copying a param with shape torch.Size([40, 640]) from checkpoint, the shape in current model is torch.Size([1, 640]).
        size mismatch for model.layers.0.self_attn.k_proj.scales: copying a param with shape torch.Size([40, 5120]) from checkpoint, the shape in current model is torch.Size([1, 5120]).
...

First thing to do is double check the file downloaded OK - sha256sum it.

root@6d4bbc85231a:~/gptq-llama# sha256sum /workspace/stable-vicuna-13B-GPTQ/stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors
442d71b56bc16721d28aeb2d5e0ba07cf04bfb61cc7af47993d5f0a15133b520  /workspace/stable-vicuna-13B-GPTQ/stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors

I ran your single loop test code and it runs OK (bit of a response problem here, but no errors!):

root@6d4bbc85231a:~/gptq-llama# python do_gptq_inf.py /workspace/stable-vicuna-13B-GPTQ stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors --text "### Human: write a story about llamas
### Assistant:"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12/12 [00:47<00:00,  3.97s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 12/12 [00:41<00:00,  3.46s/it]
Done.
Output: </s>### Human: write a story about llamas
### Assistant: Once upon a time, there lived a llama named Llama. He was a special creature, with a heart of gold and a love for all things equine. Llama was a horse with a dream, to ride across the land and be a star.
### He loved to sing and dance, with a twinkle in his eye.
### He was a horse of a different color, with a heart of gold and a love for all.
### Llama was a horse of a different color, with a heart of gold and a love for all.
### He loved to ride and sing, with a twinkle in his eye.
### Llama was a horse of a different color, with a heart of gold and a love for all.
### He was a horse of a different color, with a heart of gold and a love for all.
[... the same two lines - "### Llama was a horse of a different color, with a heart of gold and a love for all." and "### He was a horse of a different color, with a heart of gold and a love for all." - repeat for the remainder of the output, ending with a bare "###" ...]

Code used to test:

import torch
import torch.nn as nn
import quant
from gptq import GPTQ
from utils import find_layers, DEV, set_seed, get_wikitext2, get_ptb, get_c4, get_ptb_new, get_c4_new, get_loaders
import transformers
from transformers import AutoTokenizer
import argparse
import warnings

# Suppress warnings from the specified modules
warnings.filterwarnings("ignore", module="safetensors")
warnings.filterwarnings("ignore", module="torch")

def get_llama(model):

    def skip(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = skip
    torch.nn.init.uniform_ = skip
    torch.nn.init.normal_ = skip
    from transformers import LlamaForCausalLM
    model = LlamaForCausalLM.from_pretrained(model, torch_dtype='auto')
    model.seqlen = 2048
    return model


def load_quant(model, checkpoint, wbits, groupsize=-1, fused_mlp=True, eval=True, warmup_autotune=True):
    from transformers import LlamaConfig, LlamaForCausalLM
    config = LlamaConfig.from_pretrained(model)

    def noop(*args, **kwargs):
        pass

    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    model = LlamaForCausalLM(config)
    torch.set_default_dtype(torch.float)
    if eval:
        model = model.eval()
    layers = find_layers(model)
    for name in ['lm_head']:
        if name in layers:
            del layers[name]
    quant.make_quant_linear(model, layers, wbits, groupsize)

    del layers

    print('Loading model ...')
    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint), strict=False)
    else:
        model.load_state_dict(torch.load(checkpoint), strict=False)

    quant.make_quant_attn(model)
    if eval and fused_mlp:
        quant.make_fused_mlp(model)

    if warmup_autotune:
        quant.autotune_warmup_linear(model, transpose=not (eval))
        if eval and fused_mlp:
            quant.autotune_warmup_fused(model)
    model.seqlen = 2048
    print('Done.')

    return model

def run_llama_inference(
    model_path,
    wbits=4,
    groupsize=-1,
    load_path="",
    text="",
    min_length=10,
    max_length=1024,
    top_p=0.7,
    temperature=0.8,
    device=0,
):

    if load_path:
        model = load_quant(model_path, load_path, wbits, groupsize)
    else:
        model = get_llama(model_path)
        model.eval()

    model.to(DEV)
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
    input_ids = tokenizer.encode(text, return_tensors="pt").to(DEV)

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            do_sample=True,
            min_length=min_length,
            max_length=max_length,
            top_p=top_p,
            temperature=temperature,
        )
    return tokenizer.decode([el.item() for el in generated_ids[0]])

def main():
    parser = argparse.ArgumentParser(description="Summarize an article using Vicuna.")
    parser.add_argument('model_dir', type=str, help='The model dir to load from')
    parser.add_argument('model_file', type=str, help='The model file to load')
    parser.add_argument('--text', type=str, required=True, help='The text to summarize.')
    parser.add_argument('--wbits', type=int, default=4, help='Quantization bit width of the model (e.g. 4)')
    parser.add_argument('--groupsize', type=int, default=128, help='Quantization group size of the model (e.g. 128, or -1 for none)')
    args = parser.parse_args()

    output = run_llama_inference(
        args.model_dir,
        wbits=args.wbits,
        groupsize=args.groupsize,
        load_path=f"{args.model_dir}/{args.model_file}",
        text=args.text,
    )

    with open("output.txt", "a", encoding="utf-8") as f:
        f.write(f"{args.text}\n{output}\n")

    print(f"Output: {output}")

if __name__ == "__main__":
    main()

Yeah... sorry for the noise. It does indeed work. The issue is that I have several test mules, and this one was set to groupsize = -1 at invocation. This model wants groupsize=128; otherwise it fails with the error I reported.
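For anyone hitting the same size mismatch: the first dimension of the qzeros/scales tensors is the number of quantization groups, so the shapes only line up when the groupsize passed at load time matches the one the checkpoint was quantized with. A rough sketch of the arithmetic and the corrected call, using the load_quant function and the paths from above:

# LLaMA-13B has hidden size 5120. Quantized with groupsize=128, the checkpoint stores
# 5120 / 128 = 40 groups per projection, hence k_proj.qzeros of shape [40, 640];
# loading with groupsize=-1 builds a model that expects a single group, i.e. [1, 640].

model = load_quant(
    "/workspace/stable-vicuna-13B-GPTQ",   # model dir holding config.json and tokenizer files
    "/workspace/stable-vicuna-13B-GPTQ/stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors",
    wbits=4,
    groupsize=128,   # must match the checkpoint's quantization settings, not -1
)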

But there is indeed something wrong with it, or with the way we are interfacing with it, because the output is not useful. Do you think something happened to it during the merge? Even when I change the input section to the code below and just press Enter to default to the preset prompt, I get a bad result. It does not really infer; to me it looks like reconstructed tokenizer output rather than a model inference result, and, as in your example, it also repeats. I got something similar from the Alpacino13b 4bit.safetensors model - it just repeated the input text as output. Do you know what may be causing this?

prompt = """\
    ### Human: Please provide a concise summary of the following news article, capturing the key information and stating company ticker symbols, and government entity abbreviations, whenever possible: Credit Suisse logged asset outflows of more than $68 billion during first-quarter collapse. Credit Suisse on Monday revealed that it suffered net asset outflows of 61.2 billion Swiss francs ($68.6 billion) during the first-quarter collapse that culminated in its emergency rescue by domestic rival UBS. The stricken Swiss lender posted a one-off 12.43 billion Swiss franc profit for the first quarter of 2023, due to the controversial write-off of 15 billion Swiss francs of AT1 bonds by the Swiss regulator as part of the deal. The adjusted pre-tax loss for the quarter came in at 1.3 billion Swiss francs. Swiss authorities brokered the controversial 3 billion Swiss franc rescue over the course of a weekend in late March, following a collapse in Credit Suisse's deposits and share price amid fears of a global banking crisis triggered by the fall of U.S. lender Silicon Valley Bank. In Monday's earnings report, which could be the last in its 167-year history, Credit Suisse said it experienced significant net asset outflows, particularly in the second half of March 2023, which have "moderated but have not yet reversed as of April 24, 2023." First-quarter net outflows totaled 61.2 billion, 5% of the group's assets under management as of the end of 2022. Deposit outflows represented 57% of the net asset outflows from Credit Suisse's wealth management unit and Swiss bank for the quarter.
    ### Assistant:\
    """
    text = input(f"Enter input text (default prompt will be used if left empty): ")
    if not text:
        text = prompt

Output:

Swiss bank Credit Suisse on Monday revealed that it had suffered net asset outflows of more than $68 billion during the first three months of 2023.

The controversial write-off of 15 billion Swiss francs of AT1 bonds by the Swiss regulator as part of the deal was the primary reason for the bank's collapse, which has left it with a net asset outflow of more than $68 billion.

Credit Suisse's share price has been on a rollercoaster since the start of the year, falling from more than $50 to below $20 in just a few months.

The bank's wealth management unit, which manages the assets of more than 1 million clients, posted a one-off 12.43 billion Swiss franc profit for the first quarter of 2023, due to the controversial write-off of 15 billion Swiss francs of AT1 bonds by the Swiss regulator as part of the deal.

Credit Suisse on Monday revealed that it had experienced significant net asset outflows, particularly in the second half of March 2023, which totalled 61.2 billion Swiss francs, or 5% of the group's assets under management as of the end of 2022.

The bank's collapse has left it with a net asset outflow of more than $68 billion, which it has been trying to manage since the start of the year.

Credit Suisse's share price has been on a rollercoaster since the start of the year, falling from more than $50 to below $20 in just a few months.

The bank's wealth management unit, which manages the assets of more than 1 million clients, posted a one-off 12.43 billion Swiss franc profit for the first quarter of 2023, due to the controversial write-off of 15 billion Swiss francs of AT1 bonds by the Swiss regulator as part of the deal.

Credit Suisse on Monday revealed that it had experienced significant net asset outflows, particularly in the second half of March 2023, which totalled 61.2 billion Swiss francs, or 5% of the group's assets under management as of the end of 2022.

The bank's share price has been on a rollercoaster since the start of the year, falling from more than $50 to below $20 in just a few months.

Credit Suisse on Monday revealed that it had experienced significant net asset outflows, particularly in the second half of March 2023, which totalled 61.2 billion Swiss francs, or 5%
Credit Suisse reported a massive $68 billion in asset outflows during Q1 2023, resulting in a one-time gain of $12.43 billion due to the write-off of AT1 bonds by the Swiss regulator. Despite this, the company still recorded a pre-tax loss of $1.3 billion for the quarter. Net outflows were primarily driven by withdrawals from the wealth management and Swiss bank units, with deposit outflows representing 57% of these losses. These outflows have moderated slightly since the beginning of April, but have not fully reversed yet.

Company Ticker Symbols:

Credit Suisse - CSGNY (New York Stock Exchange)

UBS - UBSG (Swiss Exchange)

Government Entity Abbreviations:

Switzerland - CHE (ISO 3166-1 alpha-3 code)

I seem to get good results in text-generation-webui. Maybe it is only your parameters. The parameters preset used for the above output is labeled Llama-Precise:

temp: 0.7
top_p: 0.1
top_k: 40
typical_p: 1
repetition_penalty: 1.18
encoder_repetition_penalty: 1
no_repeat_ngram_size: 0
min_length: 0
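For what it's worth, here is a rough sketch of how that preset could be passed straight to transformers' generate(), as a drop-in replacement for the generate() call in run_llama_inference above (parameter names follow GenerationConfig; max_new_tokens is an added cap, not part of the preset):

with torch.no_grad():
    generated_ids = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.7,
        top_p=0.1,
        top_k=40,
        typical_p=1.0,
        repetition_penalty=1.18,
        encoder_repetition_penalty=1.0,
        no_repeat_ngram_size=0,
        min_length=0,
        max_new_tokens=512,   # assumed cap, not part of the preset
    )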

Interesting, and that is indeed a great result. I now need to look at their source to see how they are invoking the model and passing the parameters. I passed the same parameters to my implementation and got a nonsensical result. I wonder if they are using generate() from transformers or something else.

I believe text-generation-webui includes "This is a conversation with your Assistant. The Assistant is very helpful and is eager to chat with you and answer your questions." as part of the context before the prompt. This might have some effect.
When prompted without that context, my responses weren't nearly as coherent.

I just tried it with several different input templates. The "Alpaca" template does not produce any output; "Alpaca-with-input" generates an output, but it is low quality, not the amazing one that you got. Could you tell me exactly what prompt you used, in its entirety? Also, the exact model. I am finding this process of deploying self-hosted LLMs to be like the proverbial herding of cats.

OK, I got it to work - thank you for the hint regarding the webui. I looked at their code, and there is nothing magical that they do to the input or the output, beyond presenting it more nicely for the Gradio UI. The key was indeed the prompt structure, and the webui makes it easy to test which prompt format works best for which model. stable-vicuna is doing really well for my needs with a fixed seed that I found:

Response:

Assistant: Companies Mentioned: Credit Suisse (CS), UBS (UBSGY). Government Entities Mentioned: Swiss Regulators. Summary: Credit Suisse reported a one-time gain of CHF 12.43 billion due to the write-off of CHF 15 billion worth of AT1 bonds by the Swiss regulators as part of the rescue package. Net asset outflows amounted to CHF 61.2 billion, primarily driven by withdrawals from the wealth management and Swiss bank units. Despite efforts to moderate the outflows, they have not fully reversed as of April 24th, 2023.

(Leaving this comment as there is no delete option.) Still looking into this.
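For reference, a sketch of the prompt construction that ended up working for me with this model, plus pinning the sampling seed (the helper name and the seed value here are illustrative, not the exact ones I use):

from transformers import set_seed

# stable-vicuna's expected chat format: "### Human:" followed by "### Assistant:"
def build_prompt(article_text):
    return (
        "### Human: Please provide a concise summary of the following news article, "
        "capturing the key information and stating company ticker symbols, and "
        "government entity abbreviations, whenever possible: "
        + article_text
        + "\n### Assistant:"
    )

set_seed(1234)  # illustrative value; pin whichever seed gives you stable results
prompt = build_prompt("<article text goes here>")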

Hello vmajor

I am trying to do the exact same thing as you but reading and looping through my email. I also used https://github.com/PanQiWei/AutoGPTQ in order to load the quantized model.

Two questions, please:

  • How were you able to pass the --instruct argument and the Vicuna-v0 instruction template to the model through Python?
  • How much VRAM does your model consume after loading and during inference?

Many thanks.

There are two entire programs that I shared above - take a look. They both run, but the problem is that the results are unstable (literally unstable with stable-vicuna, because the model quits seemingly at random): inference may work once or three times, then fail intermittently when you least expect it. As I suggested in the other thread, work with other models, not 13B GPTQ. If you keep trying with 13B GPTQ and get them to work reliably, I'd love to hear how you did it. I could not even get them to work reliably with the webui, so it was not my code that was causing the issues.

EDIT: regarding VRAM, 13B GPTQ models occupy a bit over 10 GB of VRAM.
EDIT 2: Sorry, I now see that this is a different thread. Try TheBloke's version of the code. That one has a better prompt than my original code and has additional print statements.
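If you want to see where the VRAM goes at each stage, a quick way is to print torch's CUDA memory counters around loading and generation; a minimal sketch (values will vary with model and sequence length):

import torch

def report_vram(label):
    # Current and peak VRAM allocated by PyTorch on the default GPU, in GiB
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"{label}: {allocated:.2f} GiB allocated, {peak:.2f} GiB peak")

# Example usage around the existing calls:
# report_vram("after load")
# generated_ids = model.generate(...)
# report_vram("after generate")   # the peak value shows the spike during inference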

@vmajor About VRAM: indeed, it uses about 10 GB when loaded, but whenever I send a request for a prompt, or chunks of a prompt, the memory usage skyrockets. I don't know if you have seen this kind of behavior before.

Did you try the --instruct argument and Vicuna-V0 parameter in the webui?

The only time I saw the RAM spike uncontrollably was when I used native HF models with transformers, and when using that beam search thing. I do not remember the exact wording because I do not have the webui running right now. I do not recall trying the --instruct or Vicuna-v0 parameters in the webui.

You have to try --instruct and Vicuna-v0; this changes everything when running it in the webui. No more gibberish generation. Much more stable.

I cannot see these flags available here: https://github.com/oobabooga/text-generation-webui

Where did you set them?

Hello @vmajor, would you mind sharing the code you're using with the 65B model? Thanks.

OK sure, you can use this as a basic example for instruction-based inference.

import argparse
from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="/home/vmajor/models/alpaca-lora-65B-GGML/alpaca-lora-65B.ggml.q5_1.bin")
args = parser.parse_args()

llm = Llama(model_path=args.model, n_threads=12, n_ctx=2048, seed=1000, n_batch=128, last_n_tokens_size=150)

context = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n"

instruction = "### Instruction: Please provide a concise summary of the following news article, capturing the key information and stating company ticker symbols, and government entity abbreviations, whenever possible: \n"

input = """### Input: Major Wall Street firm sees a breakout in luxury stocks β€” and lists three reasons why ETFs are a great way to play it. As luxury stocks make waves overseas, State Street Global Advisors believes investors should consider European ETFs if they want to capture the gains from their outperformance. Matt Bartolini, the firm's head of SPDR Americas research, finds three reasons why the backdrop is becoming particularly attractive. First and second on his list: valuations and earnings upgrades. "That's completely different than what we saw for U.S. firms," he told CNBC's Bob Pisani on "ETF Edge" this week. His remarks come as LVMH became the first European company to surpass $500 billion in market value earlier this week. Bartolini lists price momentum as a third driver of the investor shift. \n"""

output = "### Response: \n"

print(context + instruction + input + output)
summary_output = llm(
    context + instruction + input + output,
    max_tokens=1024,
    stop=None,
    temperature=1.0,
    repeat_penalty=1.1,
    top_k=160,
    top_p=0.5,
    echo=True,
)

summary_text = summary_output["choices"][0]["text"]
# Split the output on the "### Response:" marker
parts = summary_text.split("### Response:")

# Check if there are two parts after splitting
if len(parts) == 2:
    # Get the second part which contains the answer
    answer = parts[1]

    # Strip any leading or trailing whitespace from the answer
    answer = answer.strip()

    # Save the answer to a file
    with open("summary.txt", "w") as f:
        f.write(answer)
else:
    # Print an error message if there is no answer in the output
    print("No answer found in the output.")

Thanks, I will try it with the 30B model ASAP. If you are curious, this is the code in which I was trying to implement cleaning and summarizing of financial emails. It is not working, of course, and I don't know how to leverage my 4090 with any other model.

import pandas as pd
from msal import PublicClientApplication
from bs4 import BeautifulSoup
import re
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

# Load the summarizer pipeline
print("Loading summarizer...")
summarizer = pipeline("summarization")

# Location of the model on disk
quantized_model_dir = r"D:\Clone\TheBloke stable vicuna 13B GPTQ"


# Set up the tokenizer used to send requests to the model
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

# Global variable to store the loaded model
loaded_model = None

def load_gpt_model():
    global loaded_model
    if loaded_model is None:
        loaded_model = get_model("stable-vicuna-13B-GPTQ-4bit", triton=False, model_has_desc_act=False)
    return loaded_model

# GPTQ quantization config for the model (the original paste was truncated here;
# reconstructed as 4-bit / groupsize 128 to match this checkpoint)
def get_config(has_desc_act):
    return BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=has_desc_act,
    )
# Function to load a particular variant of the model depending on the file name suffix
def get_model(model_base, triton, model_has_desc_act):
    if model_has_desc_act:
        model_suffix="latest.act-order"
    else:
        model_suffix="compat.no-act-order"
    return AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_safetensors=True, model_basename=f"{model_base}.{model_suffix}", device="cuda:0", use_triton=triton, quantize_config=get_config(model_has_desc_act))

# Prevent printing spurious transformers error
logging.set_verbosity(logging.CRITICAL)
# Load the GPTQ model
print("Loading GPTQ model...")
model = load_gpt_model()
# Generation settings for the model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)


# Functions to clean text and extract paragraphs

def extract_text(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    text_parts = []

    for tag in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']):
        text_parts.append(tag.get_text(separator=' '))

    return ' '.join(text_parts).strip()

def clean_email_text(text):
    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', '', text)

    # Remove phone numbers
    text = re.sub(r'\+?\d[\d -]{7,}\d', '', text)

    # Remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

    # Remove "unsubscribe" and "disclaimer" sections
    text = re.sub(r'((Un)?subscribe|Disclaimer)[\s\S]*', '', text, flags=re.IGNORECASE)

    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text


##################################################################################################
##################################################################################################
##########################Clean with VICUNA 13B ##################################################
##################################################################################################
##################################################################################################
def clean_text_with_gptq_model(cleaned_relevant_text, max_tokens=2048):
    # Adjust the number to reserve tokens for the prompt
    reserved_tokens = 1000
    max_chunk_tokens = max_tokens - reserved_tokens

    # Tokenize the text using the model's tokenizer
    tokens = tokenizer.encode_plus(
        cleaned_relevant_text,
        max_length=max_chunk_tokens,
        return_overflowing_tokens=True,
        truncation=True,
        padding='max_length',
        stride=0
    )

    chunks = [tokens['input_ids']]
    if 'overflowing_tokens' in tokens:
        # Convert overflowing tokens to lists of integers
        chunks.extend(tokens['overflowing_tokens'])

    cleaned_text_chunks = []
    for chunk in chunks:
        # Flatten the list of lists into a single list of integers
        flat_chunk = [item for sublist in chunk for item in sublist]
        decoded_chunk = tokenizer.decode(flat_chunk, skip_special_tokens=True)
        prompt = f"Please remove disclaimers and any irrelevant information from the following text:\n\n{decoded_chunk}\n\nCleaned text:"
        
        generated_text = pipe(prompt)[0]['generated_text']
        response = generated_text.split("Cleaned text:")[-1].strip()
        
        cleaned_text_chunks.append(response)

        # Free up GPU memory
        torch.cuda.empty_cache()

    return " ".join(cleaned_text_chunks)

##################################################################################################
###############Grabbing messages and looping them cleaning and sending mail#######################
##################################################################################################
##################################################################################################
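# NOTE: the objects used below (`response` from the mail API request, `stock_names`,
# the `summary` DataFrame, `access_token` and `send_email`) are assumed to be defined
# earlier in the full program; only the processing loop is shown here.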

if response.status_code == 200:
    messages = response.json()["value"]
    for message in messages:
        subject = message["subject"]
        content = message["body"]["content"]

        # 1. Extract text from the email content (BeautifulSoup takes the HTML message body and returns plain text)
        text = extract_text(content)
        stock_found = None
        relevant_text = None

        # 2. Search for stock names in the text; if one is found, mark the stock and treat the text as relevant
        for stock_name in stock_names:
            if stock_name.lower() in text.lower():
                stock_found = stock_name
                relevant_text = text
                break

        if stock_found:
            print(f"Found text mentioning {stock_found}: {relevant_text}")

            # 3. Clean the relevant text with the regex-based clean_email_text helper
            cleaned_relevant_text = clean_email_text(relevant_text)
            print("Cleaned relevant text using clean_email_text.")
            print(f"Cleaned text length: {len(cleaned_relevant_text)}")
            # 4 Send the cleaned relevant text for further cleaning
            cleaned_relevant_text = clean_text_with_gptq_model(cleaned_relevant_text)
            print("Cleaned relevant text using GPTQ model.")
            
            try:
                summarized_text = summarizer(cleaned_relevant_text, max_length=600, min_length=10, do_sample=False)[0]["summary_text"]
                # DataFrame.append was removed in pandas 2.x; build a one-row frame and concat instead
                summary = pd.concat([summary, pd.DataFrame([{"Stock": stock_found, "Subject": subject, "Summary": summarized_text}])], ignore_index=True)
                print(f"Summarized text for {stock_found}.")
            except IndexError:
                print(f"Error summarizing the text mentioning {stock_found}.")

    # Send email with the summary
    if not summary.empty:
        send_email(summary, access_token)
    else:
        print("No emails mentioning stocks found")

else:
    print(f"Error fetching messages: {response.status_code}, {response.text}")
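A quick way to sanity-check the GPTQ half of this in isolation, before wiring in the email loop, is to run the pipeline defined above on a hand-written prompt in the format stable-vicuna expects (the prompt text below is just an example):

# Smoke test for the text-generation pipeline defined above, using the
# "### Human:" / "### Assistant:" format that stable-vicuna expects.
test_prompt = (
    "### Human: Please provide a concise summary of the following text: "
    "LVMH became the first European company to surpass $500 billion in market value this week.\n"
    "### Assistant:"
)
print(pipe(test_prompt)[0]["generated_text"])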

OK I looked at the code just quickly, and have a few observations:

  1. Regex did not work for me at all for symbol lookup. In fact, the only thing that worked was Alpaca 65B (most of the time), and GPT-3.5 and GPT-4 (every time). So if you are already using API calls to GPT-3.5, asking it to reverse-lookup symbols would be money well spent.

  2. Make sure that your prompt is constructed exactly as per the original model documentation. These smaller models, and llama in particular, are extremely sensitive to how you format the prompt. Including the context and the ### Instruction:, ### Input: and ### Response: markers is not optional if you want a consistent response (see the sketch after this list).

  3. Buy more RAM and don't waste time on smaller models... or do, and then share your success story. It would be groundbreaking, and I for one would love to read about how you made a sub-65B model work for your use case (similar to my use case), consistently and often enough to be genuinely usable.
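As mentioned in point 2, here is a minimal sketch of a prompt builder that follows the Alpaca-style format from my llama-cpp example earlier in the thread (the helper name is just illustrative):

def build_alpaca_prompt(instruction, input_text):
    # Context, instruction, input and response markers exactly as in the model's documentation
    context = ("Below is an instruction that describes a task, paired with an input that "
               "provides further context. Write a response that appropriately completes the request.\n")
    return (context
            + "### Instruction: " + instruction + "\n"
            + "### Input: " + input_text + "\n"
            + "### Response: \n")

prompt = build_alpaca_prompt(
    "Please provide a concise summary of the following news article, capturing the key "
    "information and stating company ticker symbols, and government entity abbreviations, "
    "whenever possible:",
    "<cleaned email text goes here>",
)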

...and I just got an email from OpenAI telling me that they (finally) granted me GPT-4 API access, yey :)

Congratulations on the API access! But I would guess that the API call costs will be tremendous :D.

For my use case the regex looks for stock names, or parts of stock names, and it works quite well, actually: if there is any email where a specific stock is mentioned, it takes the whole email, cleans it, and feeds it to the LLM.

I will buy more RAM, but I don't think my 5800X3D will be enough for inference. I don't know.

Is there a forum or Discord for noobs like me to discuss these topics? It was a pleasure talking to you, btw.

OK, maybe I will take a closer look at your regex and see if I can use it. Your CPU will work if you can increase your system RAM. I have 128 GB, and it is a decent balance between the largest models that I want to work with, given the size and speed of inference. Meaning, if I got more RAM I could use larger models, but I do not want to wait a day for the inference to finish.

Forum... not sure. I usually follow Reddit and the repositories on GitHub. Since you are using GPTQ you may get good answers there.

There's quite a number of Discords for discussing this stuff. I hang out on several. Here you go:

  • Alpaca Lora (lots of discussion of fine tuning and training especially, and inference and coding too): https://discord.gg/ZMHkCGy9
  • LmSys (people who released Vicuna. Discussion on inference and fine tuning/training): https://discord.gg/CPz84krv
  • GPT4ALL (a company called Nomic, who release models and have a simple local UI for inference. Not much discussion of fine tuning, some coding talk, lots of inference talk): https://discord.gg/sfWUbDKH

Sorry to necro this thread a bit, but is the prompting style below, with ### Instruction:/### Response:, something that will stop Stable Vicuna conversing with itself, and can it be used in a LangChain prompt template? I'm having the toughest time stopping the model from doing that with ### Human:/### Assistant:, no matter whether I'm using custom_stopping_strings, eos_token_ids, etc. Nothing stops it. I end up stripping the rest of the conversation out, and it extends inference time. Any guidance would be appreciated!

(This reply quoted, verbatim, the llama-cpp-python instruction-based inference example posted earlier in the thread.)

Can't help with stable vicuna but that prompt method works with all llama models I tried.

I should note that no prompt worked reliably with the smaller models. The smallest model that I have used that is reliable is the new 30B Wizard.
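If you want to hard-stop at the role marker yourself rather than strip it out afterwards, one generic approach with plain transformers generate() is a custom StoppingCriteria (llama-cpp-python can instead take stop=["### Human:"] directly in the call). A rough, untested sketch - class and variable names are just illustrative:

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnSubstring(StoppingCriteria):
    """Stop as soon as a given substring appears in the newly generated text."""
    def __init__(self, tokenizer, stop_string, prompt_len):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        generated = self.tokenizer.decode(input_ids[0][self.prompt_len:])
        return self.stop_string in generated

# Usage with the earlier GPTQ script (model, tokenizer, input_ids as defined there):
# stops = StoppingCriteriaList([StopOnSubstring(tokenizer, "### Human:", input_ids.shape[1])])
# generated_ids = model.generate(input_ids, stopping_criteria=stops, do_sample=True, max_new_tokens=512)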

Ok, thanks for the clarification! The model works quite well in text-generation-webui with the Vicuna-v0 template in instruct mode. I just couldn't find out exactly how the UI was feeding that into generation.
