I am trying to run the model locally but I am getting exit code 3221225477.

#140
by namantjeaswi - opened

Is it due to a memory limitation on my system? I am able to load and run the model through llama.cpp using the GGUF file, but I am failing to run it through the Hugging Face Transformers library.

The error you got seems to be memory-related. A great tool to use is the Model Memory Calculator: you provide the model ID and select different levels of model precision to see what the memory requirements are:

Precision Level      Estimated Memory Requirement
float32 (default)    27.49 GB
float16              13.74 GB
int8                  6.87 GB
int4                  3.44 GB
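These figures roughly follow a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter. Here is a minimal sketch of that arithmetic, assuming ~7.24B parameters for Mistral-7B; the calculator's numbers come out slightly higher because it also accounts for loading overhead:

# Rough rule of thumb: weight memory ≈ parameter count × bytes per parameter.
# PARAM_COUNT is an assumed value for Mistral-7B; the Model Memory Calculator
# reports slightly larger numbers because it also includes loading overhead.
PARAM_COUNT = 7.24e9

bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gib = PARAM_COUNT * nbytes / 2**30
    print(f"{precision:>7}: ~{gib:5.2f} GiB of weights")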

llama.cpp often uses 4-bit quantization, which takes the model from ~27.5 GB of memory down to ~3.5 GB. If you want to use 4-bit as well, you can still use the transformers library; you just need bitsandbytes / accelerate to accomplish this.

If bitsandbytes / accelerate do not work for you, you can also try to use this model instead: unsloth/mistral-7b-v0.2-bnb-4bit (v0.1 is available too)

pip install -q -U bitsandbytes transformers peft accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Quantize the weights to 4-bit and run compute in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate place the layers on the available GPU(s)/CPU.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# ...

Edit: Updated the code provided; it was giving me some issues with that model class.
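For the pre-quantized unsloth checkpoint mentioned above, the loading code is even shorter, since the 4-bit settings are stored in the checkpoint's config. A minimal sketch (bitsandbytes still needs to be installed for the 4-bit weights to load):

from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantization settings ship with the checkpoint, so no explicit
# BitsAndBytesConfig is needed here.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/mistral-7b-v0.2-bnb-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/mistral-7b-v0.2-bnb-4bit")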

Hello

Thank you for your response, you are indeed correct. I was working with 16 GB of RAM and the default 32-bit precision model, which is not sufficient. I also tried out bitsandbytes / accelerate and it works. I just want to know: is there any benefit or loss to using bitsandbytes / accelerate over GGUF models with llama.cpp at the same precision?

I would think that with bitsandbytes / accelerate I would be able to access more models, since relatively few models have GGUF versions. I am trying to build a unified codebase where I can compare the performance of different open-source models in my RAG pipeline, see if I can get close to GPT-4, and find an optimum point for inference time, size, and response quality.

I would benchmark the two with your use case and see how they perform in terms of speed, cost, and model accuracy. It's tough to give a definite answer, as what works better for me may not work better for you.
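As a rough starting point, here is a minimal timing sketch for the transformers/bitsandbytes side. The prompt is a hypothetical placeholder; swap in queries from your RAG pipeline, and write an equivalent loop with llama.cpp to compare:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical prompt for illustration; use your RAG pipeline's queries here.
prompt = "What is retrieval-augmented generation?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
print(tokenizer.decode(output[0], skip_special_tokens=True))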

I do agree that using bitsandbytes would be much more flexible, as you can adjust precision very quickly and use models that no one (including yourself) has quantized and posted publicly; see the sketch below. Best of luck with your project!
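For example, switching the same loading code between 8-bit and 4-bit is just a config change (sketch):

import torch
from transformers import BitsAndBytesConfig

# ~6.9 GB of weights for this model
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# ~3.4 GB of weights, computing in bfloat16
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)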

namantjeaswi changed discussion status to closed
