Stupid question: how exactly do I use this model for inference?
As title.
I've tried to load and run it, but the VRAM consumption was not reduced.
Could anyone help me? Thanks a lot!
Unsloth doesn't reduce VRAM for inference, only for training/fine-tuning. But we do make inference natively faster.
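For anyone landing here later, a minimal sketch of Unsloth's own inference path, based on the library's documented `FastLanguageModel` API. The model name and generation settings here are just examples, and this needs a CUDA GPU:

```python
# Sketch: inference with Unsloth's native (faster) path.
# Assumes a GPU and the `unsloth` package; the checkpoint name below
# is an example 4-bit upload, not necessarily the model in this thread.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # the checkpoint is already 4-bit quantized
)

# Switch Unsloth's optimized kernels into inference mode
FastLanguageModel.for_inference(model)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note this keeps the 4-bit footprint of the checkpoint itself; as said above, Unsloth speeds inference up rather than shrinking VRAM further.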
Got it. Thanks
Then if I want to use this model for inference with Hugging Face, do I just load it as I would an official Llama model? I don't need any special configuration?
And BTW, will using Unsloth models with HF be faster, too?
Thanks for your reply!!
You will need to convert this model to GGUF to run it that way. I would recommend using an already pre-uploaded GGUF of this model. Using Unsloth models will only be faster because they are 4-bit quantized, but you will still need to convert to GGUF in order to run it!
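If no pre-uploaded GGUF exists, a rough sketch of the conversion route using llama.cpp's converter script. All paths and filenames below are placeholders, and this assumes you have already downloaded the HF-format checkpoint locally:

```shell
# Sketch: convert an HF-format model directory to GGUF, then run it.
# Requires a local clone of llama.cpp and its Python requirements.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert the downloaded HF checkpoint (placeholder path) to a GGUF file
python convert_hf_to_gguf.py /path/to/downloaded/model --outfile model.gguf

# Run a quick prompt with llama.cpp (build the binaries first)
./llama-cli -m model.gguf -p "Hello"
```

Grabbing an existing GGUF upload, as suggested above, skips all of this.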
Got it.
Thanks a lot!!!