Stupid question: How to exactly use this model to inference?

#12
by saharayang99 - opened

As title.

I've tried to load and run it, but the VRAM consumption was not reduced.

Could anyone help me? Thanks a lot!

Unsloth AI org

Unsloth doesn't reduce VRAM for inference, only for training/fine-tuning. But we do make inference natively faster.
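
For reference, here's a minimal sketch of running inference through Unsloth's own `FastLanguageModel` path (the repo name below is just a placeholder — swap in whichever Unsloth checkpoint you actually loaded):

```python
from unsloth import FastLanguageModel

# Assumption: replace the model_name with the Unsloth checkpoint you are using.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's faster native inference mode

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```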

Got it. Thanks

Then if I want to use this model for inference with Hugging Face, do I just load it the same way as an official Llama model? Do I need any special configuration?
And BTW, will using Unsloth models with HF be faster, too?

Thanks for your reply!!

Unsloth AI org


You will need to convert this model to GGUF to run it with Hugging Face. I would recommend using an already pre-uploaded GGUF of this model. Unsloth models are only faster because they are 4-bit quantized, but you will still need to convert the model to GGUF in order to run it!
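
As a rough sketch of the "use a pre-uploaded GGUF" route, you can pull a GGUF file straight from the Hub with llama-cpp-python — the repo id and filename pattern below are placeholders, so pick the GGUF repo and quant that match your model:

```python
from llama_cpp import Llama

# Assumption: repo_id and filename are illustrative; use the actual GGUF repo/quant you want.
llm = Llama.from_pretrained(
    repo_id="unsloth/llama-3-8b-Instruct-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=2048,
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```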

Got it.
Thanks a lot!!!

saharayang99 changed discussion status to closed
