Stupid question: how exactly do I use this model for inference?
As title.
I've tried to load and run it, but the VRAM consumption was not reduced.
Could anyone help me? Thanks a lot!
Unsloth doesn't reduce VRAM for inference, only for training/fine-tuning. But we do make inference natively faster.
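For anyone landing here later, a minimal sketch of Unsloth's own inference path, based on the library's documented `FastLanguageModel` API. The model name and generation settings here are just examples, and this needs a CUDA GPU:

```python
# Sketch: inference with Unsloth's native (faster) path.
# Assumes a GPU and the `unsloth` package; the checkpoint name below
# is an example 4-bit upload, not necessarily the model in this thread.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # the checkpoint is already 4-bit quantized
)

# Switch Unsloth's optimized kernels into inference mode
FastLanguageModel.for_inference(model)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note this keeps the 4-bit footprint of the checkpoint itself; as said above, Unsloth speeds inference up rather than shrinking VRAM further.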
Got it. Thanks
Then if I want to use this model for inference with Hugging Face, do I just load it as I would an official Llama model? I don't need any special configuration?
And BTW, will using Unsloth models with HF be faster, too?
Thanks for your reply!!
You will need to convert this model to GGUF to run it that way. I would recommend using an already pre-uploaded GGUF of this model. Using Unsloth models will only be faster because they are 4-bit quantized, but you will still need to convert to GGUF in order to run it!
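If no pre-uploaded GGUF exists, a rough sketch of the conversion route using llama.cpp's converter script. All paths and filenames below are placeholders, and this assumes you have already downloaded the HF-format checkpoint locally:

```shell
# Sketch: convert an HF-format model directory to GGUF, then run it.
# Requires a local clone of llama.cpp and its Python requirements.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert the downloaded HF checkpoint (placeholder path) to a GGUF file
python convert_hf_to_gguf.py /path/to/downloaded/model --outfile model.gguf

# Run a quick prompt with llama.cpp (build the binaries first)
./llama-cli -m model.gguf -p "Hello"
```

Grabbing an existing GGUF upload, as suggested above, skips all of this.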
Got it.
Thanks a lot!!!