Running on HF Inference Endpoint.

#2 opened by dev12br

I managed to run the original model on an Inference Endpoint, but it uses a lot of VRAM, ending up requiring a very expensive instance, so I was trying to run the quantized version instead, with no luck. Do you know how I could do that?

Sorry, I don't have a clue. I've never used HF's cloud compute and do everything on my own hardware. Not sure what quant formats they can handle.

They don't have support for exl2 out of the box, apparently. From what I understand, I'm going to have to create my own handler.
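For anyone landing here later: a custom handler on Inference Endpoints is a `handler.py` at the root of the model repo exposing an `EndpointHandler` class. Below is a minimal sketch of what that could look like for an exl2 quant, assuming the exllamav2 library's loading API; the sampler values and the response shape are illustrative, not something I've verified on an actual endpoint:

```python
# handler.py -- minimal sketch of a custom Inference Endpoints handler
# that loads an exl2 quant with exllamav2. Untested on a real endpoint.
from typing import Any, Dict

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the checkout of the model repo containing the exl2 weights
        config = ExLlamaV2Config()
        config.model_dir = path
        config.prepare()

        self.model = ExLlamaV2(config)
        # Lazy cache + autosplit spreads layers across the available GPUs
        self.cache = ExLlamaV2Cache(self.model, lazy=True)
        self.model.load_autosplit(self.cache)

        self.tokenizer = ExLlamaV2Tokenizer(config)
        self.generator = ExLlamaV2BaseGenerator(
            self.model, self.cache, self.tokenizer
        )

        # Illustrative sampling defaults -- tune for your model
        self.settings = ExLlamaV2Sampler.Settings()
        self.settings.temperature = 0.7
        self.settings.top_p = 0.9

    def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
        # Inference Endpoints passes the parsed request body as `data`
        prompt = data["inputs"]
        max_new_tokens = data.get("parameters", {}).get("max_new_tokens", 256)

        output = self.generator.generate_simple(
            prompt, self.settings, max_new_tokens
        )
        return {"generated_text": output}
```

You'd also need a `requirements.txt` next to `handler.py` listing `exllamav2`, and a GPU instance type, so the endpoint container can install and run it.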
