Can you provide quantized models?
Some users have done so already for certain models, but most (including this one) aren't yet quantized on Hugging Face. In addition, quantizing on user hardware isn't a straightforward process, and even users with higher-end hardware can struggle with out-of-memory (OOM) issues if it isn't top tier. As this is quickly becoming the primary way users interact with your models, I think it would be advantageous to post these yourself instead of leaving it up to users to figure out on their own and, potentially, eventually redistribute them.
I'm aware Hugging Face is developer-centric, and thus many will know, or be able to figure out, how to quantize models, and can use Google Colab to do so. But even if that's the expectation, there aren't easily accessible Colab notebooks with scripts to ease this process, requiring users to set up and configure a notebook themselves for this purpose.
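For context, the do-it-yourself route today looks roughly like the sketch below. This is only a minimal illustration assuming a transformers + bitsandbytes 4-bit loading workflow (the model ID is a placeholder, not this repo), and it's exactly the kind of boilerplate a ready-made Colab or pre-quantized upload would spare end users:

```python
# Minimal sketch: load a causal LM with 4-bit quantization via bitsandbytes.
# Assumes transformers, accelerate and bitsandbytes are installed; model ID is hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-model"  # placeholder repo ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # place layers across available devices
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```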
For the end-user looking to run LLMs locally, the process of getting new text generation models is full of friction and confusion, and I don't think it has to be this way. It would set a good precedent for one of the most user-centric LLM and API developers to provide a user-centric experience.
We are currently in the process of adding 4-bit support to KoboldAI, and part of that requires some changes to how the format currently works. Once that is ready, we will definitely begin providing quantized models.
Awesome, can't wait!