This is Llama2-22b by chargoddard, converted to a couple of GGML formats. I have no idea what I'm doing, so if something doesn't work as it should, or not at all, that's likely on me, not the models themselves.
A second model merge has been released, and the GGML conversions for that can be found here.
While I haven't had any issues so far, do note that the original repo states: "Not intended for use as-is - this model is meant to serve as a base for further tuning".
Approximate VRAM requirements at 4K context:

| Model  | Size    | VRAM    |
| ------ | ------- | ------- |
| q5_1   | 16.4 GB | 21.5 GB |
| q4_K_M | 13.2 GB | 18.3 GB |
| q3_K_M | 10.6 GB | 16.1 GB |
| q2_K   | 9.2 GB  | 14.5 GB |
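
For a quick sanity check, here's a minimal sketch of loading one of these files with llama-cpp-python (assuming a GGML-era build such as 0.1.78; later releases only read GGUF). The file name and layer count are guesses based on common naming conventions, so adjust them to the actual files in this repo and to your GPU:

```python
from llama_cpp import Llama

# Hypothetical file name -- swap in the actual .bin from this repo.
llm = Llama(
    model_path="llama2-22b.ggmlv3.q4_K_M.bin",
    n_ctx=4096,       # 4K context, matching the VRAM table above
    n_gpu_layers=40,  # layers to offload to GPU; lower this if you run out of VRAM
)

out = llm("Write a short greeting.", max_tokens=64)
print(out["choices"][0]["text"])
```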