Performance tweaks

#8
by zxbc2023 - opened

Just to help other folks who stumble on this model: I had severe performance issues until I tweaked my load settings in text-generation-webui (oobabooga, Windows). On an RTX 2070 with 8GB VRAM I was getting as low as 1-2 tokens/s for a while until I changed my load settings. Now I get a consistent 10+ tokens/s, sometimes even 20+ t/s, and performance stays stable in long sessions. All I did was load the model with the ExLlama loader, with max_seq_len 4096 and compress_pos_emb 2. Generation starts almost immediately with hardly any delay, and it's very usable now.
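If you launch the webui from the command line instead of setting these in the UI, the same options can be passed as flags. This is just a sketch assuming the standard text-generation-webui CLI flags; the model folder name is a placeholder for whatever sits in your models/ directory:

```
# Sketch: pass the loader and ExLlama context settings at launch.
# "MyModel-GPTQ" is a hypothetical folder name under models/.
python server.py --model MyModel-GPTQ --loader exllama --max_seq_len 4096 --compress_pos_emb 2
```

For context, compress_pos_emb is generally meant to be set to max_seq_len divided by the model's original context length, so 2 here matches stretching a 2048-context model to 4096.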

Hope this helps others who are struggling with performance.

zxbc2023 changed discussion status to closed
