I had to update my GPTQ manually to run this, and it's very slow
This model did not run with the version of GPTQ-for-LLaMa that comes with Oobabooga, so I had to clone and install GPTQ-for-LLaMa from its git repo. Letting the installer re-download and reinstall Oobabooga's fork of GPTQ didn't work, and I tried both the current and the old CUDA branch; either way it ran at about 1/3 of my typical 7B speed. Trying to load it in Oobabooga either crashes, throws a pre-layer error, or gives a CPU OOM error, depending on the conditions.
I'm on a potato and normally get 0.49-0.59 tokens/s on WizardLM, but with this version I was getting 0.21 tokens/s, so after confirming that the cache was active, I assume GPTQ is the cause. With the updated GPTQ I couldn't run any of the other models I use, so I couldn't compare to be sure whether it was the model, but I thought it was worth mentioning since the main branch is supposed to offer higher compatibility.
Sorry to hear you're having problems.
I did test it with ooba's CUDA and it worked fine for me. But I didn't test pre_layer, and yes, I can see that it doesn't work. And it sounds like you don't have enough VRAM to load it without pre_layer.
I will make another one with ooba's CUDA and test it; if it's good I'll upload that as well. I'll get back to you in an hour or two.
OK I've made another model, this time using ooba's CUDA. I've tested it and confirmed it works with pre_layer.
It's in the separate branch oobaCUDA: https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ/tree/oobaCUDA
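If it's easier than downloading by hand, one way to grab just that branch is with huggingface_hub's snapshot_download. This is a minimal sketch, not the only way to do it; the local_dir below is just an example, point it at wherever your webui keeps models:

```python
# Minimal sketch: download only the oobaCUDA branch of the repo.
# local_dir is an example path; adjust to your text-generation-webui models folder.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="TheBloke/WizardLM-7B-uncensored-GPTQ",
    revision="oobaCUDA",  # the branch quantised with ooba's CUDA fork
    local_dir="models/WizardLM-7B-uncensored-GPTQ-oobaCUDA",
)
print("Downloaded to:", path)
```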
Please test it and let me know if it works OK for you.
Unfortunately I can't test whether it fixed the issue right now, because oobabooga broke on me. Something in their installer is utterly broken: trying to revert my GPTQ-for-LLaMa started giving me new issues that only got worse when I tried to update my install, and a fresh install won't let me load any GPTQ models now. I've already filed a bug report, but I don't know whether it'll be fixed in a day or a week. I tried downgrading a number of packages, but that isn't fixing it either. So no more LLMs for me until they fix it, unless I want to do an OS migration and try Linux (which had other issues with my device last time I tried, otherwise it'd be my default).
I fixed the problems I had. It's working at the same performance as standard WizardLM for me. Thank you for uploading that!
You're welcome.
But are you saying that https://huggingface.co/TheBloke/wizardLM-7B-GPTQ doesn't have these problems for you? Because I made that the same way as I made the original files in this repo, so that'd be confusing.
Yeah, I didn't have any problem with standard wizardLM, so it's a bit odd that this one had the problem but the other one did not. Couldn't say why; I'm positive I downloaded the compat no-act-order file.
Yeah I don't know what's going on there. I'd expect all the models I make to break on pre_layer atm, apart from the new file I made for you.
Regarding your install woes: you don't need to do an OS migration. You can install WSL2, which is Linux running under virtualisation. It supports Nvidia GPUs and CUDA. It's easy to install, and once it is installed, follow this guide: https://docs.nvidia.com/cuda/wsl-user-guide/index.html
Although that guide will currently recommend that you install 12.1, and you may not want to do that, as there are no pre-built PyTorch binaries for CUDA 12.x yet. So personally I'm staying with CUDA 11.x, e.g. CUDA 11.7 or 11.8. Here's the download link for CUDA 11.7.1 in WSL: https://developer.nvidia.com/cuda-11-7-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local
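Once CUDA and a matching PyTorch build are installed inside WSL2, a quick way to sanity-check the setup is to ask PyTorch what it can see. A minimal sketch, assuming you've installed a CUDA 11.x build of torch in that environment:

```python
# Quick sanity check inside WSL2: can PyTorch see the GPU,
# and which CUDA version was it built against?
import torch

print("CUDA available:", torch.cuda.is_available())
print("PyTorch built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```

If `torch.cuda.is_available()` returns False, the usual suspects are a missing WSL CUDA toolkit install or a CPU-only PyTorch build.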