Your converted model performs better, and I don't understand why
Hi @TheBloke,
For some reason your model has slightly better PPL than any of the 4bit-128g versions I recently converted. I say "any" because I tried various combinations of GPTQ commits and transformers versions.
Here are some metrics from the versions I've tried:
| Model | wikitext2 PPL | ptb-new PPL | c4-new PPL |
|---|---|---|---|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 7.119165420532227 | 35.637290954589844 | 9.550592422485352 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit 508de42ff45ec560a4504e12b0d42114d599cf38) | 7.129854202270508 | 35.848060607910156 | 9.568032264709473 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit d89cdcd8b53f61346290a28d326816af6a028434) | 7.137491226196289 | 35.530372619628906 | 9.597953796386719 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit f3f7a6910fd6778548cdafe7f0d5155411b1696c) | 7.137701988220215 | 35.52903366088867 | 9.597844123840332 |
| 4bit-GPTQ - Thireus/Vicuna13B-v1.1-4bit-128g (GPTQ commit 49ffd9ab085004978a6bdc8e2dff7510f2458e71) | 7.137701988220215 | 35.52903366088867 | 9.597844123840332 |
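For context, these are the perplexities reported by GPTQ-for-LLaMa's evaluation over the wikitext2, ptb-new and c4-new test sets. As a rough illustration of what the wikitext2 number measures - not the exact llama.py code, just a transformers-only sketch with a placeholder model path:

```python
# Rough sketch of a wikitext2 perplexity measurement: score the concatenated test
# split in fixed-length segments and exponentiate the average negative log-likelihood.
# This is NOT the exact GPTQ-for-LLaMa llama.py evaluation, just an illustration.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "vicuna-13B-1.1-HF"  # placeholder: point this at the model you want to score
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
model.eval()

# Concatenate the wikitext-2 test split and score it in 2048-token segments
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tok("\n\n".join(test["text"]), return_tensors="pt")

seqlen = 2048
nlls = []
for i in range(0, enc.input_ids.size(1) // seqlen * seqlen, seqlen):
    batch = enc.input_ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        loss = model(batch, labels=batch).loss  # mean NLL over the segment
    nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"wikitext2 PPL: {ppl.item():.3f}")
```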
`pip freeze | grep transformers` reports:
`transformers @ git+https://github.com/huggingface/transformers@5bb4ec6233d6414a922ad2818f0bcf879de81c28`
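If you want to rule out the transformers revision as a factor, that exact commit can be installed directly with pip, for example:

```bash
# Pin transformers to the same git revision as my environment
pip install "git+https://github.com/huggingface/transformers@5bb4ec6233d6414a922ad2818f0bcf879de81c28"
```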
Would you have any idea why that is, and what could influence it? Did you use Triton or CUDA?
Oh, that's interesting.
To be honest I don't think I did anything special. The commands I use are in my READMEs, but for example:
`CUDA_VISIBLE_DEVICES=0 python3 llama.py vicuna-13B-1.1-HF c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors /workspace/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors`
Lately I have always been using the Triton branch for making GPTQs, and I made this GPTQ - and all my recent ones - with commit 58c8ab4c7aaccc50f507fd08cce941976affe5e0 on the qwopqwop repo, which was the last commit on April 13th, before he started the refactor that broke everything.
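One thing that might be worth comparing is what our two checkpoint files actually contain, since different GPTQ commits and flags can change which quantisation tensors get saved. A rough sketch using the safetensors library - the file name is just a placeholder:

```python
# List the quantisation tensors stored in a GPTQ .safetensors checkpoint.
# GPTQ checkpoints typically hold qweight/qzeros/scales (and sometimes g_idx) per linear layer.
from collections import Counter
from safetensors import safe_open

path = "vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"  # placeholder file name

suffixes = Counter()
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        suffixes[name.rsplit(".", 1)[-1]] += 1
        # Print one example layer so the shapes (which reflect wbits and group size) are visible
        if name.endswith("q_proj.qweight"):
            print(name, tuple(f.get_tensor(name).shape))

print(suffixes)
```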
In terms of pip versions, I can't check precise commits because I do all of this in the cloud and each instance gets destroyed afterwards. Until recently I was pulling peft and transformers from their respective GitHubs, but after transformers finally released 4.28.0 a couple of days ago I started using the standard versions. I did this GPTQ four days ago, so I think that would have been using GitHub transformers. PyTorch is 2.0.0+cu118 and Triton is 2.0.0.
Here's my initialisation script that installs all my dependencies:
echo -n "PIP OTHERS: "
(pip3 uninstall -qy transformers peft datasets loralib sentencepiece safetensors accelerate triton bitsandbytes huggingface_hub flexgen rwkv quant-cuda && \
pip3 install -q datasets==2.10.1 loralib sentencepiece safetensors==0.3.0 accelerate==0.18.0 triton==2.0.0 huggingface_hub && \
pip3 install -q transformers && \
pip3 install -q peft && \
pip3 install -q bitsandbytes==0.37.2 xformers && \
pip3 install -q markdown pyyaml tqdm requests gradio==3.24.1 flexgen==0.1.7 rwkv==0.7.3 ninja ) >/dev/null 2>errors.pip && echo " DONE" || cat errors.pip
echo -n "GIT SETUP: "
( git config --global credential.helper store && \
git config --global user.email "XXX" && \
git config --global user.name "TheBloke" && \
huggingface-cli login --add-to-git-credential --token 'XXX' && \
git lfs install ) >/dev/null 2>errors.gitsetup && echo " DONE" || cat errors.gitsetup
echo -n "GIT GPTQ: "
( git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa gptq-llama && \
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda gptq-llama-cuda && \
git clone https://github.com/oobabooga/GPTQ-for-LLaMa ooba-gptq-llama && \
git clone -n https://github.com/qwopqwop200/GPTQ-for-LLaMa gptq-safe && cd gptq-safe && git checkout 58c8ab4c7aaccc50f507fd08cce941976affe5e0 ) >/dev/null 2>errors.gitgptq && echo " DONE" || cat errors.gitgptq
echo -n "WEBUI SETUP: "
( rm -rf /content/text-generation-webui && \
git clone https://github.com/oobabooga/text-generation-webui && \
mkdir -p text-generation-webui/repositories && \
ln -s /content/gptq-safe text-generation-webui/repositories/GPTQ-for-LLaMa ) >/dev/null 2>errors.webui && echo " DONE" || cat errors.webui
I actually re-made this particular GPTQ four days ago, because I had realised that my original vicuna-13B-1.1-HF repo had been converted to HF with a buggy version of the transformers `models/llama/convert_llama_weights_to_hf.py` script, which caused the 13B models to use 37GB on disk instead of 26GB. So I re-converted my vicuna-13B-1.1-HF repo, and then, just in case that affected the GPTQ, I also re-made the GPTQs.
No idea if any of that would affect this, but that's what I did! I suppose that might mean I used a later version of GPTQ-for-LLaMa than you did?
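As an aside, if you ever want to sanity-check an HF conversion for that disk-size problem, the quickest way I know is to look at the dtypes and total tensor sizes in the shards. A rough sketch - the shard filename is a placeholder:

```python
# Check what dtypes and how many bytes an HF LLaMA conversion actually saved.
# The buggy convert_llama_weights_to_hf.py produced ~37GB on disk for 13B instead
# of ~26GB, so the dtypes and summed tensor sizes across shards are a quick tell.
from collections import Counter
import torch

shard = "pytorch_model-00001-of-00003.bin"  # placeholder shard name
state_dict = torch.load(shard, map_location="cpu")

print(Counter(str(t.dtype) for t in state_dict.values()))
total_bytes = sum(t.numel() * t.element_size() for t in state_dict.values())
print(f"tensor bytes in this shard: {total_bytes / 1e9:.1f} GB")
```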
Question for you if you've got a sec: what code/method do you use to produce those metrics? So far the only evaluation I've done on models has been for GGML models, using llama.cpp's perplexity binary. I've been meaning to try evaluating GPU models but haven't looked into it yet.
Thank you for the detailed answer. I'll look into this!
I am using cuda117 instead of cuda118, but I doubt that could be it... I also use a more recent version of bitsandbytes, 0.38.1.
All of this is on WSL, but I'm thinking of giving Google Colab a try (I believe that's what you're using).
To generate the metrics, enter the directory where you have your safetensors file and execute:
`python /content/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py . c4 --wbits 4 --groupsize 128 --load vicuna-13B-1.1-GPTQ-4bit-128g.safetensors --new-eval --eval`
Instead of `--new-eval --eval`, you can also use `--eval` alone, or `--benchmark 2048 --check`.
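Spelled out as full commands (same paths as above; `LLAMA_PY` is just a shorthand variable I'm introducing here):

```bash
# Run from the directory that contains the quantised .safetensors file
LLAMA_PY=/content/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py

# wikitext2 / ptb-new / c4-new perplexity (the numbers in the table above)
python $LLAMA_PY . c4 --wbits 4 --groupsize 128 --load vicuna-13B-1.1-GPTQ-4bit-128g.safetensors --new-eval --eval

# the same evaluation without the "-new" dataset variants
python $LLAMA_PY . c4 --wbits 4 --groupsize 128 --load vicuna-13B-1.1-GPTQ-4bit-128g.safetensors --eval

# generation benchmark over 2048 tokens, with a correctness check
python $LLAMA_PY . c4 --wbits 4 --groupsize 128 --load vicuna-13B-1.1-GPTQ-4bit-128g.safetensors --benchmark 2048 --check
```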
I remember our conversation about the size difference. ;)
Oh that was you! Sorry. I've had so many discussions here since I started uploading models that I can't remember all the different usernames :)
Thanks for the details on the evaluation, I'll have to try that.
And yes, I have used Google Colab a lot, as I don't have an Nvidia GPU at home yet - my home GPU is an AMD 6900XT 16GB on macOS, and that's not well supported at all for this sort of thing. Also my internet upload is only 4MB/s, which takes forever when dealing with larger models. Uploads from the cloud are way quicker, and if I need to reboot my PC or something, it won't interrupt anything.
Most of my GPTQs I did with Colab, but lately I've started to move over to Runpod. They have a lot more hardware options, and they support SSH and allow you to open any TCP ports you want, which Colab doesn't officially support. On Google Colab there are only two GPU options: T4 15GB or A100 40GB. I found that for most of what I was doing, the T4 was too small and slow, and the A100 was more than I needed. On Runpod I can pay $0.29/hr for a 3090, $0.69/hr for a 4090, $0.89/hr for an A100 40GB or $1.89/hr for an A100 80GB. And when it's done, it's usually way faster to upload to HF from a server than it is from my home internet.