Is the snippet in the README loading the model in 8-bit mode? Passing load_in_8bit=True fails. Does this mean the model runs in 8-bit mode without needing to pass the argument? What should the dtype be?
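For context, this is roughly how I'm checking what actually got loaded. It's a minimal sketch, assuming `model` is the object returned by transformers' from_pretrained; int8-quantized layers should report torch.int8 while the remaining modules stay in fp16/fp32:

```python
# Minimal sketch: inspect which dtypes the loaded weights use.
# Assumes `model` is the object returned by from_pretrained().
from collections import Counter

dtype_counts = Counter(str(p.dtype) for p in model.parameters())
print(dtype_counts)  # int8-quantized layers show up as torch.int8

# Rough in-memory footprint implied by those dtypes.
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"~{total_bytes / 1e9:.1f} GB of parameters")
```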
What is the issue that you are facing?
Could you please provide a script to load this model in plain Python?
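For what it's worth, the generic transformers recipe looks like the sketch below. It assumes a CUDA GPU with enough VRAM, and the repo id is a placeholder; if load_in_8bit=True errors on pre-quantized weights, dropping the flag while keeping device_map="auto" is worth trying:

```python
# Minimal sketch: load a causal LM in 8-bit with transformers + bitsandbytes.
# Requires a CUDA GPU; "your-org/your-model" is a placeholder repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spread layers across available GPUs
    load_in_8bit=True,          # bitsandbytes int8 quantization
    torch_dtype=torch.float16,  # dtype for the non-quantized modules
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```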
It failed to load into oobabooga/text-generation-webui for CPU inference:
RuntimeError: No GPU found. A GPU is needed for quantization.
bitsandbytes 8-bit quantization requires a GPU that can hold the whole model; it is not compatible with CPU inference.
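One way to fail fast instead of hitting that RuntimeError is to check for CUDA before attempting the 8-bit load. A minimal sketch, with the caveat that the CPU fallback only works for repos that actually ship unquantized weights:

```python
# Minimal sketch: guard the 8-bit load, since bitsandbytes'
# int8 kernels are CUDA-only. "your-org/your-model" is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model_id = "your-org/your-model"

if torch.cuda.is_available():
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", load_in_8bit=True
    )
else:
    # CPU inference needs unquantized weights; this branch only
    # works if the repo actually contains them.
    model = AutoModelForCausalLM.from_pretrained(model_id)
```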
The script you provided in the README loads the model in full precision; do we need to pass load_in_8bit? As of now, it downloads the full checkpoints when loaded using .from_pretrained(...)
The weight files are only 40+ GB, versus 90+ GB for full precision. I usually pass load_in_8bit when loading as well, but it should be impossible to load the original full-precision weights from this repo.
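A quick sanity check is to total the downloaded shard sizes on disk; a minimal sketch, assuming the files come through huggingface_hub and using a placeholder repo id. At roughly one byte per int8 parameter versus two per fp16 parameter, 40+ GB of int8 shards lines up with a checkpoint that would be 80-90+ GB unquantized:

```python
# Minimal sketch: total the size of the checkpoint shards on disk.
# snapshot_download() returns the local cache directory for the repo
# (and fetches it first if it isn't cached yet).
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = Path(snapshot_download("your-org/your-model"))  # placeholder id
shards = list(local_dir.glob("*.bin")) + list(local_dir.glob("*.safetensors"))
total_gb = sum(f.stat().st_size for f in shards) / 1e9
print(f"{len(shards)} shard(s), ~{total_gb:.1f} GB on disk")
```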