load_in_8bit fine-tuning requires more memory than this notebook
I found this example and was using it before I learned about load_in_8bit. It worked, and I was able to fine-tune the model on Colab.
After fine-tuning and save_pretrained, I realised that I was unable to load the fine-tuned model in another notebook using from_pretrained, and discovered that there were version issues with pytorch and transformers.
I've been trying to use load_in_8bit to fine-tune, but it fills the GPU memory and crashes as soon as the training loop starts.
What's the difference between this notebook and load_in_8bit?
Is it LoRA, and how could this be implemented with load_in_8bit?
Thanks
TL;DR
- load_in_8bit does the forward pass faster, especially for small batches || this implementation is slower because it needs to de-quantize weights, while load_in_8bit runs the forward pass with quantized weights (see the snippet after this list)
- load_in_8bit currently requires Turing GPUs or newer (e.g. a colab T4 or a 2080 is fine, a colab K80 or a 1080Ti is not) || this implementation works with any GPU or CPU
- load_in_8bit currently supports only the forward pass, i.e. no fine-tuning, BUT they are working on a LoRA implementation there and will post an update in a few weeks.
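For context, this is roughly what the load_in_8bit path looks like on the transformers side, inference only (assuming bitsandbytes and accelerate are installed; the model name below is just a placeholder):

```python
# Inference-only 8-bit loading via transformers (no fine-tuning yet, as noted above).
# "EleutherAI/gpt-j-6B" is a placeholder model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    load_in_8bit=True,   # quantize the weights to int8 on load (needs bitsandbytes)
    device_map="auto",   # needs accelerate; dispatches layers onto the GPU
)
```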
Is it LoRA, and how could this be implemented with load_in_8bit?
Currently, it requires some coding:
- please install the latest bitsandbytes (i.e. this week's version)
- write a LoRA wrapper around bnb.nn.Linear8bitLt
-- in this wrapper, make sure you pass has_fp16_weights=True and memory_efficient_backward=True (see example test)
- use your wrapped layer instead of the standard bnb.nn.Linear8bitLt (see the sketch below)
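A minimal sketch of what such a wrapper could look like, assuming a bitsandbytes version that exposes memory_efficient_backward; the class name and the rank/scale arguments are illustrative, not part of any library API:

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb


class Linear8bitLtWithLoRA(nn.Module):
    """Frozen 8-bit linear layer plus a trainable low-rank (LoRA) adapter."""

    def __init__(self, in_features, out_features, rank=8, scale=1.0, bias=True):
        super().__init__()
        # quantized base layer, with the flags mentioned in the steps above
        self.base = bnb.nn.Linear8bitLt(
            in_features,
            out_features,
            bias=bias,
            has_fp16_weights=True,
            memory_efficient_backward=True,
        )
        for param in self.base.parameters():
            param.requires_grad = False  # only the LoRA adapters get gradients

        # low-rank adapters: output = base(x) + scale * x @ A^T @ B^T
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.normal_(self.lora_A, std=0.02)  # B stays zero -> no-op at init
        self.scale = scale

    def forward(self, x):
        out = self.base(x)
        # match dtypes in case the activations are fp16 and the adapters are fp32
        lora_out = x.to(self.lora_A.dtype) @ self.lora_A.t() @ self.lora_B.t()
        return out + self.scale * lora_out.to(out.dtype)
```

You would then swap this wrapper in for the model's linear layers (the attention and MLP projections are the usual targets) and train only lora_A / lora_B.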
Or wait for a couple of weeks till bnb and HF guys do that for you ;)