load_in_8bit fine-tuning requires more memory than this notebook
I found this example and was using it before I learned about load_in_8bit. It worked, and I was able to fine-tune the model on Colab.
After fine-tuning and save_pretrained, I realised that I was unable to load the fine-tuned model in another notebook using from_pretrained, and discovered that there were version issues with pytorch and transformers.
I've been trying to use load_in_8bit to fine-tune, but it fills the GPU memory and crashes as soon as the training loop starts.
What's the difference between this notebook and load_in_8bit?
Is it LoRA, and how could this be implemented with load_in_8bit?
Thanks
TL;DR
- load_in_8bit does the forward pass faster, especially for small batches || this implementation is slower because it needs to de-quantize weights, while load_in_8bit runs the forward pass with quantized weights (see the snippet after this list)
- load_in_8bit currently requires Turing GPUs or newer (e.g. a colab T4 or a 2080 is fine, a colab K80 or a 1080Ti is not) || this implementation works with any GPU or CPU
- load_in_8bit currently supports only the forward pass, i.e. no fine-tuning, BUT they are working on a LoRA implementation there and will post an update in a few weeks.
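For context, this is roughly what the load_in_8bit path looks like on the transformers side, inference only (assuming bitsandbytes and accelerate are installed; the model name below is just a placeholder):

```python
# Inference-only 8-bit loading via transformers (no fine-tuning yet, as noted above).
# "EleutherAI/gpt-j-6B" is a placeholder model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    load_in_8bit=True,   # quantize the weights to int8 on load (needs bitsandbytes)
    device_map="auto",   # needs accelerate; dispatches layers onto the GPU
)
```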
Is it LoRA, and how could this be implemented with load_in_8bit?
Currently, it requires some coding:
- please install the latest bitsandbytes (i.e. this week's version)
- write a LoRA wrapper around bnb.nn.Linear8bitLt
-- in this wrapper, make sure you pass has_fp16_weights=True and memory_efficient_backward=True (see example test)
- use your wrapped layer instead of the standard bnb.nn.Linear8bitLt (see the sketch below)
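A minimal sketch of what such a wrapper could look like, assuming a bitsandbytes version that exposes memory_efficient_backward; the class name and the rank/scale arguments are illustrative, not part of any library API:

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb


class Linear8bitLtWithLoRA(nn.Module):
    """Frozen 8-bit linear layer plus a trainable low-rank (LoRA) adapter."""

    def __init__(self, in_features, out_features, rank=8, scale=1.0, bias=True):
        super().__init__()
        # quantized base layer, with the flags mentioned in the steps above
        self.base = bnb.nn.Linear8bitLt(
            in_features,
            out_features,
            bias=bias,
            has_fp16_weights=True,
            memory_efficient_backward=True,
        )
        for param in self.base.parameters():
            param.requires_grad = False  # only the LoRA adapters get gradients

        # low-rank adapters: output = base(x) + scale * x @ A^T @ B^T
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.normal_(self.lora_A, std=0.02)  # B stays zero -> no-op at init
        self.scale = scale

    def forward(self, x):
        out = self.base(x)
        # match dtypes in case the activations are fp16 and the adapters are fp32
        lora_out = x.to(self.lora_A.dtype) @ self.lora_A.t() @ self.lora_B.t()
        return out + self.scale * lora_out.to(out.dtype)
```

You would then swap this wrapper in for the model's linear layers (the attention and MLP projections are the usual targets) and train only lora_A / lora_B.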
Or wait for a couple of weeks till bnb and HF guys do that for you ;)