Curious to know how an 8k context llama2 got trained on a 24GB GPU
Can you share the finetuning script? How did you train an 8k context length llama2 on a 24GB GPU?
I trained it on an RTX 6000 Ada, which has 48GB of VRAM. However, for this model, I didn't actually perform any training at 8k context length (unlike the first airophin model); I started from another model checkpoint that had already been trained at that length.
As far as the finetuning script goes, it's basically a modified version of the qlora script from the original qlora paper. I have a version of it here (there are some differences from what I actually used; I may update it when I get a chance): https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/qlora_airo.py
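The core of that kind of qlora script is just a 4-bit NF4 base model with LoRA adapters attached. A minimal sketch of that setup is below; the checkpoint name and the LoRA hyperparameters (r, alpha, target_modules) are illustrative defaults, not necessarily the exact values used for this model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Illustrative base checkpoint; the actual run started from a long-context Llama-2 checkpoint.
base_model = "meta-llama/Llama-2-13b-hf"

# QLoRA: load the frozen base model in 4-bit NF4 with double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; these are typical qlora defaults, not airophin's exact settings.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```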
Thanks @bhenrym14 for the clarification. Just wanted to confirm what you used for model_max_len in the mentioned script: 8192, or the default of 2048? Also, can you confirm whether a similar script was used to finetune the first airophin model (bhenrym14/airophin-13b-pntk-16k-fp16)? If so, what was the value of model_max_len there, and what GPU type and how many GPUs did you use?
This script relies on the RoPE monkey-patch to apply the desired interpolation factor (which I just hard-coded); I wrote it this way before transformers had native support for RoPE scaling. So yes, for this model, I did scale appropriately (a factor of 2) for 8192 context. Since transformers now has native support, I generally edit the backbone config to include the desired scaling method and use model_max_len to control the maximum sequence length the model sees in training; this is simply so I can run larger batches without risking OOM on a couple of long samples in an otherwise shorter-sequence dataset.
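With current transformers (no monkey-patch needed), that config edit looks roughly like this; the factor of 2 matches what I describe above, but the checkpoint name and the 3072-token cap are just illustrative:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-13b-hf"  # illustrative checkpoint name

# Native RoPE scaling: a linear factor of 2 stretches Llama-2's 4096 native positions to 8192.
config = AutoConfig.from_pretrained(base_model)
config.rope_scaling = {"type": "linear", "factor": 2.0}

model = AutoModelForCausalLM.from_pretrained(base_model, config=config)

# model_max_len only caps the sequence length seen in training (batch-size / OOM reasons);
# it does not change the 8192-token RoPE scaling above.
model_max_len = 3072  # illustrative cap
tokenizer = AutoTokenizer.from_pretrained(base_model, model_max_length=model_max_len)
batch = tokenizer(["some training example ..."], truncation=True, max_length=model_max_len)
```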
For the airoboros finetune phase, I capped model_max_len at ~3000 (again, the RoPE scaling is still for 8192). I trained on a single GPU, an RTX 6000 Ada Generation.
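To be clear about how those two numbers interact: linear interpolation just divides the position indices by the scaling factor before computing the RoPE angles, so the 8192 positions map into the 4096-position range Llama-2 was pretrained on, independent of the per-sample cap. A rough sketch of that position math (not the actual transformers implementation):

```python
import torch

scaling_factor = 2.0   # RoPE linear interpolation factor used for 8192 context
native_ctx = 4096      # Llama-2's pretraining context length
model_max_len = 3000   # approximate per-sample cap used during the airoboros finetune

# With linear scaling, position ids are divided by the factor before the RoPE angles are
# computed, so positions 0..8191 land inside the 0..4095 range the base model saw.
position_ids = torch.arange(scaling_factor * native_ctx)  # 0 .. 8191
interpolated = position_ids / scaling_factor              # 0.0 .. 4095.5
assert interpolated.max() < native_ctx

# Truncating training samples at ~3000 tokens does not change this mapping; it only limits
# how much of the 8192-token range the model actually sees per sample during finetuning.
```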