# EXL2 quants of gemma-2-27b-it
My quants are meant to be a tight fit in 24 GB VRAM. The following VRAM usage numbers assume 8k context.
| bpw | head | 4 bit cache | 16 bit cache | Notes |
|---|---|---|---|---|
| 5.8 | 8 bit | 21.85 GB | 23.69 GB | fits 16 bit cache, but lower BPW |
| **6.5** | 8 bit | 23.81 GB | 25.65 GB | **my recommendation** |
| 6.6 | 6 bit | 23.86 GB | 25.70 GB | slightly higher BPW, but less precise head |
For this model the difference between a 6 bit and an 8 bit head is only ~300 MB, which is not huge. That budget could instead buy about 0.1 bpw in the body.
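As a quick sanity check on that trade-off, here is a back-of-the-envelope sketch. It assumes gemma-2-27b has roughly 27.2 B parameters (an approximation; the exact count differs slightly) and ignores cache and activation overhead:

```python
N_PARAMS = 27.2e9  # approximate parameter count of gemma-2-27b

def weight_gb(bpw: float, n_params: float = N_PARAMS) -> float:
    """Storage for the quantized weights alone, in GB (no cache, no activations)."""
    return n_params * bpw / 8 / 1e9  # bits -> bytes -> GB

extra = weight_gb(0.1)  # ~0.34 GB: 0.1 bpw across the body costs about
                        # the same as the ~300 MB 6 bit vs 8 bit head difference
```

So spending the head savings on 0.1 bpw of body precision is roughly a wash in VRAM terms.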
Check out turboderp's quants and their measurement.json:
- 3.00 bits per weight
- 3.50 bits per weight
- 4.00 bits per weight
- 4.50 bits per weight
- 5.00 bits per weight
- 6.00 bits per weight
- 8.00 bits per weight