VRAM Estimates

by ernestr - opened Mar 5

Mar 5

Thanks so much for your reviews and merges!

Could you provide estimates of VRAM usage for the EXL2 quants at different context sizes, e.g. 16K and 32K? Or could you point me to steps for calculating it myself, given the specific tokenizer and the size of the repo?

wolfram

Owner Mar 5

I put that information on the EXL2 versions' model cards:

Max Context w/ 48 GB VRAM: (24 GB VRAM is not enough, even for 2.4bpw, use GGUF instead!)

2.4bpw: 32K (32768 tokens) w/ 8-bit cache, 21K (21504 tokens) w/o 8-bit cache
2.65bpw: 30K (30720 tokens) w/ 8-bit cache, 15K (15360 tokens) w/o 8-bit cache
3.0bpw: 12K (12288 tokens) w/ 8-bit cache, 6K (6144 tokens) w/o 8-bit cache
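
For readers who want to calculate this themselves, a rough back-of-the-envelope sketch follows. It is not the method used to produce the figures above: it simply adds the quantized weight size (parameters × bpw / 8) to a standard key/value cache estimate (2 × layers × KV heads × head dimension × tokens × bytes per element). All function names and the example numbers are hypothetical placeholders; for Llama-style models the real values can usually be read from `config.json` (`num_hidden_layers`, `num_key_value_heads`, and `hidden_size / num_attention_heads` for the head dimension). Quantized caches also store some scaling data, and the loader needs working buffers, so actual usage will be somewhat higher than this estimate.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bpw / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, cache_bits: int = 16) -> float:
    """Approximate size of the key/value cache in GiB.

    The factor 2 accounts for one K and one V tensor per layer.
    cache_bits is 16 for an FP16 cache, 8 or 4 for a quantized cache
    (ignoring the small overhead of quantization scales).
    """
    n_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * cache_bits / 8
    return n_bytes / GIB


def estimate_vram_gib(n_params: float, bpw: float, n_layers: int,
                      n_kv_heads: int, head_dim: int, context_tokens: int,
                      cache_bits: int = 16, overhead_gib: float = 0.0) -> float:
    """Weights + cache + a user-supplied allowance for buffers and activations."""
    return (weights_gib(n_params, bpw)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, cache_bits)
            + overhead_gib)


if __name__ == "__main__":
    # Placeholder architecture values for illustration only (NOT taken from
    # this model's config): substitute the numbers from the real config.json.
    example = dict(n_params=120e9, n_layers=140, n_kv_heads=8, head_dim=128)
    for bpw in (2.4, 2.65, 3.0):
        for ctx in (16384, 32768):
            total = estimate_vram_gib(bpw=bpw, context_tokens=ctx,
                                      cache_bits=8, overhead_gib=1.0, **example)
            print(f"{bpw} bpw, {ctx} tokens, 8-bit cache: ~{total:.1f} GiB")
```

Comparing the result against the measured limits above is a reasonable sanity check: if the estimate for a given bpw and context size lands well above the available VRAM, that combination is unlikely to load.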

ernestr

Mar 6

•

edited Mar 6

Thanks! In case folks are curious about 64 GB of VRAM, here is where I maxed out with 3.5bpw. I'm getting ~3-4t/s with minimal context in cache.

3.5bpw: 20K (20,000 tokens) w/ 8-bit cache

invictus1

Mar 16

•

edited Mar 16

At 5.0bpw with 4-bit cache and full context, I'm using 76.8 GB of VRAM and it's generating at 11-13 t/s. This is with an A100 80 GB. Also, Wolfram, I absolutely love this model. Thank you so much for making something this godly!

wolfram

Owner Mar 16

Thanks, guys, for all of this information. And now I want an A100, too! ;)

I'm happy with how it turned out, but I didn't do much besides merging, converting, and quantizing the already godly components others provided. But I'm glad you like it so much! :)

cloudyu

Apr 2

I recently bought a Mac with an M2 Ultra and 192 GB of RAM. Can the EXL2 versions of the model run on a Mac?

Adzeiros

Jun 4

> I put that information on the EXL2 versions' model cards:
>
> Max Context w/ 48 GB VRAM: (24 GB VRAM is not enough, even for 2.4bpw, use GGUF instead!)
>
> 2.4bpw: 32K (32768 tokens) w/ 8-bit cache, 21K (21504 tokens) w/o 8-bit cache
> 2.65bpw: 30K (30720 tokens) w/ 8-bit cache, 15K (15360 tokens) w/o 8-bit cache
> 3.0bpw: 12K (12288 tokens) w/ 8-bit cache, 6K (6144 tokens) w/o 8-bit cache

Just curious: I have a dual 3090 setup, and I cannot run 3.0bpw on it at all, even with 8K context and 4-bit cache. Any tips on how I can get it to work?
