V8 vs V16?

#1
by AIGUYCONTENT - opened

I noticed there are three quant versions of this model you did... two V16 and this one (V8). Which one is more intelligent? And what do the V8 and V16 mean?

VPTQ-community org

I'm sorry if this is confusing. The model's name encodes the vector length, codebook (lookup table) size, and residual codebook size. For example, "Qwen2.5-72B-Instruct-v8-k65536-256-woft" refers to "Qwen2.5-72B-Instruct" quantized with: vector length 8, number of centroids 65536 (2^16), and number of residual centroids 256 (2^8). The equivalent bitwidth calculation is (see the sketch after the list below):

Index: log2(65536) / 8 = 16 / 8 = 2 bits per weight,
Residual Index: log2(256) / 8 = 8 / 8 = 1 bit per weight,
Total Bitwidth: 2 + 1 = 3 bits per weight,
Model Size Estimation: 70B parameters * 3 bits / 8 bits per byte ≈ 26.25 GB.
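Here is a minimal Python sketch of that arithmetic. The helper names (`vptq_equivalent_bitwidth`, `estimated_size_gb`) are just illustrative, not part of the VPTQ library, and the 70B figure is the same rough parameter count used above:

```python
import math

def vptq_equivalent_bitwidth(vector_len: int, centroids: int, residual_centroids: int) -> float:
    """Equivalent bits per weight for a VPTQ configuration (illustrative helper)."""
    index_bits = math.log2(centroids) / vector_len              # main index bits, spread over the vector
    residual_bits = math.log2(residual_centroids) / vector_len  # residual index bits, spread over the vector
    return index_bits + residual_bits

def estimated_size_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only size estimate in GB (ignores embeddings, norms, codebooks, etc.)."""
    return num_params_billion * bits_per_weight / 8  # billions of params * bits / 8 bits per byte

# Qwen2.5-72B-Instruct-v8-k65536-256-woft
bpw = vptq_equivalent_bitwidth(vector_len=8, centroids=65536, residual_centroids=256)
print(bpw)                         # 3.0 bits per weight
print(estimated_size_gb(70, bpw))  # ~26.25 GB
```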

You can refer to this table for an estimation of the bitwidth: https://github.com/microsoft/vptq?tab=readme-ov-file#models-from-open-source-community

V16 means the vector length is 16, i.e. each vector of 16 weights is represented by a single index.
For example, the model "Qwen2.5-72B-Instruct-v16-k65536-65536-woft" at https://huggingface.co/VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft uses a vector length of 16, with 65536 (2^16) centroids and 65536 (2^16) residual centroids. The equivalent bitwidth calculation is:
Index: log2(65536) / 16 = 16 / 16 = 1 bit per weight,
Residual Index: log2(65536) / 16 = 16 / 16 = 1 bit per weight,
Total Bitwidth: 1 + 1 = 2 bits per weight.
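Plugging these numbers into the same sketch as above gives the v16 figures (the ~17.5 GB value is just the same rough estimate applied to 2 bits per weight):

```python
bpw_v16 = vptq_equivalent_bitwidth(vector_len=16, centroids=65536, residual_centroids=65536)
print(bpw_v16)                         # 2.0 bits per weight
print(estimated_size_gb(70, bpw_v16))  # ~17.5 GB
```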
Typically, a larger bitwidth preserves more of the original model's quality, so the 3-bit v8 model should generally be a bit more capable than the 2-bit v16 models.
