IQ1_S or IQ1_M for low RAM/VRAM computers
Or if you upload the imatrix.dat, that would be very welcome for low-end computers.
So I tried to do 1-bit and it asked for imatrix data! I have never done that; could you tell me how? I can build the imatrix and share all the IQ1 models quickly.
- Grab a copy of group_10_merged.txt from https://github.com/ggerganov/llama.cpp/discussions/5263
- W/ the f16 gguf file, run: ~/llama.cpp/imatrix -m ggml-model.f16.gguf -f group_10_merged.txt
- Wait a while;
- When running quantize, add this arg: --imatrix imatrix.dat
\o/
(The quality of all your other low-bit-rate quantizations will improve as well!)
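For reference, here is a minimal end-to-end sketch of those two steps, assuming the stock imatrix and quantize binaries from a llama.cpp checkout and placeholder file names:

```sh
# Build the importance matrix from the f16 model and a calibration text
# (imatrix writes imatrix.dat by default; -o just makes the output explicit)
~/llama.cpp/imatrix -m ggml-model.f16.gguf -f group_10_merged.txt -o imatrix.dat

# Quantize with the importance matrix applied
~/llama.cpp/quantize --imatrix imatrix.dat ggml-model.f16.gguf ggml-model.IQ1_S.gguf IQ1_S
```

quantize takes the input gguf, the output gguf, and the quantization type as positional arguments, so the same command shape works for any of the IQ/Q types.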
Will do this in an hour! Thanks a lot! So I do this for IQ1_S and IQ1_M?
At minimum, yes. The same imatrix.dat file can be used for all quantization levels, though - it would be good to remake the IQ* quants at least, plus any of the others you can!
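Since the same imatrix.dat works for every level, a rough sketch of re-quantizing several types in one go (the quant list and file names here are just examples):

```sh
# Reuse one imatrix.dat across several quantization types
for T in IQ1_S IQ1_M IQ2_XS IQ3_XXS Q4_K_M; do
  ~/llama.cpp/quantize --imatrix imatrix.dat ggml-model.f16.gguf ggml-model.$T.gguf $T
done
```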
You seem to have more knowledge about this imatrix - is it used for all the quantized models starting with IQ, regardless of their size? If so, why doesn't it happen automatically inside the quantize script? (Just asking out of curiosity.)
I still have the 16-bit, which takes forever to make. I will do the imatrix and start with the 1-bit quants, then see what other IQ variants I have.
imatrix.dat is effective for quants Q5_K_M or smaller. Even the perplexity of Q4_0 or Q3_K_S will improve with imatrix.dat.
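If you want to measure that, llama.cpp also ships a perplexity tool; a sketch of comparing the same quant made with and without the imatrix (file names are placeholders, and wiki.test.raw is just one common evaluation text):

```sh
# Lower perplexity on the same text means the quant preserved more of the model
~/llama.cpp/perplexity -m ggml-model.Q3_K_S.gguf -f wiki.test.raw
~/llama.cpp/perplexity -m ggml-model.Q3_K_S-imatrix.gguf -f wiki.test.raw
```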
So not the groups_merged.txt, but the group_10_merged.txt?
I prefer groups_merged.txt but it’s up to you. Someone uses wiki.train.raw from wikitext and it is very large.
OK, I'll go with groups_merged.txt, which seems to be more diverse.
system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 110.547 ms
compute_imatrix: computing over 105 chunks with batch_size 512
compute_imatrix: 62.96 seconds per pass - ETA 1 hours 50.17 minutes
[1]2.9595,[2]2.4039,
I have uploaded both IQ1_S and IQ1_M; the IQ1_M took a long time! I think the imatrix made this one take much longer. I'll see if I can evaluate the other quants and see how much difference the imatrix would make.
I really appreciate it! Thank you very much!!!
Thank you for sharing how to do imatrix, appreciate it! :)