readme: add new llama.cpp release info
README.md CHANGED
@@ -7,21 +7,20 @@ tags:
 - deepseek
 - gguf
 - bf16
-- chinese
-- english
 metrics:
 - accuracy
+language:
+- en
+- zh
 ---
 
 # Deepseek-V2-Chat-GGUF
 
 Quantized from [https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat)
 
-Using llama.cpp
+Using llama.cpp b3026 for quantization
 
-
-
-# Warning: This will not work unless you compile llama.cpp from the repo provided (and set metadata KV overrides)!
+# Warning: This will not work unless you set metadata KV overrides, nor will it work in LM Studio or similar wrapper apps!
 
 # How to use:
 
@@ -79,27 +78,28 @@ quantize \
 # Quants:
 ```
 - bf16 [size: 439gb]
-- q8_0
+- q8_0 [estimated size: 233.27gb]
 - q4_k_m [size: 132gb]
 - q2_k [size: 80gb]
 - iq2_xxs [size: 61.5gb]
 - iq3_xs (uploading) [size: 89.6gb]
-- iq1_m [size: 27.3gb]
+- iq1_m (uploading) [size: 27.3gb]
+- q3_k_m (uploading) [size: 92.6gb]
 ```
 
 Note: Use iMatrix quants only if you can fully offload to GPU; otherwise speed will be affected a lot.
 
-# Planned Quants (
+# Planned Quants (weighted/imatrix):
 ```
 - q5_k_m
 - q5_k_s
-- q3_k_m
 - q6_k
 - iq4_nl
 - iq4_xs
 - iq2_xs
 - iq2_s
 - iq2_m
+- iq3_xxs
 - iq1_s (note: for fun only, this quant is likely useless)
 ```
 
@@ -113,7 +113,7 @@ deepseek2.expert_shared_count=int:2
 deepseek2.expert_feed_forward_length=int:1536
 deepseek2.experts_weight_scale=int:16
 deepseek2.leading_dense_block_count=int:1
-rope.scaling.yarn_log_multiplier=float:0.0707
+deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707
 ```
 
 A precompiled AVX2 build is available as `llama.cpp-039896407afd40e54321d47c5063c46a52da3e01.zip` in the root of this repo.
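The last hunk touches the metadata KV overrides that the warning in the first hunk refers to. A minimal sketch of how such overrides might be passed at load time is shown below, assuming llama.cpp's repeatable `--override-kv key=type:value` flag and the b3026-era `main` binary; the model filename, prompt, and the `-ngl 99` full-offload value are placeholders, not taken from this repo.

```bash
# Sketch only: load a quant with the DeepSeek-V2 metadata KV overrides applied.
# Each --override-kv uses the same key=type:value form shown in the hunk above.
# Model path, prompt, and the -ngl value are placeholders.
./main -m DeepSeek-V2-Chat.Q4_K_M.gguf \
  --override-kv deepseek2.expert_shared_count=int:2 \
  --override-kv deepseek2.expert_feed_forward_length=int:1536 \
  --override-kv deepseek2.experts_weight_scale=int:16 \
  --override-kv deepseek2.leading_dense_block_count=int:1 \
  --override-kv deepseek2.rope.scaling.yarn_log_multiplier=float:0.0707 \
  -ngl 99 \
  -p "Hello"
```

Per the note in the second hunk, `-ngl 99` (full GPU offload) is the scenario where the iMatrix quants are worth using; with partial offload the regular k-quants are the safer choice.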
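For the second hunk, whose context line is the README's `quantize \` invocation, here is a minimal sketch of how one of the listed quants could be produced with the b3026-era `quantize` tool. The input/output filenames and the `imatrix.dat` path are assumptions for illustration; `--imatrix` is only relevant for the weighted/imatrix (iq*) quants.

```bash
# Sketch only, not the exact command used for this repo:
# quantize the bf16 GGUF down to iq2_xxs using an importance matrix.
# Filenames and the imatrix path are placeholders.
./quantize \
  --imatrix imatrix.dat \
  DeepSeek-V2-Chat.BF16.gguf \
  DeepSeek-V2-Chat.IQ2_XXS.gguf \
  IQ2_XXS
```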