Awesome model

by dillfrescott

Now that Reflection has turned out to be a cash grab, THIS model should be the real star of the moment. Very good model all around, IMO.

Please, could you give some insight into why you feel this model is so good? I just assumed this was a 'unification' model that combined the strengths of V2-Chat and Coder-V2-Instruct without really changing anything. Some explanation and examples would be appreciated.

tl;dr:

It runs inference quite fast even with most of the model offloaded to CPU RAM.

Details

I skipped over V2-Chat and Coder-V2-Instruct and just tried this V2.5 on llama.cpp with bartowski/DeepSeek-V2.5-GGUF IQ3_XXS on my R9 9950X w/ 96GiB RAM and 1x 3090TI FE w/ 24GiB VRAM.
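
For anyone who wants to try the same thing, this is roughly how I'm loading it through llama-cpp-python. It's just a sketch; the model path, layer count, and thread count are placeholders for my particular box, not recommended values:

```python
# Minimal sketch: load the IQ3_XXS GGUF with partial GPU offload via llama-cpp-python.
# Path, n_gpu_layers, and n_threads are placeholders for my setup; adjust for yours.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/DeepSeek-V2.5-IQ3_XXS.gguf",  # placeholder path to the bartowski quant
    n_gpu_layers=12,   # offload whatever fits in the 3090 TI's 24GiB; the rest stays in CPU RAM
    n_ctx=1024,        # small context, since anything larger OOMs easily on this hardware
    n_threads=16,      # physical cores on the 9950X
)

# Raw completion call, which sidesteps the chat-template issue mentioned below.
out = llm("Write a short explanation of mixture-of-experts models.", max_tokens=128)
print(out["choices"][0]["text"])
```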

llama.cpp support doesn't seem complete yet; at minimum the chat template isn't picked up, and there are still crashes:

  1. The chat template that comes with this model is not yet supported, falling back to chatml.
  2. Deepseek2 does not support K-shift
  3. flash_attn requires n_embd_head_k == n_embd_head_v - forcing off

However, all that said, IT IS FAST! I'm getting 6-7 tok/sec, compared to Mistral-Large which gets barely 1-2 tok/sec with similar RAM usage. To be fair, I only have 1k context right now since it OOMs easily on my limited hardware, and I haven't fiddled much with KV cache quants given the flash_attn issue in 3 above.
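
In case it's useful, here's roughly how I'm eyeballing tok/sec, plus the K-cache quant knob I haven't really explored. Everything here is a rough sketch with placeholder values; `type_k` only applies if your llama-cpp-python version exposes it, and (as I understand it) quantizing the V cache needs flash_attn, which gets forced off for this arch anyway:

```python
# Rough tok/sec check plus an (untested here) K-cache quantization knob.
# Values are placeholders for my machine, not recommendations.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/DeepSeek-V2.5-IQ3_XXS.gguf",  # placeholder path
    n_gpu_layers=12,
    n_ctx=1024,
    type_k=8,  # GGML_TYPE_Q8_0: 8-bit K cache; leaving type_v alone, since V-cache quant needs flash_attn
)

start = time.time()
out = llm("Explain what a K-shift is in llama.cpp.", max_tokens=256)
elapsed = time.time() - start
n_gen = out["usage"]["completion_tokens"]
print(f"{n_gen} tokens in {elapsed:.1f}s -> {n_gen / elapsed:.1f} tok/sec")
```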

I wanted to try a different inference engine that supports offloading, but couldn't get ktransformers to build (might need an older Python, 3.11 or 3.10 maybe)...

Anyone else having luck with it? What local hardware / inference engine are you using?

I tried ktransformers but couldn't get it to work on Windows. Good to know about the DeepSeek tok/sec speed; I might have to give it a try with my 3090 and 96GB RAM. Do you know how to solve the K-shift error? I'm using oobabooga and no matter what I try it still crashes.

@sm54 thanks for your report. Yeah, ktransformers seems a bit tricky to get running, likely due to Python wheel stuff (I'm trying on Linux).

The relevant bit from the Deepseek2 K-shift GitHub issue linked above is:

num_predict must be less than or equal to num_ctx / process count.

I'm not 100% sure, but for now I'm just capping my n-predict a bit below the ctx-size; see the sketch below. However, it still crashes on longer generations or larger prompts...
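
Concretely, the workaround looks something like this. Just a sketch, with an arbitrary safety margin and a stand-in prompt:

```python
# Keep prompt tokens + generated tokens inside n_ctx so the context never needs to shift,
# which is what seems to trigger the Deepseek2 K-shift crash. The margin is a guess.
from llama_cpp import Llama

N_CTX = 1024
llm = Llama(model_path="path/to/DeepSeek-V2.5-IQ3_XXS.gguf", n_gpu_layers=12, n_ctx=N_CTX)  # placeholder path

prompt = "Summarize the tradeoffs of mixture-of-experts models in a few sentences."
prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))
max_tokens = N_CTX - prompt_tokens - 16  # stay a bit under the limit, per the issue above

out = llm(prompt, max_tokens=max_tokens)
print(out["choices"][0]["text"])
```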

Ahh, I see: DeepSeek-V2.5 is also an MoE, so only about 21B of its parameters are active per token, which is why it runs inference so much faster... now if only it wouldn't crash out with the K-shift error and actually supported flash attention so I could get past 1k context...
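
As a back-of-envelope sanity check on the speed difference (assuming decode is memory-bandwidth bound and roughly 3.2 bits per weight for these quants; both figures are rough):

```python
# Rough arithmetic: decode speed scales with bytes read per token, i.e. with *active* parameters.
BITS_PER_WEIGHT = 3.2           # roughly IQ3_XXS-ish
ACTIVE_PARAMS_DEEPSEEK = 21e9   # DeepSeek-V2.5 activates ~21B of its 236B params per token
PARAMS_MISTRAL_LARGE = 123e9    # Mistral-Large is dense, so all ~123B params are read every token

def gb_per_token(params: float, bits: float = BITS_PER_WEIGHT) -> float:
    return params * bits / 8 / 1e9

print(f"DeepSeek-V2.5 reads ~{gb_per_token(ACTIVE_PARAMS_DEEPSEEK):.1f} GB per token")
print(f"Mistral-Large reads ~{gb_per_token(PARAMS_MISTRAL_LARGE):.1f} GB per token")
# Roughly a 6x gap, which lines up with the 6-7 vs 1-2 tok/sec numbers above.
```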
