InferenceIllusionist committed
Commit: 756c8c2
Parent(s): 888d036

Update README.md
Formatting and clarity.
README.md CHANGED
@@ -29,21 +29,19 @@ Other front-ends like the main branch of llama.cpp, kobold.cpp, and text-generat
 Quantized from Mistral-Nemo-Instruct-2407 fp16
 * Weighted quantizations were created using fp16 GGUF and groups_merged.txt in 92 chunks and n_ctx=512
 * Static fp16 will also be included in repo
-
-
-
-<i>All quants are verified working prior to uploading to repo for your safety and convenience</i>
+* For a brief rundown of iMatrix quant performance please see this [PR](https://github.com/ggerganov/llama.cpp/pull/5747)
+* <i>All quants are verified working prior to uploading to repo for your safety and convenience</i>
 
 <b>KL-Divergence Reference Chart</b>
 (Click on image to view in full size)
 [<img src="https://i.imgur.com/mV0nYdA.png" width="920"/>](https://i.imgur.com/mV0nYdA.png)
 
 
-<b>
-
-If you have all Ampere generation or newer cards, you can use flash attention like so: `-fa`
-
-
+<b>Quant-specific Tips:</b>
+* If you are getting a `cudaMalloc failed: out of memory` error, try passing an argument for lower context in llama.cpp, e.g. for 8k: `-c 8192`
+* If you have all Ampere generation or newer cards, you can use flash attention like so: `-fa`
+* Provided flash attention is enabled, you can also use quantized cache to save on VRAM, e.g. for 8-bit: `-ctk q8_0 -ctv q8_0`
+* Mistral recommends a temperature of 0.3 for this model
 
 
 Original model card can be found [here](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
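For context on how weighted quants like those described above are produced, here is a minimal sketch of the llama.cpp importance-matrix workflow. The two binaries and their flags are standard llama.cpp tooling; the model filenames, the `imatrix.dat` output name, and the IQ4_XS target are assumptions for illustration, not the uploader's exact commands.

```bash
# Sketch (assumed filenames): compute an importance matrix from the
# fp16 GGUF using the groups_merged.txt calibration text at n_ctx=512.
./llama-imatrix -m Mistral-Nemo-Instruct-2407-fp16.gguf \
    -f groups_merged.txt -c 512 -o imatrix.dat

# Apply the importance matrix while quantizing, e.g. to IQ4_XS.
./llama-quantize --imatrix imatrix.dat \
    Mistral-Nemo-Instruct-2407-fp16.gguf \
    Mistral-Nemo-Instruct-2407-IQ4_XS.gguf IQ4_XS
```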
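The quant-specific tips added in this revision compose into a single command line. A usage sketch, assuming `llama-cli` from a recent llama.cpp build and an illustrative model filename (`--temp 0.3` follows Mistral's recommendation quoted above):

```bash
# 8k context to avoid cudaMalloc OOM, flash attention (Ampere or
# newer GPUs), 8-bit quantized KV cache, and temperature 0.3.
./llama-cli -m Mistral-Nemo-Instruct-2407-IQ4_XS.gguf \
    -c 8192 -fa -ctk q8_0 -ctv q8_0 --temp 0.3
```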