InferenceIllusionist committed
Commit: 756c8c2
Parent(s): 888d036

Update README.md
Formatting and clarity.
README.md CHANGED
@@ -29,21 +29,19 @@ Other front-ends like the main branch of llama.cpp, kobold.cpp, and text-generat
 Quantized from Mistral-Nemo-Instruct-2407 fp16
 * Weighted quantizations were created using fp16 GGUF and groups_merged.txt in 92 chunks and n_ctx=512
 * Static fp16 will also be included in repo
-
-
-
-<i>All quants are verified working prior to uploading to repo for your safety and convenience</i>
+* For a brief rundown of iMatrix quant performance please see this [PR](https://github.com/ggerganov/llama.cpp/pull/5747)
+* <i>All quants are verified working prior to uploading to repo for your safety and convenience</i>
 
 <b>KL-Divergence Reference Chart</b>
 (Click on image to view in full size)
 [<img src="https://i.imgur.com/mV0nYdA.png" width="920"/>](https://i.imgur.com/mV0nYdA.png)
 
 
-<b>
-
-If you have all Ampere generation or newer cards, you can use flash attention like so: `-fa`
-
-
+<b>Quant-specific Tips:</b>
+* If you are getting a `cudaMalloc failed: out of memory` error, try passing an argument for lower context in llama.cpp, e.g. for 8k: `-c 8192`
+* If you have all Ampere generation or newer cards, you can use flash attention like so: `-fa`
+* Provided flash attention is enabled, you can also use quantized cache to save on VRAM, e.g. for 8-bit: `-ctk q8_0 -ctv q8_0`
+* Mistral recommends a temperature of 0.3 for this model
 
 
 Original model card can be found [here](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
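For context on how weighted quants like those described above are produced, here is a minimal sketch of the llama.cpp importance-matrix workflow. The two binaries and their flags are standard llama.cpp tooling; the model filenames, the `imatrix.dat` output name, and the IQ4_XS target are assumptions for illustration, not the uploader's exact commands.

```bash
# Sketch (assumed filenames): compute an importance matrix from the
# fp16 GGUF using the groups_merged.txt calibration text at n_ctx=512.
./llama-imatrix -m Mistral-Nemo-Instruct-2407-fp16.gguf \
    -f groups_merged.txt -c 512 -o imatrix.dat

# Apply the importance matrix while quantizing, e.g. to IQ4_XS.
./llama-quantize --imatrix imatrix.dat \
    Mistral-Nemo-Instruct-2407-fp16.gguf \
    Mistral-Nemo-Instruct-2407-IQ4_XS.gguf IQ4_XS
```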
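The quant-specific tips added in this revision compose into a single command line. A usage sketch, assuming `llama-cli` from a recent llama.cpp build and an illustrative model filename (`--temp 0.3` follows Mistral's recommendation quoted above):

```bash
# 8k context to avoid cudaMalloc OOM, flash attention (Ampere or
# newer GPUs), 8-bit quantized KV cache, and temperature 0.3.
./llama-cli -m Mistral-Nemo-Instruct-2407-IQ4_XS.gguf \
    -c 8192 -fa -ctk q8_0 -ctv q8_0 --temp 0.3
```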