mukel committed
Commit 0d1ee3a
1 Parent(s): cdd8086

Update README.md

Files changed (1): README.md (+23, -3)
README.md CHANGED
---
license: llama3.1
tags:
- java
- llama
- llama3
- gguf
- llama3.java
---

# GGUF models for llama3.java
Pure `Q4_0` and `Q8_0` .gguf quantizations of Llama 3 8B Instruct, ready to be consumed by [llama3.java](https://github.com/mukel/llama3.java).
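As a usage sketch, one of these files can be passed straight to llama3.java. The invocation below assumes jbang and the `--model`/`--prompt` flags from the llama3.java README; run with `--help` to confirm the exact options for your version:

```
# Sketch: run llama3.java against the Q4_0 file (flag names assumed from the
# llama3.java README; check --help for the authoritative options).
jbang Llama3.java --model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf \
                  --prompt "Why is the sky blue?"
```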

In the wild, `Q8_0` quantizations are fine, but `Q4_0` quantizations are rarely pure: for example, the `output.weights` tensor is often quantized with `Q6_K` instead of `Q4_0`.
A pure `Q4_0` quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the `quantize` utility from llama.cpp as follows:

```
./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```
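
The high-precision source can itself be produced from the original Hugging Face checkpoint with llama.cpp's conversion script. A minimal sketch, assuming the script name and flags of llama.cpp at the time (newer releases rename the script to `convert_hf_to_gguf.py` and the `quantize` binary to `llama-quantize`):

```
# Sketch: derive a high-precision .gguf from the original checkpoint
# (script name and flags are assumptions tied to the llama.cpp version).
python convert-hf-to-gguf.py ./Meta-Llama-3-8B-Instruct \
    --outtype f32 --outfile ./Meta-Llama-3-8B-Instruct-F32.gguf
```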

# Meta-Llama-3.1-8B-Instruct-GGUF

- This is a GGUF quantized version of [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), created using llama.cpp.
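
For completeness, a sketch of fetching one of the quantized files with `huggingface-cli`; the repo id and filename here are assumptions based on this model card, so check the repository's file list for the exact names:

```
# Sketch: download a quantized file (repo id and filename are assumptions).
huggingface-cli download mukel/Meta-Llama-3.1-8B-Instruct-GGUF \
    Meta-Llama-3.1-8B-Instruct-Q4_0.gguf --local-dir .
```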