---
license: llama3.1
tags:
- java
- llama
- llama3
- gguf
- llama3.java
---

# GGUF models for llama3.java
Pure .gguf `Q4_0` and `Q8_0` quantizations of Llama 3 8B Instruct, ready to be consumed by [llama3.java](https://github.com/mukel/llama3.java).

In the wild, `Q8_0` quantizations are fine, but `Q4_0` quantizations are rarely pure: for example, the `output.weight` tensor is often quantized with `Q6_K` instead of `Q4_0`.  
A pure `Q4_0` quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the `quantize` utility from llama.cpp (named `llama-quantize` in recent builds) as follows:

```
./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```
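To check whether a given file is actually pure, one option is to read the tensor table from the GGUF header and look for multi-dimensional tensors that are not `Q4_0`. The following is a minimal sketch, not part of llama3.java or llama.cpp: the class `GgufPurityCheck` and its methods are hypothetical names, it assumes a little-endian GGUF file of version 2 or later, and it assumes that 1-D tensors (such as norm weights) are left unquantized even in a pure file, so they are ignored.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of llama3.java): lists tensors that break Q4_0 purity. */
final class GgufPurityCheck {
    static final int GGML_TYPE_Q4_0 = 2; // ggml type id of Q4_0

    record TensorInfo(String name, int nDims, int ggmlType) {}

    // Byte widths of fixed-size metadata value types, indexed by GGUF value type id.
    // Ids 8 (STRING) and 9 (ARRAY) are variable-length and handled in skipValue.
    private static final int[] FIXED_SIZE = {1, 1, 2, 2, 4, 4, 4, 1, -1, -1, 8, 8, 8};

    // Assumes a little-endian file; DataInputStream is big-endian, so swap bytes.
    private static int u32(DataInputStream in) throws IOException {
        return Integer.reverseBytes(in.readInt());
    }

    private static long u64(DataInputStream in) throws IOException {
        return Long.reverseBytes(in.readLong());
    }

    private static String string(DataInputStream in) throws IOException {
        byte[] bytes = new byte[Math.toIntExact(u64(in))];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    private static void skipValue(DataInputStream in, int type) throws IOException {
        if (type == 8) {                       // STRING: length-prefixed bytes
            in.skipNBytes(u64(in));
        } else if (type == 9) {                // ARRAY: element type, count, elements
            int elementType = u32(in);
            long count = u64(in);
            for (long i = 0; i < count; i++) skipValue(in, elementType);
        } else if (type >= 0 && type < FIXED_SIZE.length) {
            in.skipNBytes(FIXED_SIZE[type]);
        } else {
            throw new IOException("unknown metadata value type " + type);
        }
    }

    /** Reads the header and returns one entry per tensor, in file order. */
    static List<TensorInfo> readTensorInfos(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(raw));
        if (u32(in) != 0x46554747) throw new IOException("not a GGUF file"); // "GGUF"
        int version = u32(in);
        if (version < 2) throw new IOException("unsupported GGUF version " + version);
        long tensorCount = u64(in);
        long metadataCount = u64(in);
        for (long i = 0; i < metadataCount; i++) {
            string(in);               // key, ignored
            skipValue(in, u32(in));   // value, ignored
        }
        List<TensorInfo> infos = new ArrayList<>();
        for (long i = 0; i < tensorCount; i++) {
            String name = string(in);
            int nDims = u32(in);
            in.skipNBytes(8L * nDims); // dimensions, one u64 each
            int ggmlType = u32(in);
            in.skipNBytes(8);          // data offset
            infos.add(new TensorInfo(name, nDims, ggmlType));
        }
        return infos;
    }

    /** Assumption: 1-D tensors (e.g. norms) stay unquantized even in a pure file. */
    static List<TensorInfo> impure(List<TensorInfo> infos) {
        return infos.stream()
                .filter(t -> t.nDims() > 1 && t.ggmlType() != GGML_TYPE_Q4_0)
                .toList();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            List<TensorInfo> offenders = impure(readTensorInfos(in));
            offenders.forEach(t -> System.out.println(t.name() + " has ggml type " + t.ggmlType()));
            System.out.println(offenders.isEmpty() ? "pure Q4_0" : "not pure Q4_0");
        }
    }
}
```

With a recent JDK this can be run directly as `java GgufPurityCheck.java ./Meta-Llama-3-8B-Instruct-Q4_0.gguf`. Only the header is read, so the check does not need to load the tensor data itself.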

# Meta-Llama-3.1-8B-Instruct-GGUF

- This is a GGUF-quantized version of [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), created using llama.cpp.