Update README.md
README.md CHANGED
@@ -23,30 +23,27 @@ license: apache-2.0
# Falcon 40B-Instruct GGML

These files are GGCC format model files for [Falcon 40B Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.

GGCC is a new format created in a new fork of llama.cpp that introduces this Falcon GGML-based support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).

Currently these files will also not work with code that previously supported Falcon, such as LoLLMs Web UI and ctransformers, but support should be added soon.

For GGMLv3 files compatible with those UIs, please see the `ggmlv3` branch.
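
That branch can also be cloned directly with git if preferred (an illustrative command, not from this README; `git lfs` must be installed to fetch the large model files themselves):

```
git clone --branch ggmlv3 --single-branch https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
```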
## Repositories available

* [4-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ)
* [3-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ)
* [2, 3, 4, 5, 6, 8-bit GGCC models for CPU+GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GGML)
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-40b-instruct)

<!-- compatibility_ggml start -->
## Compatibility

To build cmp-nct's fork of llama.cpp with Falcon support plus CUDA acceleration, please try the following steps:

```
git clone https://github.com/cmp-nct/ggllm.cpp
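# The diff hunk ends at the clone step above. The lines below are an illustrative
# sketch of a typical CMake build (assumed, not taken from this README); see the
# ggllm.cpp repository for the exact commands and its CUDA/cuBLAS build option.
cd ggllm.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release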
```

@@ -67,18 +64,19 @@ Adjust `-t 8` (the number of CPU cores to use) according to what performs best on your system

`-b 1` reduces batch size to 1. This slightly lowers prompt evaluation time, but frees up VRAM to load more of the model on to your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
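
As a rough illustration of how these flags combine (the binary name and exact options below are assumptions, not taken from this README; check the ggllm.cpp documentation), an invocation might look like:

```
# assumed example: 8 CPU threads, batch size 1, and a model file from the table below
./bin/falcon_main -t 8 -b 1 \
  -m falcon40b-instruct.ggccv1.q4_K_S.bin \
  -p "Write a short story about a falcon"
# if built with CUDA, the fork's GPU-offload flag (e.g. -ngl) can be added as well
```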
Please see https://github.com/cmp-nct/ggllm.cpp for further details and instructions.
<!-- compatibility_ggml end -->
## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| falcon40b-instruct.ggccv1.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, and GGML_TYPE_Q2_K for the other tensors. |
| falcon40b-instruct.ggccv1.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
| falcon40b-instruct.ggccv1.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
| falcon40b-instruct.ggccv1.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
| falcon40b-instruct.ggccv1.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | New k-quant method. Uses GGML_TYPE_Q8_K (6-bit quantization) for all tensors. |
| falcon40b-instruct.ggccv1.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
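
A single quantised file can also be fetched directly over HTTPS using the standard Hugging Face download URL pattern (shown here for the q4_K_S file; substitute any filename from the table above):

```
wget https://huggingface.co/TheBloke/falcon-40b-instruct-GGML/resolve/main/falcon40b-instruct.ggccv1.q4_K_S.bin
```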