TheBloke committed
Commit 52df286
1 Parent(s): b506f71

Update README.md

Files changed (1)
  1. README.md +17 -19
README.md CHANGED
@@ -23,30 +23,27 @@ license: apache-2.0
 
 # Falcon 40B-Instruct GGML
 
- These files are **experimental** GGML format model files for [Falcon 40B Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
 
- These GGML files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.
 
- They can be used with:
- * [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui).
- * The ctransformers Python library, which includes LangChain support: [ctransformers](https://github.com/marella/ctransformers).
- * A new fork of llama.cpp that introduced this new Falcon GGML support: [cmp-nc/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).
 
 ## Repositories available
 
 * [4-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ)
 * [3-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ)
- * [2, 3, 4, 5, 6, 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GGML)
 * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-40b-instruct)
 
 <!-- compatibility_ggml start -->
 ## Compatibility
 
- The recommended UI for these GGMLs is [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui). Preliminary CUDA GPU acceleration is provided.
-
- For use from Python code, use [ctransformers](https://github.com/marella/ctransformers). Again, with preliminary CUDA GPU acceleration
-
- Or to build cmp-nct's fork of llama.cpp with Falcon 7B support plus preliminary CUDA acceleration, please try the following steps:
 
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
@@ -67,18 +64,19 @@ Adjust `-t 8` (the number of CPU cores to use) according to what performs best on your system.
 
 `-b 1` reduces batch size to 1. This slightly lowers prompt evaluation time, but frees up VRAM to load more of the model on to your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
 
 <!-- compatibility_ggml end -->
 
 ## Provided files
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
- | falcon40b-instruct.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
- | falcon40b-instruct.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
- | falcon40b-instruct.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
- | falcon40b-instruct.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
- | falcon40b-instruct.ggmlv3.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
- | falcon40b-instruct.ggmlv3.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
-
 
 **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
 
 # Falcon 40B-Instruct GGML
 
+ These files are GGCC format model files for [Falcon 40B Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
 
+ These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.
 
+ GGCC is a new format created in a fork of llama.cpp that introduces this new Falcon GGML-based support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).
+
+ Currently these files will also not work with code that previously supported Falcon, such as LoLLMS Web UI and ctransformers, but support should be added soon.
+
+ For GGMLv3 files compatible with those UIs, please see the `ggmlv3` branch.
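For illustration, one way to fetch a single file from that branch is with git and git-lfs. This is a sketch rather than text from the README; it assumes git-lfs is installed, and the q4_K_S filename is taken from the GGMLv3 file list above.

```
# Fetch only the ggmlv3 branch, deferring the large LFS weight files
GIT_LFS_SKIP_SMUDGE=1 git clone --branch ggmlv3 --single-branch \
  https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
cd falcon-40b-instruct-GGML

# Pull just the one quantisation you want
git lfs pull --include "falcon40b-instruct.ggmlv3.q4_K_S.bin"
```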
 
 ## Repositories available
 
 * [4-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ)
 * [3-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ)
+ * [2, 3, 4, 5, 6, 8-bit GGCC models for CPU+GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GGML)
 * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-40b-instruct)
 
 <!-- compatibility_ggml start -->
 ## Compatibility
 
+ To build cmp-nct's fork of llama.cpp with Falcon support plus CUDA acceleration, please try the following steps:
 
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
@@ -67,18 +64,19 @@ Adjust `-t 8` (the number of CPU cores to use) according to what performs best on your system.
 
  `-b 1` reduces batch size to 1. This slightly lowers prompt evaluation time, but frees up VRAM to load more of the model on to your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
 
+ Please see https://github.com/cmp-nct/ggllm.cpp for further details and instructions.
+
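For illustration, a typical invocation after building might look like the sketch below. It is not taken from the README: the `bin/falcon_main` binary name and the `-m`/`-p` flags are assumptions based on the fork's llama.cpp-style CLI, while `-t 8` and `-b 1` are the options described above.

```
# Run a GGCC quantisation from the ggllm.cpp build directory
./bin/falcon_main -t 8 -b 1 \
  -m /path/to/falcon40b-instruct.ggccv1.q4_K_S.bin \
  -p "What is a falcon?"
```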
  <!-- compatibility_ggml end -->
 
 ## Provided files
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
+ | falcon40b-instruct.ggccv1.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
+ | falcon40b-instruct.ggccv1.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
+ | falcon40b-instruct.ggccv1.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
+ | falcon40b-instruct.ggccv1.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
+ | falcon40b-instruct.ggccv1.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
+ | falcon40b-instruct.ggccv1.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
 
  **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
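As a rough sketch of that trade-off (an illustration, not README text; the `-ngl` layer-offload flag and the layer count are assumptions to adjust for your GPU):

```
# Offload some layers to the GPU: RAM use drops and VRAM use rises.
# Lower the layer count if you run out of VRAM.
./bin/falcon_main -t 8 -b 1 -ngl 40 \
  -m /path/to/falcon40b-instruct.ggccv1.q3_K_S.bin \
  -p "What is a falcon?"
```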