TheBloke committed on
Commit b59e7d2
1 Parent(s): 31281d8

Update README.md

Files changed (1)
  1. README.md +15 -17
README.md CHANGED
@@ -19,14 +19,15 @@ license: other
 
 # Eric Hartford's WizardLM Uncensored Falcon 40B GGML
 
-These files are **experimental** GGML format model files for [Eric Hartford's WizardLM Uncensored Falcon 40B](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-40b).
+These files are GGCC format model files for [Eric Hartford's WizardLM Uncensored Falcon 40B](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-40b).
 
-These GGML files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.
+These files will **not** work in llama.cpp, text-generation-webui or KoboldCpp.
 
-They can be used from:
-* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui).
-* The ctransformers Python library, which includes LangChain support: [ctransformers](https://github.com/marella/ctransformers).
-* A new fork of llama.cpp that introduced this new Falcon GGML support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).
+GGCC is a new format created in a new fork of llama.cpp that introduced this new Falcon GGML-based support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).
+
+Currently these files will also not work with code that previously supported Falcon, such as LoLLMS Web UI and ctransformers. But support should be added soon.
+
+For GGMLv3 files compatible with those UIs, [please see the old `ggmlv3` branch](https://huggingface.co/TheBloke/falcon-40b-instruct-GGML/tree/ggmlv3).
 
 ## Repositories available
 
@@ -38,11 +39,7 @@ They can be used from:
 <!-- compatibility_ggml start -->
 ## Compatibility
 
-The recommended UI for these GGMLs is [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui). Preliminary CUDA GPU acceleration is provided.
-
-For use from Python code, use [ctransformers](https://github.com/marella/ctransformers), again with preliminary CUDA GPU acceleration.
-
-Or to build cmp-nct's fork of llama.cpp with Falcon 7B support plus preliminary CUDA acceleration, please try the following steps:
+To build cmp-nct's fork of llama.cpp with Falcon support plus CUDA acceleration, please try the following steps:
 
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
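The hunk above ends inside the build snippet, so for orientation, here is a sketch of the complete build sequence for that fork; the `GGML_CUBLAS` CMake flag name is an assumption about ggllm.cpp's build options and worth checking against its README:

```
# Clone cmp-nct's llama.cpp fork with Falcon support
git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp

# Fresh out-of-tree build with CUDA (cuBLAS) acceleration enabled
rm -rf build && mkdir build && cd build
cmake -DGGML_CUBLAS=1 ..
cmake --build . --config Release
```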
@@ -63,17 +60,18 @@ Adjust `-t 8` (the number of CPU cores to use) according to what performs best o
 
 `-b 1` reduces batch size to 1. This slightly lowers prompt evaluation time, but frees up VRAM to load more of the model on to your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
 
+Please see https://github.com/cmp-nct/ggllm.cpp for further details and instructions.
 <!-- compatibility_ggml end -->
 
 ## Provided files
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
-| wizard-falcon40b.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | Uses GGML_TYPE_Q2_K for all tensors. |
-| wizard-falcon40b.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q3_K for all tensors |
-| wizard-falcon40b.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | Uses GGML_TYPE_Q4_K for all tensors |
-| wizard-falcon40b.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | Uses GGML_TYPE_Q5_K for all tensors |
-| wizard-falcon40b.ggmlv3.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
-| wizard-falcon40b.ggmlv3.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
+| wizard-falcon40b.ggccv1.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | Uses GGML_TYPE_Q2_K for all tensors. |
+| wizard-falcon40b.ggccv1.q3_K.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q3_K for all tensors |
+| wizard-falcon40b.ggccv1.q4_K.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | Uses GGML_TYPE_Q4_K for all tensors |
+| wizard-falcon40b.ggccv1.q5_K.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | Uses GGML_TYPE_Q5_K for all tensors |
+| wizard-falcon40b.ggccv1.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
+| wizard-falcon40b.ggccv1.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
 
 **Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
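To ground the `-t 8` and `-b 1` flags discussed in the hunk above, a typical invocation of the fork's `falcon_main` binary might look like the following; the binary name, the `-ngl` GPU-offload flag, the file choice and the prompt are illustrative assumptions rather than text quoted from this commit:

```
# -t 8: CPU threads; -ngl 100: offload all layers to the GPU; -b 1: batch size 1 to free VRAM
bin/falcon_main -t 8 -ngl 100 -b 1 -m wizard-falcon40b.ggccv1.q4_K.bin -p "What is a falcon?\n### Response:"
```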
 
 
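Since the table lists plain `.bin` files, any of them can be fetched with an ordinary HTTP download; the repository path below is an assumed location for these files and should be adapted to wherever this README is actually hosted:

```
# Direct download of one quantised model file (repo path assumed)
wget https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML/resolve/main/wizard-falcon40b.ggccv1.q4_K.bin
```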