Update README.md
README.md CHANGED
@@ -25,9 +25,12 @@ license: apache-2.0
 
 These files are **experimental** GGML format model files for [Falcon 40B Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
 
-
+They cannot be used with text-generation-webui, llama.cpp, or KoboldCpp at this time.
 
-They can be used with a new fork of llama.cpp that adds Falcon GGML support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp)
+They can be used with:
+* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui).
+* The ctransformers Python library, which includes LangChain support: [ctransformers](https://github.com/marella/ctransformers).
+* A new fork of llama.cpp that introduced this new Falcon GGML support: [cmp-nct/ggllm.cpp](https://github.com/cmp-nct/ggllm.cpp).
 
 ## Repositories available
 
@@ -35,11 +38,15 @@ They can be used with a new fork of llama.cpp that adds Falcon GGML support: [cm
 * [3-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ)
 * [2, 3, 4, 5, 6, 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GGML)
 * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-40b-instruct)
-
+
 <!-- compatibility_ggml start -->
 ## Compatibility
 
-
+The recommended UI for these GGMLs is [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui). Preliminary CUDA GPU acceleration is provided.
+
+For use from Python code, use [ctransformers](https://github.com/marella/ctransformers). Again, with preliminary CUDA GPU acceleration.
+
+Or to build cmp-nct's fork of llama.cpp with Falcon 40B support plus preliminary CUDA acceleration, please try the following steps:
 
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
@@ -51,7 +58,7 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS
 
 Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
 ```
-bin/falcon_main -t 8 -ngl 100 -b 1 -m
+bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon40b-instruct.ggmlv3.q4_0.bin -p "What is a falcon?\n### Response:"
 ```
 
 You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much VRAM is available to be used.
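The updated Compatibility section points Python users at the ctransformers library. As a minimal sketch of what that can look like (the local file name, `gpu_layers` value, and sampling settings below are illustrative assumptions, not values taken from this README):

```python
# Minimal sketch: loading a Falcon GGML file with ctransformers.
# The file name, gpu_layers and sampling values are illustrative assumptions;
# point it at whichever quantised .bin file you downloaded from this repo.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "falcon40b-instruct.ggmlv3.q4_0.bin",  # path to a downloaded GGML file (assumed name)
    model_type="falcon",                   # select ctransformers' Falcon backend
    gpu_layers=50,                         # layers to offload to GPU, if built with CUDA
)

# Same prompt style as the falcon_main example above.
prompt = "What is a falcon?\n### Response:"
print(llm(prompt, max_new_tokens=200, temperature=0.7))
```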
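The README also notes that ctransformers includes LangChain support. A similarly hedged sketch of driving the same GGML file through LangChain's CTransformers wrapper (again, the file name and config values are assumptions for illustration):

```python
# Sketch: using a Falcon GGML file from LangChain via its CTransformers wrapper.
# Model path and config values are illustrative assumptions.
from langchain import LLMChain, PromptTemplate
from langchain.llms import CTransformers

llm = CTransformers(
    model="falcon40b-instruct.ggmlv3.q4_0.bin",  # path to a downloaded GGML file (assumed name)
    model_type="falcon",
    config={"max_new_tokens": 200, "temperature": 0.7},
)

prompt = PromptTemplate(
    template="{question}\n### Response:",  # same "### Response:" prompt style as above
    input_variables=["question"],
)
chain = LLMChain(prompt=prompt, llm=llm)
print(chain.run("What is a falcon?"))
```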