TheBloke committed on
Commit
d11efd9
1 Parent(s): 303874e

Update README.md

Files changed (1): README.md (+8 -4)
README.md CHANGED
@@ -39,7 +39,7 @@ The can be used with a new fork of llama.cpp that adds Falcon GGML support: [cmp
 <!-- compatibility_ggml start -->
 ## Compatibility
 
-To build cmp-nct's fork of llama.cpp with Falcon 40B support plus preliminary CUDA acceleration, please follow the following steps:
+To build cmp-nct's fork of llama.cpp with Falcon 40B support plus preliminary CUDA acceleration, please try the following steps:
 
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
@@ -48,12 +48,16 @@ git checkout cuda-integration
 rm -rf build && mkdir build && cd build && cmake -DGGML_CUBLAS=1 .. && cmake --build . --config Release
 ```
 
-You can then use `bin/falcon_main` just like you would use llama.cpp. For example:
+Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VScode. When compiling with CUDA support using the Microsoft compiler it's essential to select the "Community edition build tools". Otherwise CUDA won't compile.'
+
+Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
 ```
-bin/falcon_main -t 1 -ngl 100 -m /workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
+bin/falcon_main -t 8 -ngl 100 -m /workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
 ```
 
-As with llama.cpp, if you can fully offload the model to VRAM you should use `-t 1` for maximum performance. If not, use more threads, eg `-t 8`.
+Using `-ngl 100` will offload all layers to GPU. If you do not have enough VRAM for this, either lower the number or try a smaller quant size as otherwise performance will be severely affected.
+
+Adjust `-t 8` according to what performs best on your system. Do not exceed the number of physical CPU cores you have.
 
 <!-- compatibility_ggml end -->
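Taken together, the README lines touched by this commit amount to a build-then-run sequence. The sketch below assembles the example invocation from the diff without executing the build (so it is safe to run anywhere); the repo URL and model path are the ones quoted in the diff, while the thread-count heuristic (halving `nproc` on SMT machines to approximate physical cores) is an assumption, not something the README prescribes.

```shell
# Build steps from the diff, shown as reference (network + CUDA toolchain required):
#   git clone https://github.com/cmp-nct/ggllm.cpp
#   cd ggllm.cpp
#   rm -rf build && mkdir build && cd build
#   cmake -DGGML_CUBLAS=1 .. && cmake --build . --config Release

# Pick a thread count. nproc reports logical CPUs; the README says not to
# exceed physical cores, so on SMT machines halving is a rough approximation.
LOGICAL=$(nproc)
THREADS=$(( LOGICAL > 1 ? LOGICAL / 2 : 1 ))

# Assemble the example invocation from the diff. -ngl 100 offloads all
# layers to GPU; lower it (or use a smaller quant) if VRAM is insufficient.
CMD="bin/falcon_main -t ${THREADS} -ngl 100 -m /workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin -p \"What is a falcon?\n### Response:\""
echo "${CMD}"
```

The `echo` is deliberate: it prints the command for inspection rather than running a binary that only exists after the build above has succeeded.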