Initial GPTQ model commit
README.md
CHANGED
@@ -1,4 +1,14 @@
 ---
+extra_gated_button_content: Submit
+extra_gated_description: This is a form to enable access to Llama 2 on Hugging Face
+  after you have been granted access from Meta. Please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads)
+  and accept our license terms and acceptable use policy before submitting this form.
+  Requests will be processed in 1-2 days.
+extra_gated_fields:
+  ? I agree to share my name, email address and username with Meta and confirm that
+    I have already been granted download access on the Meta website
+  : checkbox
+extra_gated_heading: Access Llama 2 on Hugging Face
 inference: false
 language:
 - en
@@ -37,7 +47,7 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
 
 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-13B-GGML)
-* [
+* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-13b-hf)
 
 ## Prompt template: None
 
@@ -56,7 +66,7 @@ Each separate quant is in a different branch. See below for instructions on fet
 | main | 4 | 128 | False | 7.26 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
 | gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
 | gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
-| gptq-4bit-128g-actorder_True | 4 | 128 | True |
+| gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
 
 ## How to download from branches
 
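The branch names in the table encode the quantisation parameters (bits, group size, Act Order). As a rough illustration only, the sketch below decodes a branch name into those three values. The naming pattern is inferred from the rows shown in this diff, and `QuantBranch`, `parse_branch` and `download_branch` are hypothetical helper names, not part of the README or of any library. `download_branch` additionally assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with the `revision` argument to select a branch.

```python
import re
from typing import NamedTuple


class QuantBranch(NamedTuple):
    """Quantisation parameters encoded in a branch name (hypothetical helper type)."""
    bits: int
    group_size: int
    act_order: bool


# Pattern assumed from the table rows, e.g. "gptq-4bit-32g-actorder_True".
_BRANCH_RE = re.compile(r"^gptq-(\d+)bit-(\d+)g-actorder_(True|False)$")


def parse_branch(name: str) -> QuantBranch:
    """Decode a branch name into (bits, group_size, act_order)."""
    if name == "main":
        # Per the table: main is 4-bit, group size 128, without Act Order.
        return QuantBranch(4, 128, False)
    match = _BRANCH_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised branch name: {name!r}")
    bits, group_size, act_order = match.groups()
    return QuantBranch(int(bits), int(group_size), act_order == "True")


def download_branch(branch: str, repo_id: str = "TheBloke/Llama-2-13B-GPTQ") -> str:
    """Fetch one branch of the repo; assumes huggingface_hub is installed."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    parse_branch(branch)  # fail early on a name that does not fit the pattern
    return snapshot_download(repo_id=repo_id, revision=branch)
```

For example, `parse_branch("gptq-4bit-64g-actorder_True")` yields 4 bits, group size 64, with Act Order enabled, matching the corresponding table row. Calling `download_branch` requires network access and is not needed just to inspect the names.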