TheBloke committed
Commit c3e22fd
1 Parent(s): 4e3bcb4

Initial GPTQ model commit

Files changed (1)
  1. README.md +12 -2
README.md CHANGED
@@ -1,4 +1,14 @@
---
+ extra_gated_button_content: Submit
+ extra_gated_description: This is a form to enable access to Llama 2 on Hugging Face
+   after you have been granted access from Meta. Please visit the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads)
+   and accept our license terms and acceptable use policy before submitting this form.
+   Requests will be processed in 1-2 days.
+ extra_gated_fields:
+   ? I agree to share my name, email address and username with Meta and confirm that
+     I have already been granted download access on the Meta website
+   : checkbox
+ extra_gated_heading: Access Llama 2 on Hugging Face
inference: false
language:
- en
@@ -37,7 +47,7 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for

* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-13B-GGML)
- * [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-13b-hf)
+ * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-13b-hf)

## Prompt template: None

@@ -56,7 +66,7 @@ Each separate quant is in a different branch. See below for instructions on fet
| main | 4 | 128 | False | 7.26 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
- | gptq-4bit-128g-actorder_True | 4 | 128 | True | TBC | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+ | gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |

## How to download from branches

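Each row of the "Provided files" table in this diff lives in its own branch of the repo. As a minimal sketch of what the "How to download from branches" section covers (not the README's own instructions), the snippet below fetches a single branch with huggingface_hub; the chosen branch, `local_dir`, and token handling are illustrative assumptions, not part of this commit.

```python
# Minimal sketch, not taken from this README: pull one quant branch with huggingface_hub.
# Branch names match the "Provided files" table above (e.g. main, gptq-4bit-64g-actorder_True).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="TheBloke/Llama-2-13B-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # branch chosen here as an example
    local_dir="Llama-2-13B-GPTQ",            # illustrative target directory
    token=None,  # supply a Hugging Face access token if gated access applies to your account
)
print("Model files downloaded to:", local_path)
```

The token argument ties back to the `extra_gated_*` frontmatter added in this commit: if access gating is enforced for the repo, downloads require an authenticated Hugging Face account that has been granted access.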