---
license: other
---
# superhot-30b-8k-4bit-128g-safetensors

**Note: Maximum sequence length (`max_seq_len`) must be set to 8192 (or lower) and the compression factor (`compress_pos_emb`) to 4.**
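
For example, with ExLlama (which the perplexity test below uses), these correspond to the `max_seq_len` and `compress_pos_emb` settings on the model config. A minimal loading sketch, assuming ExLlama's Python API and the model paths shown in the benchmark log below:

``` python
# Minimal sketch, assuming ExLlama's Python API (model.py in the ExLlama repo);
# paths are placeholders, taken from the benchmark log below.
from model import ExLlama, ExLlamaCache, ExLlamaConfig

model_dir = "/workspace/models/superhot-30b-8k-4bit-128g-safetensors"

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/4bit-128g.safetensors"
config.max_seq_len = 8192      # maximum sequence length: 8192 or lower
config.compress_pos_emb = 4.0  # RoPE compression factor must be 4

model = ExLlama(config)
cache = ExLlamaCache(model)
```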

Merged the base LLaMA model and the LoRA with alpaca-lora:
https://github.com/tloen/alpaca-lora

Base LLaMA 30B:
https://huggingface.co/huggyllama/llama-30b

SuperHOT 30B 8k no-rlhf-test LoRA:
https://huggingface.co/kaiokendev/superhot-30b-8k-no-rlhf-test

``` sh
BASE_MODEL=huggyllama_llama-30b LORA=kaiokendev_superhot-30b-8k-no-rlhf-test python export_hf_checkpoint.py
```
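
Under the hood, the export script loads the base model, applies the LoRA, and folds the adapter weights back into the base. A rough equivalent using transformers and peft (a sketch, not the exact script):

``` python
# Rough equivalent of export_hf_checkpoint.py using transformers + peft;
# a sketch, not the exact script.
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "huggyllama/llama-30b", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
lora = PeftModel.from_pretrained(base, "kaiokendev/superhot-30b-8k-no-rlhf-test")

merged = lora.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("superhot-30b-8k-safetensors", safe_serialization=True)
```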

Quantized with AutoGPTQ:
https://github.com/PanQiWei/AutoGPTQ

``` sh
python quant_with_alpaca.py --pretrained_model_dir superhot-30b-8k-safetensors --quantized_model_dir superhot-30b-8k-4bit-128g-safetensors --bits 4 --group_size 128 --desc_act --num_samples 256 --save_and_reload
```
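
This boils down to AutoGPTQ's standard quantize flow with 4 bits, group size 128, and act-order (`desc_act`). A minimal sketch with the same settings; the calibration example here is a placeholder (quant_with_alpaca.py samples 256 examples from the alpaca dataset):

``` python
# Minimal AutoGPTQ sketch matching the command above; the single calibration
# example is a placeholder (the real script samples 256 alpaca examples).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "superhot-30b-8k-safetensors"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)
model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)

examples = [tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("superhot-30b-8k-4bit-128g-safetensors", use_safetensors=True)
```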

Perplexity:
```
CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
         -d /workspace/models/superhot-30b-8k-4bit-128g-safetensors \
         -ppl \
         -ppl_ds datasets/wikitext2.txt \
         -l 8192 \
         -cpe 4 \
         -ppl_cn 40 \
         -ppl_cs 8192 \
         -ppl_ct 8192
 -- Perplexity:
 -- - Dataset: datasets/wikitext2.txt
 -- - Chunks: 40
 -- - Chunk size: 8192 -> 8192
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /workspace/models/superhot-30b-8k-4bit-128g-safetensors/tokenizer.model
 -- Model config: /workspace/models/superhot-30b-8k-4bit-128g-safetensors/config.json
 -- Model: /workspace/models/superhot-30b-8k-4bit-128g-safetensors/4bit-128g.safetensors
 -- Sequence length: 8192
 -- RoPE compression factor: 4.0
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity']
 ** Time, Load model: 4.31 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 17,043.70 MB
 -- Loading dataset...
 -- Testing 40 chunks....
 ** Perplexity: 4.6612
```
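
For reference, the reported perplexity is the exponential of the mean per-token negative log-likelihood over the 40 test chunks. A generic sketch of the computation (not ExLlama's exact implementation):

``` python
# Perplexity = exp(mean negative log-likelihood per token).
# A generic sketch of what the benchmark reports, not ExLlama's implementation.
import math

def perplexity(token_nlls: list[float]) -> float:
    """token_nlls: negative log-likelihood (in nats) of each predicted token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([1.2, 1.8, 1.5]))  # exp(1.5) ~= 4.48 for this toy input
```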