superhot-30b-8k-4bit-128g-safetensors

Note: Maximum sequence length (max_seq_len) must be set to 8192 (or lower), and the compression factor (compress_pos_emb) must be set to 4.
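
A rough illustration of why these two values go together, assuming the compression factor is applied as linear position interpolation (each token position is divided by the factor before the rotary angles are computed) and that base LLaMA was trained on 2048 positions. The function name `compressed_positions` is hypothetical and only for this sketch:

```python
def compressed_positions(seq_len: int, compress_pos_emb: float) -> list[float]:
    """Scale integer token positions by 1 / compress_pos_emb.

    Assumption: the loader applies plain linear interpolation of positions.
    """
    return [pos / compress_pos_emb for pos in range(seq_len)]


# 8192 tokens with a factor of 4 stay inside the 0..2048 position range
# that the base model saw during training.
positions = compressed_positions(8192, 4.0)
max_position = max(positions)
print(max_position)  # 2047.75
```

With a factor of 4, the last of 8192 positions maps to 2047.75, which is still below 2048.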

Base LLaMA and the LoRA were merged with alpaca-lora: https://github.com/tloen/alpaca-lora

Base LLaMA 30B: https://huggingface.co/huggyllama/llama-30b

SuperHOT 30B 8k no-rlhf-test LoRA: https://huggingface.co/kaiokendev/superhot-30b-8k-no-rlhf-test

BASE_MODEL=huggyllama_llama-30b LORA=kaiokendev_superhot-30b-8k-no-rlhf-test python export_hf_checkpoint.py
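
Conceptually, merging a LoRA into a base model folds the low-rank update into each adapted weight matrix: W_merged = W + (alpha / r) * B A. The toy sketch below shows that arithmetic on a 2x2 matrix in plain Python. It is not the code in export_hf_checkpoint.py; the helper names and the example values are made up for illustration:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape: rank x in_features
lora_b = [[0.5], [0.25]]   # shape: out_features x rank
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.5, 2.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is what allows the result to be quantized as a single ordinary checkpoint.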

Quantized with AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ

python quant_with_alpaca.py --pretrained_model_dir superhot-30b-8k-safetensors --quantized_model_dir superhot-30b-8k-4bit-128g-safetensors --bits 4 --group_size 128 --desc_act --num_samples 256 --save_and_reload
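
To give a feel for what `--bits 4` and `--group_size 128` mean, here is a minimal sketch of group-wise 4-bit quantization using plain round-to-nearest. This is a simplification and not the GPTQ algorithm itself, which additionally uses calibration samples to reduce quantization error; all function names here are hypothetical:

```python
def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = 2 ** bits - 1                      # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize a weight row in independent groups of `group_size` values."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Toy row of 256 weights: two groups of 128, each with its own scale and offset.
row = [i / 255.0 for i in range(256)]
groups = quantize_row(row, group_size=128, bits=4)
restored = [v for codes, scale, lo in groups for v in dequantize_group(codes, scale, lo)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each group stores its own scale and offset, so a smaller group size tracks the weights more closely at the cost of more metadata per weight.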

Perplexity:

CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
         -d /workspace/models/superhot-30b-8k-4bit-128g-safetensors \
         -ppl \
         -ppl_ds datasets/wikitext2.txt \
         -l 8192 \
         -cpe 4 \
         -ppl_cn 40 \
         -ppl_cs 8192 \
         -ppl_ct 8192
 -- Perplexity:
 -- - Dataset: datasets/wikitext2.txt
 -- - Chunks: 40
 -- - Chunk size: 8192 -> 8192
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /workspace/models/superhot-30b-8k-4bit-128g-safetensors/tokenizer.model
 -- Model config: /workspace/models/superhot-30b-8k-4bit-128g-safetensors/config.json
 -- Model: /workspace/models/superhot-30b-8k-4bit-128g-safetensors/4bit-128g.safetensors
 -- Sequence length: 8192
 -- RoPE compression factor: 4.0
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity']
 ** Time, Load model: 4.31 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 17,043.70 MB
 -- Loading dataset...
 -- Testing 40 chunks....
 ** Perplexity: 4.6612
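
For reference, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that definition; whether test_benchmark_inference.py computes it in exactly this way is an assumption, and the function name is hypothetical:

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_log_probs = [math.log(0.25)] * 8
ppl = perplexity(uniform_log_probs)
```

Under this definition, the reported value of 4.6612 corresponds to a mean negative log-likelihood of roughly 1.54 nats per token.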