
superhot-7b-8k-4bit-32g-safetensors

Note: Set the maximum sequence length (max_seq_len) to 8192 (or lower) and the compression factor (compress_pos_emb) to 4.
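
Why 4: base LLaMA 7B was trained with a 2048-token context, and 8192 / 2048 = 4. The sketch below assumes compress_pos_emb acts as a linear scale on RoPE positions (the log further down calls it the "RoPE compression factor"). The function names, the head dimension of 128 and the RoPE base of 10000 are illustrative assumptions, not code from any loader.

```python
def compression_factor(target_len: int, base_len: int = 2048) -> float:
    """Ratio between the extended context and the original training context."""
    return target_len / base_len


def rope_angles(position: int, head_dim: int = 128,
                compress_pos_emb: float = 1.0, base: float = 10000.0) -> list:
    """Rotary angles for one position, with the position divided by the
    compression factor before the usual RoPE frequencies are applied."""
    scaled = position / compress_pos_emb
    return [scaled / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


factor = compression_factor(8192)  # 4.0

# With a factor of 4, position 8188 gets the same angles that position 2047
# gets without compression, so positions stay inside the trained range.
same = rope_angles(8188, compress_pos_emb=factor) == rope_angles(2047)
```

Under that assumption, a lower max_seq_len with the same factor still works, since the scaled positions remain below 2048.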

Merged the base LLaMA weights and the LoRA with alpaca-lora: https://github.com/tloen/alpaca-lora

Base LLaMA 7B: https://huggingface.co/huggyllama/llama-7b

SuperHOT 7B 8k no-rlhf-test LoRA: https://huggingface.co/kaiokendev/superhot-7b-8k-no-rlhf-test

BASE_MODEL=huggyllama_llama-7b LORA=kaiokendev_superhot-7b-8k-no-rlhf-test python export_hf_checkpoint.py
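
For readers unfamiliar with what merging a LoRA means: the low-rank update is folded into each adapted weight matrix, W' = W + (alpha / r) * B * A, so the merged checkpoint no longer needs the adapter at load time. The following is a minimal pure-Python sketch of that formula on tiny matrices. It is not the code of export_hf_checkpoint.py, and the function name and example values are made up for illustration.

```python
def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A).

    W: out_dim x in_dim base weight (list of rows)
    B: out_dim x r, A: r x in_dim low-rank LoRA factors
    """
    scale = alpha / r
    out_dim, in_dim = len(W), len(W[0])
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(in_dim)]
        for i in range(out_dim)
    ]


# Toy example with rank r = 1 (illustrative values only).
W = [[1, 0], [0, 1]]
B = [[1], [2]]
A = [[3, 4]]
merged = merge_lora(W, A, B, alpha=2, r=1)
```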

Quantized with AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ

python quant_with_alpaca.py --pretrained_model_dir superhot-7b-8k-safetensors --quantized_model_dir superhot-7b-8k-no-rlhf-test-32g-GPTQ --bits 4 --group_size 32 --desc_act --num_samples 256 --save_and_reload
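
To illustrate what --bits 4 and --group_size 32 control: each run of 32 weights shares one scale and one offset, and every weight is stored as a 4-bit code (0 to 15). The sketch below uses plain round-to-nearest quantization, which is simpler than the GPTQ algorithm that AutoGPTQ actually runs, and it does not model --desc_act. All names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group to integer codes."""
    levels = (1 << bits) - 1              # 15 for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_rowwise(row, group_size=32, bits=4):
    """Split a weight row into groups and quantize each one independently."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


row = [i / 63 for i in range(64)]         # toy weights, two groups of 32
groups = quantize_rowwise(row, group_size=32, bits=4)
```

A smaller group size means more scales and offsets to store, in exchange for a tighter fit per group.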

Perplexity:

CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
         -d /workspace/models/superhot-7b-8k-no-rlhf-test-32g-GPTQ \
         -ppl \
         -ppl_ds datasets/wikitext2.txt \
         -l 8192 \
         -cpe 4 \
         -ppl_cn 40 \
         -ppl_cs 8192 \
         -ppl_ct 8192
 -- Perplexity:
 -- - Dataset: datasets/wikitext2.txt
 -- - Chunks: 40
 -- - Chunk size: 8192 -> 8192
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /workspace/models/superhot-7b-8k-no-rlhf-test-32g-GPTQ/tokenizer.model
 -- Model config: /workspace/models/superhot-7b-8k-no-rlhf-test-32g-GPTQ/config.json
 -- Model: /workspace/models/superhot-7b-8k-no-rlhf-test-32g-GPTQ/4bit-32g.safetensors
 -- Sequence length: 8192
 -- RoPE compression factor: 4.0
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity']
 ** Time, Load model: 1.64 seconds
 ** Time, Load tokenizer: 0.02 seconds
 -- Groupsize (inferred): 32
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 4,131.34 MB
 -- Loading dataset...
 -- Testing 40 chunks....
 ** Perplexity: 6.3184
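
For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below assumes the benchmark script follows that usual definition; the function name and the toy inputs are illustrative, and this is not the script's own code.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy check: a model that gives every token probability 1/4 has perplexity 4.
toy = perplexity([math.log(0.25)] * 4)

# Under this definition, the reported 6.3184 corresponds to a mean loss of
# about 1.84 nats per token.
mean_loss = math.log(6.3184)

# 40 chunks of 8192 tokens is at most 327,680 tokens of input.
max_tokens = 40 * 8192
```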