README.md · flashvenom/Airoboros-13B-SuperHOT-8K-4bit-GPTQ at main

This is a 4-bit GPTQ quantization of Airoboros-13B-SuperHOT, converted using GPTQ-for-LLaMa. The source model is https://huggingface.co/Peeepy/Airoboros-13b-SuperHOT-8k.

This uses the Airoboros-13B (v1.2) model and applies the SuperHOT 8K LoRA on top, which improves coherence at larger context lengths and makes Airoboros output more verbose.

You will need a monkey patch at inference time to use the 8K context; see the patch file in this repo. If you are using a different inference engine (such as llama.cpp or exllama), you will need to add the monkey patch there.

Note: If you are using exllama, the monkey patch is built into the engine. Use `-cpe` to set the scaling factor; for example, to run at 4K context, pass `-cpe 2 -l 4096`.
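The `-cpe 2 -l 4096` example suggests the scaling factor is the target context length divided by the base model's 2048-token context. Here is a minimal sketch of that arithmetic, assuming this relationship holds for other lengths too; `compression_factor` and `exllama_flags` are hypothetical helper names, not part of exllama.

```python
def compression_factor(target_ctx: int, base_ctx: int = 2048) -> float:
    """Ratio of the desired context length to the base context length.

    base_ctx=2048 is an assumption (the original LLaMA context length).
    """
    return target_ctx / base_ctx


def exllama_flags(target_ctx: int, base_ctx: int = 2048) -> str:
    """Format the -cpe / -l flags from the note above for a given context length."""
    factor = compression_factor(target_ctx, base_ctx)
    # ":g" drops the trailing ".0" so 2.0 is rendered as "2"
    return f"-cpe {factor:g} -l {target_ctx}"


print(exllama_flags(4096))  # -cpe 2 -l 4096
print(exllama_flags(8192))  # -cpe 4 -l 8192
```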

The patch file is included in this repo and can also be accessed here: https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test/raw/main/llama_rope_scaled_monkey_patch.py
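For intuition only, here is a conceptual sketch of what a scaled-RoPE patch of this kind does. It is not the contents of the patch file. It assumes linear position scaling and a 2048-token base context; `rope_angles` and its parameters are hypothetical names chosen for illustration.

```python
def rope_angles(position: int, dim: int = 8, base: float = 10000.0, scale: float = 1.0):
    """Rotary embedding angles for one token position.

    scale < 1 compresses positions so that long sequences map back into
    the position range the base model saw during training (assumed 0..2047).
    """
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    return [position * scale * f for f in inv_freq]


# With scale = 2048 / 8192 = 0.25, position 8188 produces the same angles
# as unscaled position 2047, which lies inside the assumed training range.
scaled = rope_angles(8188, scale=2048 / 8192)
unscaled = rope_angles(2047)
print(scaled == unscaled)
```

The scale here is the reciprocal of the factor passed to exllama's `-cpe`, under the same assumptions.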
