Text Generation · Transformers · llama · Inference Endpoints
bhenrym14 committed
Commit e2e4843
Parent: 1e5fbc7

Update README.md

Files changed (1): README.md (+1 -8)
README.md CHANGED
@@ -31,15 +31,8 @@ This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scal
  Each method will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384`. A monkeypatch can be found here.
 
 
-
-
-
- The easiest way is to use [oobabooga text-generation-webui](https://github.com/oobabooga/text-generation-webui) with ExLlama. You'll need to set max_seq_len to 8192 and compress_pos_emb to 4.
-
- If you wish to use AutoGPTQ/GPTQ-for-Llama instead, you'll need to patch in the appropriate RoPE scaling module. See: [replace_llama_rope_with_scaled_rope](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch.py)
-
  ## Motivation
- Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has been shown to be necessary to properly leverage the longer context. The SuperHOT LoRA is an adapter that has been fine-tuned on longer context (8192 tokens); even when applied to models trained on dissimilar datasets, it successfully extends the context window to which the model can attend. While it's impressive that this adapter is so flexible, how much does performance suffer relative to a model that has been fine-tuned with the scaled embeddings from the start? This is an experiment to explore this.
+ Methods of extending the useful context window of LLMs have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged; among these are linear position interpolation ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k), [Meta AI](https://arxiv.org/abs/2306.15595)) and NTK-aware scaling. My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
 
  ## Relative Performance (perplexity)
  | Model | Context (tokens) | Perplexity |
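
The hunk above describes swapping the stock rotary embedding for `LlamaPartNTKScaledRotaryEmbedding` (with `max_position_embeddings=16384`) via a monkeypatch such as `replace_llama_rope_with_scaled_rope`. The following is a minimal sketch of that patching pattern, not the repository's actual patch: it assumes the scaled embedding class has been vendored locally as a `scaled_rope` module (the linked `llama_rope_scaled_monkey_patch.py` is the authoritative version), and it only illustrates where the patch must be applied relative to model loading.

```python
# Minimal sketch, not the repo's actual monkeypatch. Assumption: the
# LlamaPartNTKScaledRotaryEmbedding class has been copied locally into a
# module named `scaled_rope` and is configured for max_position_embeddings=16384.
import transformers

from scaled_rope import LlamaPartNTKScaledRotaryEmbedding  # hypothetical local module


def replace_llama_rope_with_scaled_rope() -> None:
    """Globally swap LLaMA's rotary embedding class for the scaled variant.

    Must be called *before* the model is instantiated (from_pretrained /
    from_quantized), because the attention layers grab the class at build time.
    """
    transformers.models.llama.modeling_llama.LlamaRotaryEmbedding = (
        LlamaPartNTKScaledRotaryEmbedding
    )


if __name__ == "__main__":
    replace_llama_rope_with_scaled_rope()
    # Now load the checkpoint with transformers, AutoGPTQ, or GPTQ-for-Llama;
    # every LlamaAttention layer built from here on uses the scaled embedding.
```

Patching the module-level class is what lets the same approach cover AutoGPTQ/GPTQ-for-Llama loaders, which typically build the same `modeling_llama` layers under the hood; ExLlama instead takes the equivalent settings (`max_seq_len`, `compress_pos_emb`) directly, as noted in the removed lines of the diff.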