Text Generation · Transformers · llama · Inference Endpoints
bhenrym14 committed
Commit e2e4843
Parent: 1e5fbc7

Update README.md

Files changed (1): README.md (+1 -8)
README.md CHANGED
@@ -31,15 +31,8 @@ This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scal
  Each method will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384`. A monkeypatch can be found here.
 
 
-
-
-
- The easiest way is to use [oobabooga text-generation-webui](https://github.com/oobabooga/text-generation-webui) with ExLlama. You'll need to set max_seq_len to 8192 and compress_pos_emb to 4.
-
- If you wish to use AutoGPTQ/GPTQ-for-Llama instead, you'll need to patch in the appropriate RoPE scaling module. See: [replace_llama_rope_with_scaled_rope](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch.py)
-
  ## Motivation
- Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has been shown to be necessary to properly leverage the longer context. The SuperHOT LoRA is an adapter that has been fine-tuned on longer context (8192 tokens); even when applied to models trained on dissimilar datasets, it successfully extends the context window to which the model can attend. While it's impressive that this adapter is so flexible, how much does performance suffer relative to a model that has been fine-tuned with the scaled embeddings from the start? This is an experiment to explore this.
+ Methods of extending the useful context window of LLMs have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged; among these are linear position interpolation ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k), [Meta AI](https://arxiv.org/abs/2306.15595)) and NTK-aware scaling. My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
 
  ## Relative Performance (perplexity)
  | Model | Context (tokens) | Perplexity |
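
The hunk above describes swapping the stock rotary embedding for `LlamaPartNTKScaledRotaryEmbedding` (with `max_position_embeddings=16384`) via a monkeypatch such as `replace_llama_rope_with_scaled_rope`. The following is a minimal sketch of that patching pattern, not the repository's actual patch: it assumes the scaled embedding class has been vendored locally as a `scaled_rope` module (the linked `llama_rope_scaled_monkey_patch.py` is the authoritative version), and it only illustrates where the patch must be applied relative to model loading.

```python
# Minimal sketch, not the repo's actual monkeypatch. Assumption: the
# LlamaPartNTKScaledRotaryEmbedding class has been copied locally into a
# module named `scaled_rope` and is configured for max_position_embeddings=16384.
import transformers

from scaled_rope import LlamaPartNTKScaledRotaryEmbedding  # hypothetical local module


def replace_llama_rope_with_scaled_rope() -> None:
    """Globally swap LLaMA's rotary embedding class for the scaled variant.

    Must be called *before* the model is instantiated (from_pretrained /
    from_quantized), because the attention layers grab the class at build time.
    """
    transformers.models.llama.modeling_llama.LlamaRotaryEmbedding = (
        LlamaPartNTKScaledRotaryEmbedding
    )


if __name__ == "__main__":
    replace_llama_rope_with_scaled_rope()
    # Now load the checkpoint with transformers, AutoGPTQ, or GPTQ-for-Llama;
    # every LlamaAttention layer built from here on uses the scaled embedding.
```

Patching the module-level class is what lets the same approach cover AutoGPTQ/GPTQ-for-Llama loaders, which typically build the same `modeling_llama` layers under the hood; ExLlama instead takes the equivalent settings (`max_seq_len`, `compress_pos_emb`) directly, as noted in the removed lines of the diff.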