---
license: other
license_name: yi-license
license_link: https://huggingface.co/01-ai/Yi-34B/blob/main/LICENSE
language:
- en
library_name: transformers
base_model: []
tags:
- mergekit
- merge
- Yi
- exllama
- exllamav2
- exl2
---
## Warning
This quant is cursed and most likely not 4.65bpw, even though I ran the standard script to do the quant. I am still investigating it. It does work, though! Apologies for the inconvenience.
---
# RPMerge
A merge of several Yi 34B models with a singular goal: 40K+ context, instruct-enhanced storytelling.
Disappointed with some quirks of my previous kitchen sink merges (like token/instruct formats from various models showing up when they shouldn't), I've gone 'back to the basics' and picked a few Vicuna-format only models:
- [DrNicefellow/ChatAllInOne-Yi-34B-200K-V1](https://huggingface.co/DrNicefellow/ChatAllInOne-Yi-34B-200K-V1) and [migtissera/Tess-34B-v1.5b](https://huggingface.co/migtissera/Tess-34B-v1.5b) both have excellent general instruction-following performance.
- [cgato/Thespis-34b-v0.7](https://huggingface.co/cgato/Thespis-34b-v0.7) is trained on the "Username: {Input} / BotName: {Response}" format. I included it to emphasize that format in the merge (but not force it). It also seems to work for multi-character stories.
- [Doctor-Shotgun/limarpv3-yi-llama-34b-lora](https://huggingface.co/Doctor-Shotgun/limarpv3-yi-llama-34b-lora) is trained on roleplaying data, but merged at a modest weight so as not to overemphasize it. This is the only non-Vicuna model (being Alpaca format), but it doesn't seem to interfere with the Vicuna format or adversely affect long-context perplexity.
- [adamo1139/yi-34b-200k-rawrr-dpo-2](https://huggingface.co/adamo1139/yi-34b-200k-rawrr-dpo-2) is the base for the limarp lora. It is base Yi gently finetuned to discourage refusals.
- [migtissera/Tess-M-Creative-v1.0](https://huggingface.co/migtissera/Tess-M-Creative-v1.0) and [NousResearch/Nous-Capybara-34B](https://huggingface.co/NousResearch/Nous-Capybara-34B) are both "undertrained" Yi models. I find they excel at raw completion performance (like long novel continuations) while still retaining some Vicuna instruct ability. This may be why some still prefer the original Tess 1.0/Capybara merge.
I consider this a more "focused" merge than previous ones. I will investigate other models (perhaps chatML models?) for a more "factual assistant" focused merge, as well as a coding-focused merge if I can't find one to suit my needs.
## Prompt template: Orca-Vicuna
```
SYSTEM: {system_message}
USER: {prompt}
ASSISTANT:
```
Raw prompting as described here is also effective: https://old.reddit.com/r/LocalLLaMA/comments/18zqy4s/the_secret_to_writing_quality_stories_with_llms/
As well as a very explicit system prompt like this: https://old.reddit.com/r/LocalLLaMA/comments/1aiz6zu/roleplaying_system_prompts/koygiwa/
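For scripted use, the Orca-Vicuna template above can be assembled with a small helper. This is a minimal sketch: the function name `build_orca_vicuna_prompt` is hypothetical, and dropping the `SYSTEM:` line when no system message is given is an assumption, not something this card specifies.

```python
def build_orca_vicuna_prompt(prompt, system_message=""):
    """Format a single turn using the Orca-Vicuna template from this card."""
    lines = []
    # Assumption: skip the SYSTEM line entirely when no system message is given.
    if system_message:
        lines.append(f"SYSTEM: {system_message}")
    lines.append(f"USER: {prompt}")
    # The trailing "ASSISTANT:" is left open for the model to complete.
    lines.append("ASSISTANT:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_orca_vicuna_prompt("Continue the story.", "You are a novelist."))
```

For raw prompting (novel-style continuation), you would skip the helper and feed the story text directly.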
## Running
Chinese models with large tokenizer vocabularies like Yi need *careful* parameter tuning due to their huge logit sampling "tails." Yi in particular also runs relatively "hot" even at lower temperatures.
I am a huge fan of Kalomaze's quadratic sampling (shown as "smoothing factor" where available), as described here: https://github.com/oobabooga/text-generation-webui/pull/5403
Otherwise, I recommend a lower temperature with 0.1 or higher MinP, a little repetition penalty, mirostat with a low tau, and no other samplers. See the explanation here: https://github.com/ggerganov/llama.cpp/pull/3841
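To illustrate why MinP helps with long logit "tails", here is a rough, dependency-free sketch of the idea: tokens whose probability falls below `min_p` times the top token's probability are discarded before sampling. The function name `min_p_filter` is hypothetical, and applying temperature before the filter is an assumption; real backends differ in sampler order and implementation.

```python
import math


def min_p_filter(logits, min_p=0.1, temperature=1.0):
    """Return renormalized token probabilities after min-p filtering."""
    # Assumption: temperature is applied first; some backends apply it last.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Numerically stable softmax.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Drop every token below min_p * (probability of the most likely token).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


if __name__ == "__main__":
    # The third token sits far out in the tail and gets removed.
    print(min_p_filter([5.0, 4.0, 0.0], min_p=0.1))
```

Because the cutoff scales with the top token's probability, the filter prunes aggressively when the model is confident and stays permissive when the distribution is flat.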
@MarinaraSpaghetti has extensively tested the model and recommended the following settings. They seem to work quite well:
```
{
    "temp": 1,
    "temperature_last": true,
    "top_p": 1,
    "top_k": 0,
    "top_a": 0,
    "tfs": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "typical_p": 0.9,
    "min_p": 0,
    "rep_pen": 1.1,
    "rep_pen_range": 19456,
    "no_repeat_ngram_size": 0,
    "penalty_alpha": 0,
    "num_beams": 1,
    "length_penalty": 0,
    "min_length": 0,
    "encoder_rep_pen": 1,
    "freq_pen": 0,
    "presence_pen": 0,
    "do_sample": true,
    "early_stopping": false,
    "dynatemp": false,
    "min_temp": 1,
    "max_temp": 2,
    "dynatemp_exponent": 1,
    "smoothing_factor": 0.33,
    "add_bos_token": false,
    "truncation_length": 2048,
    "ban_eos_token": false,
    "skip_special_tokens": true,
    "streaming": true,
    "mirostat_mode": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "guidance_scale": 1,
    "negative_prompt": "",
    "grammar_string": "",
    "banned_tokens": "",
    "ignore_eos_token_aphrodite": false,
    "spaces_between_special_tokens_aphrodite": true,
    "sampler_order": [
        6,
        0,
        1,
        3,
        4,
        2,
        5
    ],
    "logit_bias": [],
    "n": 1,
    "rep_pen_size": 0,
    "genamt": 400,
    "max_length": 38912
}
```
24GB GPUs can efficiently run Yi-34B-200K models at **40K-90K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/). Empty 16GB GPUs can still run the high context with aggressive quantization.
To load/train this in full-context backends like transformers, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM! I do not recommend running high context without context-efficient backends that support flash attention + 8 bit kv cache, like exllamav2, litellm, vllm or unsloth.
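The `max_position_embeddings` change can be made by hand, or scripted along these lines. This is only a sketch: the helper names, the example path, and the value `32768` are illustrative assumptions, so pick a limit that fits your own memory budget.

```python
import json
from pathlib import Path


def cap_context(config, max_positions):
    """Return a copy of a config dict with max_position_embeddings capped."""
    patched = dict(config)
    current = patched.get("max_position_embeddings", max_positions)
    # Only ever lower the value; never raise it above what the file declares.
    patched["max_position_embeddings"] = min(current, max_positions)
    return patched


def patch_config_file(path, max_positions):
    """Rewrite a config.json in place with the capped context length."""
    path = Path(path)
    config = json.loads(path.read_text())
    path.write_text(json.dumps(cap_context(config, max_positions), indent=2) + "\n")


if __name__ == "__main__":
    # Hypothetical local path and limit; adjust both for your setup.
    # patch_config_file("RPMerge/config.json", 32768)
    print(cap_context({"max_position_embeddings": 200000}, 32768))
```

Keeping a backup of the original config.json makes it easy to restore the full context length for backends that can handle it.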
## Testing Notes
Thanks to ParasiticRogue for this idea of a Vicuna-only merge, see: https://huggingface.co/brucethemoose/jondurbin_bagel-dpo-34b-v0.2-exl2-4bpw-fiction/discussions
See: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megamerge-v8#testing-notes
This is a possible base for a storytelling finetune/LASER in the future, once I can bite the bullet and rent some A100s or an MI300.
I have tested this merge with novel-style continuation (but not much chat-style roleplay), and some assistant-style responses and long context analysis. I haven't seen any refusals so far.
## Merge Details
### Merge Method
This model was merged using the [DARE](https://arxiv.org/abs/2311.03099) [TIES](https://arxiv.org/abs/2306.01708) merge method using /home/alpha/Models/Raw/chargoddard_Yi-34B-200K-Llama as a base.
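As a rough intuition for what a DARE + TIES merge does with the `weight` and `density` parameters, here is a heavily simplified sketch on flat lists of numbers. It is not mergekit's implementation: the function name `dare_ties_merge` is hypothetical, and it ignores normalization, `int8_mask`, and per-tensor handling.

```python
import random


def dare_ties_merge(base, models, seed=0):
    """Merge flat parameter lists.

    models: list of (params, weight, density) tuples.
    """
    rng = random.Random(seed)
    weighted_deltas = []
    for params, weight, density in models:
        delta = []
        for b, p in zip(base, params):
            # DARE: keep each delta with probability `density`,
            # then rescale survivors by 1/density.
            keep = rng.random() < density
            delta.append(weight * (p - b) / density if keep else 0.0)
        weighted_deltas.append(delta)

    merged = []
    for i, b in enumerate(base):
        column = [delta[i] for delta in weighted_deltas]
        # TIES-style sign election: the sign of the summed deltas wins,
        # and only deltas agreeing with that sign are added to the base.
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreed = sum(v for v in column if v * sign > 0)
        merged.append(b + agreed)
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0]
    models = [([1.0, -1.0], 0.5, 1.0), ([1.0, 2.0], 0.5, 1.0)]
    print(dare_ties_merge(base, models))
```

In this toy example the second parameter has conflicting deltas, so the minority-sign contribution is discarded rather than averaged in.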
### Models Merged
The following models were included in the merge:
* /home/alpha/Models/Raw/migtissera_Tess-34B-v1.5b
* /home/alpha/Models/Raw/migtissera_Tess-M-Creative-v1.0
* /home/alpha/Models/Raw/cgato_Thespis-34b-DPO-v0.7
* /home/alpha/Models/Raw/Nous-Capybara-34B
* /home/alpha/Models/Raw/admo_limarp
* /home/alpha/Models/Raw/DrNicefellow_ChatAllInOne-Yi-34B-200K-V1
### Configuration
The following YAML configuration was used to produce this model:
```yaml
models:
  - model: /home/alpha/Models/Raw/chargoddard_Yi-34B-200K-Llama
    # No parameters necessary for base model
  - model: /home/alpha/Models/Raw/migtissera_Tess-34B-v1.5b
    #Emphasize the beginning of Vicuna format models
    parameters:
      weight: 0.19
      density: 0.59
  - model: /home/alpha/Models/Raw/Nous-Capybara-34B
    parameters:
      weight: 0.19
      density: 0.55
  # Vicuna format
  - model: /home/alpha/Models/Raw/migtissera_Tess-M-Creative-v1.0
    parameters:
      weight: 0.05
      density: 0.55
  - model: /home/alpha/Models/Raw/DrNicefellow_ChatAllInOne-Yi-34B-200K-V1
    parameters:
      weight: 0.19
      density: 0.55
  - model: adamo1139/yi-34b-200k-rawrr-dpo-2+Doctor-Shotgun/limarpv3-yi-llama-34b-lora
    parameters:
      weight: 0.19
      density: 0.48
  - model: /home/alpha/Models/Raw/cgato_Thespis-34b-DPO-v0.7
    parameters:
      weight: 0.19
      density: 0.59
merge_method: dare_ties
tokenizer_source: union
base_model: /home/alpha/Models/Raw/chargoddard_Yi-34B-200K-Llama
parameters:
  int8_mask: true
dtype: bfloat16
```
## Self Promotion
I'm part of an AI startup called Holocene AI!
We're new, busy, and still setting things up. But if you have any business inquiries, want a job, or just want some consultation, feel free to shoot me an email. We have expertise in RAG applications and llama/embeddings model finetuning, and absolutely *none* of the nonsense of scammy AI startups.
Contact me at: [email protected]
I also set up a Ko-Fi! I want to run some (personal) training/LASERing as well, at 100K context or so. If you'd like to buy me 10 minutes on an A100 (or 5 seconds on an MI300X), I'd appreciate it: https://ko-fi.com/alphaatlas