Model creating gibberish after a certain number of tokens.

#2
by AIGUYCONTENT - opened

Thank you for making this quant---it forced me to dig through my closet and find my old 4080 to push me across the threshold of being able to run it. I currently have 136GB of VRAM.
I got it loaded up in Oobabooga and had a conversation with it. The model seems much better and more coherent/intelligent than Q6 GGUF quants.

With that being said, I'm currently ~8k tokens into a conversation (writing business content for a website) with this model and it has shit the bed. It started producing gibberish and long blocks of text once I hit the 8k token mark. I know each model counts tokens differently...so give or take.

Do you have any suggestions on how to fix or what settings to use in Oobabooga? I normally use GGUFs...but I saw your 8.0bpw quant and I had to see if I could run it.

I also have an old 3080 collecting dust in another closet that I can throw into the mix if you feel I need more VRAM to run this model. That would bring me to 146GB of VRAM. Unsure if that would slow inference to a crawl, considering everything else is 3090s/a 4090 and one 4080.

thanks!

Hi! That's odd. This is the quant I use, and I have filled the context without that issue, aside from the usual degradation all models show at higher context. While I cannot perfectly replicate your setup from a hardware perspective (I use 3x A40s), I can try to see if I get the same thing by mirroring your other settings (exllamav2 options in ooba and samplers).

Thanks. Is there any info or screenshots you'd like me to provide?

I found your 8.0bpw quant to be far superior to the other GGUF/EXL2/AWQ quants out there---I mean this in terms of intelligence and overall ability to act like a human being and help me solve work problems.

Or could you tell me the settings you use? I'm using the Mistral template as well.

And wow....nice rig. I'm on a budget so I'm using my gaming 4090 and the old 4080 that I was supposed to sell on eBay. Trying to get my rig to where it will eventually run 10x3090s at once.

I wish I owned A40s but I don't. I spin up a RunPod session when I want to use one of my quants.
I use the Alpaca format for Mistral Magnum. Recently I have been trying Alpaca, Mistral, and Vicuna with every model and seeing which gives me the output I like best. I don't think the prompt format alone would be enough to make it output garbage after 8k, though.
Do you use the chat interface in ooba? That might be another difference between how I use it and how you do. I use SillyTavern as a front end.
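
For reference, the core markers of those three formats as I understand the common definitions (exact preambles and stop strings vary by frontend, so treat this as a rough sketch):

ALPACA  = "### Instruction:\n{prompt}\n\n### Response:\n"
MISTRAL = "[INST] {prompt} [/INST]"
VICUNA  = "USER: {prompt}\nASSISTANT: "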

(Attached screenshots: Screenshot_20240913_065852_Chrome.jpg, Screenshot_20240913_065902_Chrome.jpg, Screenshot_20240913_065913_Chrome.jpg)

The big ones are that I lower the context to 65536 to get it to fit, and I DO NOT use cache_8bit or cache_4bit. I have had nothing but trouble with those on output from EXL2 quants.
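
Roughly, those loader options map onto exllamav2's Python API like the sketch below (the model path is a placeholder and this assumes a recent exllamav2 build; ooba wires all of this up for you, it's just to show where max_seq_len and the cache type come in):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/magnum-123b-8.0bpw-exl2"   # placeholder path
config.prepare()
config.max_seq_len = 65536        # lowered from the model's native max so it fits in VRAM

model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, lazy=True)   # plain FP16 cache, i.e. no cache_8bit / cache_4bit
model.load_autosplit(cache)                # ooba's autosplit does roughly this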

I am at 12k of a test run and so far no gibberish. I am using my original quant. I am downloading the one from HF in case it got corrupted somehow.

Well, that's not it either: 15k and going strong.
Another difference could be that the only samplers I ever use are Temp and MinP. How creative vs. deterministic I need it to be determines the ratio between the two.
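
Continuing the sketch above (reusing model, cache, and tokenizer), the sampler side really is just those two knobs; the numbers here are illustrative, not a recommendation:

from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 1.0    # higher = more creative
settings.min_p = 0.05         # higher (e.g. 0.1) = more deterministic
# everything else (top_k, top_p, repetition penalty, ...) left at its default

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
print(generator.generate_simple("Write a tagline for a plumbing company.", settings, 200))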

Ok, do I need to trust remote code for this quant?

I'm currently re-downloading your quant. I also installed this: ExllamaV2 tensor parallelism for OOB V1.14 (https://github.com/RandomInternetPreson/TextGenTips?tab=readme-ov-file) and I'm really hoping to get it working with your quant. However, I have four 3090s, a 4090, and a 4080, so I'm unsure if it will work because I read somewhere that TP needs GPUs in multiples of four?

I also just discovered that I had a CUDA mismatch issue, which I have since solved. So, I wonder if that somehow contributed?

I will respond back to this thread after I'm finished re-downloading this quant and testing it out with your settings.

BTW....do you use a web search extension for Ooba? That and the lack of a PDF upload ability are pushing me to look at Aphrodite + Open WebUI. But it's a pain in the ass to install both of those....and changing models via Aphrodite is also a royal pain in the ass. I would prefer to run the new version of Tabby for EXL2 files because it supports the tensor-parallelism (TP) change that Turboderp recently made.

No, you don't need to trust remote code for this quant. I just tried it now without.
Here's my current configuration:
Ubuntu 22.04 LTS
CUDA 12.1.1
Python 3.11.9
Text Generation Web UI v1.13
Legacy API Extension
Torch 2.2.2

I haven't tried any web search extension. I know there is at least one for ST; I just haven't tried it.

Let me know how it goes!

So I was unable to get it working with CFG_CACHE turned on. But I had to leave and turned off the server. I'm back now and I cannot get it to load for anything. It goes CUDA OOM right here:

Edit: I forgot this isn't GitHub, so the table below is hard to read. The last 3090 only has 2.7GB used out of 24GB when it goes OOM.

me@pop-os:~$ nvidia-smi
Fri Sep 13 20:18:46 2024
NVIDIA-SMI 555.58.02    Driver Version: 555.58.02    CUDA Version: 12.5

GPU  Name                     Bus-Id            Fan   Temp  Pwr:Usage/Cap   Memory-Usage           GPU-Util
0    NVIDIA GeForce RTX 3090  00000000:01:00.0    0%   24C    28W / 250W    23658MiB / 24576MiB    16%
1    NVIDIA GeForce RTX 4090  00000000:02:00.0   80%   27C    77W / 250W    24108MiB / 24564MiB     2%
2    NVIDIA GeForce RTX 3090  00000000:45:00.0  100%   24C    49W / 295W    23221MiB / 24576MiB     0%
3    NVIDIA GeForce RTX 4080  00000000:81:00.0   80%   23C    20W / 275W    15429MiB / 16376MiB     0%
4    NVIDIA GeForce RTX 3090  00000000:C1:00.0  100%   32C   107W / 350W    23637MiB / 24576MiB     2%
5    NVIDIA GeForce RTX 3090  00000000:C2:00.0  100%   28C   127W / 420W     2729MiB / 24576MiB     0%

Already tried clearing the cache. Context is 7536 and nothing else is checked. I wonder if CFG_Cache really sucks down VRAM....because the model is ~100GB and I have 136GB of VRAM and some RAM to spare.

I just tried it with CFG_CACHE on and I get the same thing. Loading with it off, I have 33GB left at 65k context; turn it on and I am OOM.

Looks like it's an autosplit issue. If I manually enter the GPU split, saving room for context and the CFG cache, it works.
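
In exllamav2 terms the difference looks roughly like this (reusing config from the earlier sketch; the per-GPU gigabyte figures are just an example, not a recommendation):

model = ExLlamaV2(config)

# Option A - autosplit (what ooba's autosplit setting does, more or less):
# exllamav2 fills each GPU as far as it can, which can leave no headroom
# for extras like the CFG cache on the later devices.
#   cache = ExLlamaV2Cache(model, lazy=True)
#   model.load_autosplit(cache)

# Option B - manual split: reserve VRAM yourself, in GB per device, in order.
model.load(gpu_split=[19, 15, 22, 22, 23, 23])
cache = ExLlamaV2Cache(model)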

I turned off AUTO_Split and entered 19,15,22,22,23,23 and it loaded. However, I had to lower the context to 5096; otherwise it would go OOM.

Do you think this is the best I can do or should I play around with the GPU_Split numbers more? I tried everything, including 21,15,23,23,23,23 (the "15" is the 4080) and even then it went OOM.

And the other weird thing is that even though it loaded, it's not fully utilizing the VRAM.

Nvidia-SMI is reporting:

GPU1: 22/24GB
GPU2: 22/24GB
GPU3: 23/24GB
GPU4: 15/16GB
GPU5: 23/24GB
GPU6: 6/24GB

Shouldn't I be able to increase the context due to the VRAM that's not being used?

It's not that huge of an issue...and I can just delete the convo and start a new one when it gets too long.

I did some digging and I found this:

https://www.reddit.com/r/Oobabooga/comments/15wvyc2/context_length_in_llms_all_you_need_to_know/
https://www.reddit.com/r/LocalLLaMA/comments/14j4l7h/6000_tokens_context_with_exllama/

https://github.com/oobabooga/text-generation-webui/pull/2875
https://agi-sphere.com/context-length/
https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab

I don't have time this morning to dig deeper than that....but those threads are from ~1 year ago and they mention something about using compress_pos_emb or alpha_value (one or the other but not both) to extend context length based on some "discovery."

I have never used those parameters and my models have run just fine. I will be experimenting with both later today, but for now I'm stuck with 5k context on this model because that's what seems to work best. I know I'm leaving performance on the table with only a 5k context length....considering the model is ~100GB and I have 136GB of VRAM.

BTW...do you know how much to set aside for context and CFG_Cache?

I just got it to load successfully:

GPU_Split: 19,15,22,22,23,23
max_seq_len: 5036
alpha_value: 1.5

However, the last GPU is only using 6.5GB/24GB of VRAM.

I have never used Alpha_value before...but https://github.com/oobabooga/text-generation-webui/wiki/04-%E2%80%90-Model-Tab states:

Used to extend the context length of a model with a minor loss in quality. I have measured 1.75 to be optimal for 1.5x context, and 2.5 for 2x context. That is, with alpha = 2.5 you can make a model with 4096 context length go to 8192 context length.

A bit worried about the 'minor loss in quality' part. But this is the best I can do for now.
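
From the digging I did, alpha_value appears to be the NTK-aware RoPE scaling trick: it stretches the rotary embedding base rather than compressing the position ids (which is what compress_pos_emb does). My rough understanding of the math, with a typical head dimension of 128 (the real base and head size come from the model's config.json, so treat this as an assumption):

rope_base = 10000.0                     # typical default rotary base; model-dependent
head_dim = 128                          # typical head dimension; also model-dependent
alpha = 1.5
scaled_base = rope_base * alpha ** (head_dim / (head_dim - 2))
print(scaled_base)                      # ~15097: positions rotate more slowly, so usable context stretches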

Just tested it out for a bit...unsure if it's a bit dumber. I asked it to make two corrections to the grammar of 3 sentences. It kept fixing one and ignoring the other.

And I know that it's extremely challenging for models to continually adhere to strict grammar rules and a specific writing style.

I'm just trying to get your model working as well as I can.

I am now going to start teaching myself how to fine-tune. I have 5 years of content that I have written; I can feed that to the AI and hopefully be able to fine-tune the model on it.

Ultimate goal is to have the AI write like me and reduce the time it takes to write content. aka "Time is money."

Gotcha... My personal experience with Magnum is that it has a unique style, but it is 'dumber' than Mistral Large or Luminum.

As for VRAM utilization, they all seem to leave a 'buffer'; I never get more than 97% utilization on a 48GB GPU.

I don't know if you guys ever figured out a proper solution, but I too am having the issue where it starts to lose its mind after around 15k context. It's not just this quant; it's every fine-tune that has Magnum 123b in it. I found out it's probably because it was trained on 8192 ctx instead of the full 131072 (I think that was the amount). It's quite a damn shame, since I really, really love this model, but it being trained on such a low ctx sucks, especially when the original model supports far more. I just use the min_p preset along with the DRY sampler set at 0.8. This is on dual A100s with autosplit and nothing else.
https://huggingface.co/anthracite-org/magnum-v2-123b/discussions/8


This fascinates me. I literally used this 8bpw quant yesterday and the day before for about 6 hours and ran it all the way to 64k without issue... sure, it gets dumber fast past 32k, but it's still functional. It's definitely 'dumber' than all the other tunes, but I don't get gibberish. Personally, I only use temp and min_p...or sometimes smoothing by itself. I also don't use Mistral prompts on Mistral models; Vicuna or Alpaca instead. Not sure if any of that makes a difference.
No doubt this model has a writing style all its own. Some of the stuff has made me laugh out loud.
