Best model I've found yet, but it completely stops at 4,096 tokens.

#1
by pwroff - opened

I'm using Llama 3 settings in LM Studio. My context length is set to 8192, as suggested in the model inspector.

Without knowing how many tokens you configured to generate, that is a pretty useless piece of information. Did you actually configure more than 4096 tokens?

mradermacher changed discussion status to closed

yeah, as I mentioned, I configured 8192.

You haven't mentioned it yet. Are you confusing this with the context length?

Yes, because the issue I have is with the total context length, not with a single response. The tokens-to-generate setting is at its default of -1.

I changed the RoPE base value to see if it would make a difference; it stopped a little further along but refuses to continue:

[screenshot attachment: image.png]

Well, I guess you haven't really described your issue then. What do you mean by "stops" vs. "refuses"? These seem to be very different behaviours. And the total context length is not the number of tokens you can generate, so I suspect a misunderstanding of what these parameters mean.

In any case, this is almost certainly not related to this model; maybe you should ask in an LM Studio support group how to set it up.

Let me rephrase it then: it returns an empty response once the current context is around 4000 tokens long. I'll check with LM Studio, but it's the only model I know of with this behaviour, which is why I was checking here whether there is something I could configure to get past that limit. Maybe it's a VRAM issue; I'll fiddle with it and see if anything changes.

That seems to be normal behaviour - you can't force an LLM to just make stuff up once it has finished generating its text. Very few models are trained to give long responses without prompting (usually only models specialised in story writing or similar tasks); chat models certainly are not. Or maybe your problem is still different, but just hitting "More" does not normally give you infinite output with any model. The context length has nothing to do with the length of responses.
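For readers wanting to see the two settings side by side: a minimal sketch of the distinction, assuming the llama-cpp-python bindings (which wrap the same llama.cpp backend LM Studio uses under the hood). The model path and prompt are placeholders, not taken from this thread.

```python
# Minimal sketch (not from this thread) using llama-cpp-python,
# the Python bindings for llama.cpp; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="model-IQ3_M.gguf",  # placeholder GGUF file
    n_ctx=8192,   # context window: the prompt plus everything generated must fit here
)

out = llm(
    "Summarise the plot of your favourite novel.",
    max_tokens=512,  # cap on tokens generated for THIS response -
                     # a separate setting from the context window above
)
print(out["choices"][0]["text"])
```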

Tell me if I'm wrong, but my understanding of context is: system prompt + all previous prompts and responses.

In my case I had:

system prompt + prompt + response + prompt + response ... and so on for a while. Then, at a context length of about 3900, I send a short prompt and the answer gets cut off mid-sentence once the total reaches around 4000 tokens, counting the whole context plus the current response.
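As a rough illustration of that accounting (again assuming llama-cpp-python; the conversation content below is invented), you can tokenize the accumulated history to see how close it sits to the configured window:

```python
# Rough illustration (invented conversation): count how many tokens the
# accumulated history occupies relative to the configured context window.
from llama_cpp import Llama

llm = Llama(model_path="model-IQ3_M.gguf", n_ctx=8192)  # placeholder path

system_prompt = "You are a helpful assistant."
turns = [
    ("First question ...", "First answer ..."),
    ("Second question ...", "Second answer ..."),
]

history = system_prompt + "".join(p + r for p, r in turns)
used = len(llm.tokenize(history.encode("utf-8")))
print(f"{used} of 8192 context tokens used by the conversation so far")
```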

That's a configuration problem then - if it cuts off mid-sentence, then it is almost certainly LM Studio cutting it off, not the model (which cannot really sense limits imposed by the software). I just tried the IQ3_M with a 16k context and a 10k prompt in llama.cpp and it works fine. It should work fine with LM Studio, too, because that just uses llama.cpp under the hood, unless I am mistaken. Again, you should contact an appropriate support forum for LM Studio.
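A comparable long-prompt sanity check can be scripted with the same bindings (sketch only; the file name is a placeholder and the repeated filler text just stands in for a long prompt):

```python
# Sketch of a similar long-prompt check via llama-cpp-python rather than the
# llama.cpp CLI; file name and filler prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="model-IQ3_M.gguf", n_ctx=16384)  # 16k context window

long_prompt = "lorem ipsum dolor sit amet " * 2000  # roughly 10k+ tokens of filler
out = llm(long_prompt + "\nRoughly how many times was the filler repeated?",
          max_tokens=256)
print(out["choices"][0]["text"])
```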

thanks for your time!
