Best model I've found yet, but it completely stops at 4,096 tokens.

#1
by pwroff - opened

I'm using Llama 3 settings in LM Studio. My context length is set to 8192, as suggested in the model inspector.

Without knowing how many tokens you configured to generate, that is a pretty useless piece of information. Did you actually configure more than 4096 tokens?

mradermacher changed discussion status to closed

yeah, as I mentioned, I configured 8192.

You haven't mentioned it yet. Are you confusing this with the context length?

Yes, because the issue I have is with the total context length, not with a single response. The tokens-to-generate setting is at its default of -1.

I changed the RoPE base value to see if it would make a difference; it stopped a little further along but refuses to continue:

[screenshot attachment: image.png]

Well, I guess you haven't really described your issue then. What do you mean by "stops" vs. "refuses"? These seem to be very different behaviours. And the total context length is not the number of tokens you can generate, so I suspect a misunderstanding of what these parameters mean.

In any case, this is almost certainly not related to this model; maybe you should ask in an LM Studio support group how to set it up.

Let me rephrase it then: it returns an empty response once the current context is around 4000 tokens long. I'll check with LM Studio, but it's the only model I know of with this behaviour, which is why I was checking here whether there is something I could configure to get past that limit. Maybe it's a VRAM issue; I'll fiddle with it and see if anything changes.

That seems to be normal behaviour - you can't force an LLM to just make stuff up once it has finished generating its text. Very few models are trained to give long responses without prompting (usually only models specialised in story writing or similar tasks); chat models certainly are not. Or maybe your problem is still different, but just hitting "More" does not normally give you infinite output with any model. The context length has nothing to do with the length of responses.
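For readers wanting to see the two settings side by side: a minimal sketch of the distinction, assuming the llama-cpp-python bindings (which wrap the same llama.cpp backend LM Studio uses under the hood). The model path and prompt are placeholders, not taken from this thread.

```python
# Minimal sketch (not from this thread) using llama-cpp-python,
# the Python bindings for llama.cpp; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="model-IQ3_M.gguf",  # placeholder GGUF file
    n_ctx=8192,   # context window: the prompt plus everything generated must fit here
)

out = llm(
    "Summarise the plot of your favourite novel.",
    max_tokens=512,  # cap on tokens generated for THIS response -
                     # a separate setting from the context window above
)
print(out["choices"][0]["text"])
```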

Tell me if I'm wrong, but my understanding of context is: system prompt + all previous prompts and responses.

In my case I had:

system prompt + prompt + response + prompt + response ... and so on for a while. Then, at a context length of about 3900, I send a short prompt and the answer gets cut off mid-sentence once the total reaches around 4000 tokens, counting the whole context plus the current response.
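As a rough illustration of that accounting (again assuming llama-cpp-python; the conversation content below is invented), you can tokenize the accumulated history to see how close it sits to the configured window:

```python
# Rough illustration (invented conversation): count how many tokens the
# accumulated history occupies relative to the configured context window.
from llama_cpp import Llama

llm = Llama(model_path="model-IQ3_M.gguf", n_ctx=8192)  # placeholder path

system_prompt = "You are a helpful assistant."
turns = [
    ("First question ...", "First answer ..."),
    ("Second question ...", "Second answer ..."),
]

history = system_prompt + "".join(p + r for p, r in turns)
used = len(llm.tokenize(history.encode("utf-8")))
print(f"{used} of 8192 context tokens used by the conversation so far")
```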

That's a configuration problem then - if it cuts off mid-sentence, then it is almost certainly LM Studio cutting it off, not the model (which cannot really sense limits imposed by the software). I just tried the IQ3_M with a 16k context and a 10k prompt in llama.cpp and it works fine. It should work fine with LM Studio, too, because that just uses llama.cpp under the hood, unless I am mistaken. Again, you should contact an appropriate support forum for LM Studio.
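A comparable long-prompt sanity check can be scripted with the same bindings (sketch only; the file name is a placeholder and the repeated filler text just stands in for a long prompt):

```python
# Sketch of a similar long-prompt check via llama-cpp-python rather than the
# llama.cpp CLI; file name and filler prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="model-IQ3_M.gguf", n_ctx=16384)  # 16k context window

long_prompt = "lorem ipsum dolor sit amet " * 2000  # roughly 10k+ tokens of filler
out = llm(long_prompt + "\nRoughly how many times was the filler repeated?",
          max_tokens=256)
print(out["choices"][0]["text"])
```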

thanks for your time!
