Failing to stop with TabbyAPI/exllamav2 0.1.8

#2
by thigger - opened

I'm not sure whether this is a Llama-3.1 issue, a problem with this exl2 quant, or with exllamav2 itself - I'd be interested to hear if anyone else is experiencing this.

The model gives a great output but fails to stop, and ends up repeating itself. At each point where it should stop, it outputs "assistant" before going round the loop again.

This is with a summarisation task and a fairly long context (~45k tokens), using the latest (v0.1.8) exllamav2. The text up to the repetition point is very good.
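
My guess at what's going on (an assumption, not verified): in the Llama-3 chat template, each turn ends with <|eot_id|>, and the next assistant turn begins with a header containing the literal word "assistant". If the backend doesn't treat <|eot_id|> as a stop token, generation runs straight through into the next turn's header, which would explain the stray "assistant" before each repeat. A quick way to see the template structure (a sketch assuming the transformers tokenizer files are available locally; the model path is a placeholder):

from transformers import AutoTokenizer

# Load the tokenizer that ships with the model (placeholder path).
tok = AutoTokenizer.from_pretrained("path/to/Llama-3.1-Instruct")

# Render one user turn with the generation prompt appended.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Introduce yourself."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
# ...<|eot_id|><|start_header_id|>assistant<|end_header_id|>...
print(tok.eos_token)  # should be <|eot_id|> for the instruct chat models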

I've just tried it with a shorter input (~4k tokens) and it does the same thing. Minimal repro:

from openai import OpenAI

# Point the OpenAI client at the local TabbyAPI server
# (default port 5000; substitute your own API key).
client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabbyapi-key")

completion = client.chat.completions.create(
    model="exllamav2",
    messages=[
        {"role": "system", "content": "Always answer in rhymes."},
        {"role": "user", "content": "Introduce yourself."},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(completion.choices[0].message.content)

Outputs:

Nice to meet you, I must say,
My name is AI, and I'm here to stay,
I'll answer your questions with ease and with flair,
And help you out, with my rhyming air!assistant

I'm a language model, so bright,
Designed to assist, day and night,
I'll chat with you, and have some fun,
And help you out, when your day is done!assistant

I'm a machine, with a mind so fine,
Learning and growing, all the time,
I'll help with your queries, with speed and with ease,
And bring a smile, to your digital tease!assistant

... and so on for quite a few more verses

This is using the 8bpw version with Q8 cache (though I've just tried turning cache quantisation off and it does the same thing), and no other non-standard TabbyAPI settings.

Fixed with the updated generation_config.json.
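
For anyone hitting this later, my understanding of the fix (please verify against the actual file): the updated generation_config.json for Llama-3.1 lists all of the end-of-turn tokens in eos_token_id, so backends that read it know to stop on <|eot_id|> as well. A quick sanity check against a local copy (the path is a placeholder):

import json

# Inspect the model's generation_config.json (placeholder path).
with open("path/to/model/generation_config.json") as f:
    cfg = json.load(f)

# For Llama-3.1 this should be something like [128001, 128008, 128009],
# where 128009 is <|eot_id|>; the ids will differ for other model families.
print(cfg.get("eos_token_id"))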

thigger changed discussion status to closed
