Feedback and samplers (it's surprisingly great in a very subtle way)

by BigBeavis - opened Sep 19

I've played around with the 12B GGUF Q5_K_M version and here are my observations.

Settings:
I'm using a customized Mistral 12B formatting and text completion preset (https://qu.ax/zaWP.json and https://qu.ax/dTZc.json). So far it's the best-performing preset i've tried, and it's definitely way better than SillyTavern's default Mistral preset. It works just as well with this model. I'm also using 12k context.

Setup:
I've tested this in a semi-complicated roleplay scenario with multiple plot threads. The story features several characters (from Chainsaw Man) that are not in a group chat (only 1 character card that houses them all), and a giant lorebook describing the world. Together we go into 'the backrooms' on an expedition. That's the setup. The story gets quite stressful: everyone is on edge, radio communication with other teams stops working, and there is a disagreement about whether to split up into two teams or stick together.

I use this chat as a baseline to compare different models all the time. Instead of judging them "by feel", i can spot objective differences. Did it remember every character in the team after X messages? Did it properly remember to stick together after the argument, or did it randomly forget about it and split us up anyway? How well does it portray every character's personality? Does it portray group dynamics well or just stick to generic one-liners?
For this model, for me, the main comparison was against NemoMix Unleashed 12B, as that is what i was using prior. I essentially wanted to see if this model retained the strengths of the base model like NemoMix Unleashed did, and if the text it generates is actually better or more or less the same.

Observations:
The model remembered to keep including every character from the group about as well as NemoMix Unleashed. It also remembered the presence of other teams. Basically, so far this just means that it performs up to the general expectation for a 12B Nemo model.

In the "let's split up after all as if that was the plan all along" test, RPMax did it more often than NemoMix Unleashed (it's bad when a model does it). But it wasn't prevalent, maybe 1 out of 4 generations overall. NemoMix Unleashed did it even less often, which had surprised me earlier. But it's still within an acceptable range for a 12B finetune, as regeneration mostly fixes it.

In random sections of the chat, RPMax would generate responses similar in meaning to NemoMix Unleashed, but the text is somewhat less raw; not exactly poetic, but it flows better and has a bit more flair to it. NemoMix Unleashed is quite blunt in comparison. RPMax also seemingly describes emotion better, and i'd even say makes the characters in general somewhat more emotive, which is an improvement, although not quantifiable. I wouldn't say it's a night and day difference, but it's definitely noticeable, even if it's subtle.

The differences are in subtle details, and this is where RPMax suddenly did something NemoMix Unleashed couldn't. In some cases, RPMax just seems to understand the overall context of the scene better and thus sticks to the mood with more precision. Sort of like it reads between the lines 1 level deeper. It's not something that amounts to an actual blunder or mistake if a model misses it, but when you see a model not miss it compared to one that missed, you immediately go - "Aha! It gets it!"

Examples (i'm obviously not going to include dozens of screenshots, so the choice of examples is a bit curated, but it's also not like i've set out to make NemoMix Unleashed look worse on purpose; it truly isn't hard to get a generally better output with RPMax than NemoMix)

Here's a VERY good example. In the chat, after the radio goes silent, i tell everyone to take out the batteries from their radios. The female leader generally goes "Wtf for?". This is a checkpoint where RPMax shines over NemoMix, as it understood by itself that i must be thinking about the possibility of the radios being tracked. I had to explain that to EVERY other 12B and 8B model i've tried! I don't know how, but RPMax KNOWS what's up in that moment - the lead female goes "Wtf for?...... Aaaaah, smaaaart, i see what you're thinking, User... It's so 'they' can't track us!". Now, it only does that about half the time, but it's so awesome that it does it at all. And i'll reiterate - NemoMix Unleashed 12B doesn't get it! Llama 8B models don't get it! RPMax 12B was the first one i came across in this size range that got it.

RPMax just gets it...

NemoMix is just confused (understandably so, but still)

But that's not even all there is to it yet, no. After that, she goes "You know, rookie, i'm starting to like you, keep that smart shit up." And i tried replying "Oh yeah? Enough to go somewhere after hours together?" And here NemoMix usually gets it wrong, sort of forgetting about the incredible tension of the setting and immediately switching into playful flirty mode, like "Hm, what're you suggesting? I'm not an easy woman to please, hehe", yada-yada.

NemoMix is down bad!

But RPMax though... Damn. It's like it read between the lines. It must've thought "the situation is very grim, and this must be just friendly banter to slightly lift the mood, with a very subtle hint, but the characters are fighting for survival against the unknown, so they have to stay on track." It really must've thought that, because that's what it gave me back every time. No nonsense. A brief retort with a hint, then click - switch back to problem solving mode. Because you know, that's more in line with what you'd expect from people that are in an extremely dangerous unknown dark territory.
And again, that difference is subtle. But it makes the difference.

RPMax seems to understand that these circumstances should take priority over flirting

A couple other examples where RPMax stayed truer to the overall context and character personalities:

Later on, when we spot a corpse from the other team, RPMax has fewer generations where the female leader for some reason is knocked off-balance mentally while examining him. But even when it goes that route, it does it with more nuance, and sort of makes it work anyway. And when it doesn't go there, it's just in general more coherent and less repetitive. 2 sets of examples for this checkpoint.

This is RPMax. Some reasoning going on. Some let's not forget about other characters going on. Some agreeing that splitting up is bad going on (yahoo!)

This is NemoMix Unleashed. More literal reiteration of my prompt and previously mulled-over points which don't need to be said again (looking for other teams). No mention of other characters. To be fair though, overall, this isn't a bad output at all. It's just, some things about it could be better, and i think in the screenshot above those things indeed were done better (i mentioned them)

While both versions basically achieve the same thing, even when trying to compare them without bias, i just lean towards RPMax's version. It ticks extra boxes.

Same checkpoint, different timeline. RPMax leaned into "she's shaken" territory, but i think it does it well enough that it actually gives her colorful characterization, and it's written quite expressively. I especially like what it did near the end after she didn't finish her sentence.

NemoMix is just confused, that's the entire emotion. And of course, reiteration of previously mulled over points (we need to find alpha and bravo, don't you know???)

Another example is that whenever i give several suggestions about what i think our team should do or avoid doing, with most 12B and smaller models she starts to kind of just follow my lead all the time, shifting the de facto leader status onto me and asking my advice at every other juncture that follows. But RPMax was noticeably better at keeping her true to the character card: even though she agreed with my suggestions, she still kept making the point that in the end it's her decision and we're following her lead, and she wasn't waiting for me to give my two cents all the time.

NemoMix wants me to hold its hand. Bruh, just figure it out!

Don't worry, RPMax can take it from here

Now, this doesn't mean RPMax is entirely immune to this. I think it's a base model thing. But if NemoMix is proactive at this checkpoint only about 1/4 of the time, RPMax gets closer to 2/3, maybe even 3/4. Sorry, i'm not doing a 100+ sample test, but combined with other similar instances, that's the general feel i got.

Conclusion:
The model lost the most problematic plot point (split-up) somewhat more often than NemoMix Unleashed 12B. Kind of weird, to be honest, since both are 12B Mistral Nemo at the end of the day. Maybe i'm imagining this, maybe not, or maybe it's just because the sample size isn't big enough; after all, i'm generating 5~10 examples per checkpoint rather than 100+. It still made this mistake less often than Fimbulvetr, and it's leagues above any 7-8B models, which get it wrong ALL the time. But that's just the 12B brains being functional. At least 3/4 of the time. Maybe it has to do with running at 12k context. Either way, aside from this particular thing, i didn't notice any other incoherence.
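
To put the small-sample caveat in numbers, here is a minimal Python sketch of a Wilson score interval for an observed failure rate. The counts are illustrative assumptions (a 1-in-4 rate seen in 8 regenerations versus in 100), not measurements from this test, and `wilson_interval` is a helper name made up for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts only, not measured results: the same 1-in-4 rate
# observed in 8 regenerations and in 100 regenerations.
for failures, n in [(2, 8), (25, 100)]:
    low, high = wilson_interval(failures, n)
    print(f"{failures}/{n}: plausible rate between {low:.2f} and {high:.2f}")
```

With counts like these, the interval at 8 samples comes out around three times wider than at 100, which is why a gap between "about 1 in 4" and "a bit less often" can't really be pinned down from 5~10 generations per checkpoint.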

In general, even though it mostly fell in line with what NemoMix Unleashed would say, RPMax would at least say it in fancier language, and at best would be more precise about the current undertones, leading to more moments where it "gets it" for some reason, where NemoMix Unleashed doesn't.

Overall, i think this definitely took the spot of NemoMix Unleashed 12B for me as my go-to 12B RP model for now. I've still been meaning to try Chronos Gold and Rocinante, but i think this model is in a very good place, and i don't see myself shelving it.

So, thanks for the model! Excited to see what'll be next from you, whether it's an update to this or a different model trained with a similar method. It really seems like whatever you did worked and there's merit in exploring this kind of LLM training further.

BigBeavis changed discussion title from Feedback (it's surprisingly great in a very subtle way) to Feedback and samplers (it's surprisingly great in a very subtle way) Sep 19

OwenArli

Arli AI org Sep 19

Wow! That is the most detailed writeup comparing RPMax with any other model so far. Thank you for that. That is definitely a very interesting read for me and will help with any future models I create.

I also appreciate you being detailed about your setup, prompt format, etc. It definitely helps to have a reference for what a good RP setup is like. I wasn't sure which prompt format and character description format to optimize for when creating RPMax, so I think it can definitely be made better in this aspect.

I'm still compiling ideas for improvements I can make in the next version, as well as a few other model ideas I have in mind. Creating RPMax and seeing all the positive (and negative) reception to it has definitely helped me confirm some things that I was testing out with it.

Would you be interested in testing out the other models (not just RPMax) as well through our service? I can give you access if you are willing to give high-quality feedback like this. You can contact me on Discord, my username is just owenarli.

> In the "let's split up after all as if that was the plan all along" test, RPMax did it more often than NemoMix Unleashed (it's bad when a model does it). But it wasn't prevalent, maybe 1 out of 4 generations overall. NemoMix Unleashed did it even less often, which had surprised me earlier. But it's still within an acceptable range for a 12B finetune, as regeneration mostly fixes it.

Can you explain a bit about what this test is like? I want to see if there are maybe bad examples of similar scenarios in the dataset that are causing this problem, or if it requires a different way to improve this aspect.

> RPMax seems to understand that these circumstances should take priority over flirting

I definitely feel like for some reason other RP models get tuned too much for flirty-ness, which ruins them in any other situation that's NOT just sexting a model lol. So I am very happy and impressed by your example of RPMax's reply in this situation.

Actually, I think a lot of other models' shortcomings in "reading the room" can mostly be attributed to too many flirty examples in the dataset as well.

> In general, even though it mostly fell in line with what NemoMix Unleashed would say, RPMax would at least say it in fancier language, and at best would be more precise about the current undertones, leading to more moments where it "gets it" for some reason, where NemoMix Unleashed doesn't.

What you said about RPMax having "fancier language" also confirms my thoughts that "creative writing" benchmarks are broken. A lot of users of the RPMax models have said that they felt it is less repetitive and more creative in the language it uses, while "creative writing" benchmarks like EQ-Bench rated RPMax among the worst. Even lower than default Nemo. Take a look at this: https://eqbench.com/creative_writing.html

There are other finetuned models like Euryale v2.1 that are rated worse than their default model but end up being much better in actual use as well. I was starting to think that what is considered "creative" in LLM-as-a-judge benchmarks is just the same old boring slop, because the judge models like those things, and I think seeing people's feedback on RPMax next to its benchmark scores really confirms it. Just spewing out slop is also an indication that a model does not have a good variety of writing styles and likes to stay on track with the slop that it likes.

Another interesting thing I noticed is that a lot of models that are considered very good for RP also often score worse on typical LLM benchmarks. Take a look at the Open LLM Leaderboard 2 for example: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

RPMax is actually quite a bit worse than default Nemo, and so is Magnum v2 which I've heard a lot of good things about too. Maybe it's kind of like how a lot of super smart academia people are awkward at socializing while a person who goes to parties and plays games is more fun to be with?

So if you do test out other models, it would be interesting to see the correlation between how you rate them and the models' scores in the benchmarks. Maybe we can deduce which benchmarks actually do correlate with creativity and reading-the-room smarts, or maybe we totally can't and it really should be all by feel lol!
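
One way to check that kind of correlation, once there are ratings for enough models, would be a rank correlation. Below is a minimal pure-Python sketch of Spearman's rho. The model names and every number in it are placeholders invented for the example, not real ratings or benchmark scores, and `average_ranks` and `spearman` are hypothetical helper names.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / math.sqrt(vx * vy)


# Placeholder numbers only: NOT real ratings or benchmark scores.
my_ratings = {"model_a": 8.0, "model_b": 6.5, "model_c": 7.0, "model_d": 4.0}
benchmark = {"model_a": 61.0, "model_b": 64.0, "model_c": 66.0, "model_d": 72.0}
names = sorted(my_ratings)
rho = spearman([my_ratings[n] for n in names], [benchmark[n] for n in names])
print(f"Spearman rho = {rho:.2f}")
```

A rho near +1 would mean the benchmark orders models the way the hands-on ratings do, near -1 the opposite, and near 0 no relationship. With only a handful of models the value is very noisy, so it's a rough hint at best.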

BigBeavis

Sep 20

The "let's split up after all as if that was the plan all along" test, in detail, is just me seeing whether a model can understand and stick to a certain decision. In the story, the female leader issued an order to split up into two search teams, but i argued against that, and she agreed. Some models immediately afterwards make her issue the order anyway, and it looks super scuffed, like "Oh yeah, that totally makes sense, let's stick together after all! So, here's the new order then, we split up into two search teams..." So, first i'm seeing whether the model will do that right at that point. Next i'm seeing if it forgets about that argument down the line. 7-8B models want to keep splitting up the group basically all the time. 11B Fimbulvetr did that a lot less often, but still quite often. NemoMix Unleashed did that quite rarely, while RPMax seemingly did that slightly more often, but like i said it probably has more to do with the small sample size, since that's just my observation by feel. I'd have to do 100 generations of the same prompt for each model to get actual rough percentages.
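
The regenerate-and-count procedure described above could be automated roughly like this. It's a minimal sketch under assumptions: `generate` stands for whatever backend call produces a reply (hypothetical here, replaced by a canned stub so the code runs on its own), and the keyword check `mentions_split` is a crude heuristic that would also flag replies like "we won't split up", so flagged replies still need reading by hand.

```python
from typing import Callable, Iterator


def count_reversals(
    generate: Callable[[str], str],
    prompt: str,
    is_reversal: Callable[[str], bool],
    runs: int = 10,
) -> int:
    """Regenerate the same prompt `runs` times and count replies flagged as reversals."""
    return sum(1 for _ in range(runs) if is_reversal(generate(prompt)))


def mentions_split(reply: str) -> bool:
    """Naive keyword check (an assumption, not a reliable classifier)."""
    text = reply.lower()
    return "split up" in text or "two teams" in text


def make_stub(replies: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real backend: returns canned replies in order."""
    remaining: Iterator[str] = iter(replies)
    return lambda prompt: next(remaining)


if __name__ == "__main__":
    stub = make_stub([
        "Fine, rookie. We stick together.",
        "Makes sense. New order: we split up into two teams.",
        "Stay close, everyone.",
    ])
    flagged = count_reversals(stub, "<chat history up to the checkpoint>", mentions_split, runs=3)
    print(f"{flagged} of 3 regenerations flagged")
```

Swapping the stub for a real call to the local backend and raising `runs` would give the rough per-model percentages without counting by hand.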

As for the benchmarks... The Creative Writing Benchmark V2 specifically seems weird. When Gemma 9B is scored No. 1, above GPT-4o... That by itself already says quite enough. On that same page, switching to Judgemark or EQ-Bench, at a glance, the results get more in line with average expectation. As for the leaderboard on Hugging Face, i don't know what specific benchmarks like IFEval, GPQA, MuSR, etc. mean. But my impression is that the average score is more about determining how reasonable and coherent the model is, rather than things like whether it's creative in an RP scenario. So, considering that RPMax is finetuned for RP, while default Nemo is tuned as an all-rounder, it kind of makes sense that RPMax would score lower; it should be worse at writing code and retelling Wikipedia articles to the user. Maybe when someone develops an RP-focused model from the ground up, not as a finetune, then we'll start getting benchmarks better focused on that area.

Thanks for offering me access to models on your website. I don't have enough free time to test lots of models extensively, but i'll contact you on Discord if i make up my mind about it. In the meantime, i might test the 8B version of RPMax and compare it to Stheno, because i'm curious if it might also beat it in subtle areas. But it will be another several days or even a week before i get around to that. It's also a good chance to compare the differences between 12B Mistral and 8B Llama 3 brains when both are finetuned similarly.
