
#6
opened by gileneo

Reflection 70B benchmarks are not real

The whole drama is described here:
https://x.com/shinboson/status/1832933753837982024

Matt Shumer is a fraud.

I literally could not reproduce a single thing that Twitter thread was posting. Also, the chat template doesn't work with lm_eval_harness in non-API mode, so I'm not sure what you guys are basing these assumptions on.
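For anyone who wants to see the failure themselves, here's roughly the step that breaks (a minimal sketch, assuming the model ID is this repo and `transformers` is installed; lm_eval_harness goes through the same rendering step in non-API mode):

```python
# Minimal sketch, assuming the model ID is this repo
# (mattshumer/Reflection-Llama-3.1-70B) and transformers is installed.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mattshumer/Reflection-Llama-3.1-70B")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]

# Non-API harness runs depend on this rendering step; if the repo's
# chat_template is missing or malformed, this call raises or emits a
# prompt with the wrong special tokens.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```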

[Screenshot: 2024-09-09, 12:51 AM]

The claim is that the requests sent to "Reflection 70B" through his hosted API were being routed to a model other than the one whose weights are hosted in this HF repo. The fact that you're unable to reproduce any of the responses other users saw from their API when you run it locally is further evidence that what was benchmarked is not what is in this repo.

@nisten You are missing the point here. The model uploaded here is not the same as the API that others can access (e.g. Artificial Analysis). You will not be able to reproduce the issues with the open-weight model. Nothing can be settled until Matt himself uploads and provides proof of the model, the evals, and the same exact prompts to test with his open-weight model (not a private one).

This comment has been hidden

Right, so TL;DR: Matt Shumer is a fraud using the media as pawns to promote his company. TikTok Matt.

Is there a single redditor here from r/localllama who has actually run the model locally?

Hang on, maybe these ARE actual fraud accounts commenting.

[Screenshot: 2024-09-09, 8:09 PM]

[Screenshot: 2024-09-09, 8:07 PM]

[Screenshot: 2024-09-09, 8:11 AM]

Let them upload it to GitHub Models if it's so good. Let's keep this space clear of scams, please.

@nisten, reply to this before you claim anything.

Can you read the titles of these posts? They're talking about these official "Reflection 70B" APIs. Did you test against these APIs?

I see you're posting results related to this Twitter thread.

Do you really understand what he is trying to prove? He's trying to prove that the LLM behind the "Reflection 70B" API is using the same tokenizer as Claude 3, GPT-4o, or whatever. The images he posted support his point.
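(For context, the fingerprinting idea is simple: different model families split the same text into different tokens, so token-boundary behavior over an API can hint at which model is really answering. Below is a rough sketch of the comparison; the probe string is my own arbitrary choice, `o200k_base` is GPT-4o's public encoding, and Claude 3's tokenizer isn't public, so that part can only be probed indirectly through API behavior, which is what the thread did.)

```python
# Rough sketch of tokenizer fingerprinting. Requires:
#   pip install tiktoken transformers
import tiktoken
from transformers import AutoTokenizer

probe = "über-straßenbahn 🚋 strawberry"  # arbitrary probe with unusual pieces

gpt4o = tiktoken.get_encoding("o200k_base")  # GPT-4o's public encoding
# Any Llama 3.1 tokenizer works here (note: this repo is gated on HF):
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

print("GPT-4o pieces:   ", len(gpt4o.encode(probe)))
print("Llama 3.1 pieces:", len(llama(probe)["input_ids"]))
# Claude 3's tokenizer is not public; that comparison has to be made
# indirectly through API behavior.
```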

What are you trying to prove here by posting this image? I think you're proving that what they uploaded here and what they host behind the API are totally different. You should explain in detail what you want to prove.

Also, I see you're using local models, so you're testing different models from the ones all these posts make claims about. A natural question: can you reproduce the evaluation results @mattshumer provided? Why not post your independent evaluation results here to help everyone decide whether they're genuine or overclaimed?

[Screenshot]

This is a local model.
You are coming from r/local_llama to complain about a model which you're NOT running locally.
Please, RUN IT LOCALLY, then post screenshots of WHAT YOU LOCALLY RAN!

COMPRENDE, CAPISCI, KUPTON?

Can you fix the chat_template, HERE, not on Reddit, not on Uncle Elon's Twitter, but HERE, and then run it BEFORE yapping?
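For reference, the local fix looks something like this: a sketch that overrides the shipped template with a standard Llama-3-style one (illustrative only; it is not necessarily the template the model was trained with):

```python
# Sketch: override a broken chat_template with a standard Llama-3-style
# Jinja template. Illustrative only -- not necessarily the template this
# model was trained with.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mattshumer/Reflection-Llama-3.1-70B")

tok.chat_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
    " + message['content'] + '<|eot_id|>' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
    "{% endif %}"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Count the r's in 'strawberry'."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # should now show clean Llama-3 headers and <|eot_id|> markers
```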

@nisten

Before you report your independent evaluation results, please disclose whether you and @mattshumer have a conflict of interest.
In particular, any relationship such as being friends, partners, knowing each other, etc.

No, I don't actually have one, for real. But I think we all need a new open-source license that's Apache for everyone except Reddit users.
So go back to r/localllama and tell them that Enigrand's yapping has inspired Nisten to make an open-source license that bans Reddit users.

@nisten

"No I don't actually have for real but I think we all need a new opensource license that's apache for everyone except reddit users."


Quoting from your Twitter posts:

I don't know what he's on about with torrents, he hasn't slept in 4 days.

Checkpoint 3 is working fine as far as I tested, albeit not great (it goes in loops), but IT WAS PASSING MOST OF THE TESTS y'all claimed it didn't.

OK, please just tell me what prompts to try?


EXPLAIN WHO HE IS or CHANGE YOUR DISCLOSURE. Also, please don't delete your Twitter posts.

Can you lay off the amphetamines for one day and actually try to run the model locally, please?
It seems to perform a lot better with my chat template applied. I tried the same test with Mistral Large and it didn't do the counting properly.

[Screenshot: 2024-09-10, 1:28 AM]

[Screenshot: 2024-09-10, 1:22 AM]

@nisten

CHANGE YOUR DISCLOSURE before you claim anything else.

Here are the evaluation results from Kristoph on Twitter.

These are the final notes from my work on the Reflection model. I tested the latest version of the model hosted by @hyperbolic_labs. I attempted a variety of different strategies, including varying the temperature and system prompt. Ultimately these had only a modest impact on the results. The final numbers I am presenting here use the prompt the Reflection team recommended. I did have to modify the question format somewhat to ensure Reflection properly generated the response (the instruction to output a letter choice was at the end of the prompt).

The TL;DR is that on virtually every benchmark the Reflection model was on par with the Llama 3.1 70B it is based on.

I ultimately ran through the entire corpus of MMLU Pro for biology, chemistry, physics, engineering, health, law, philosophy, and math, all 0-shot. In all but one case Reflection was within 1-2% of Llama 3.1 70B 0-shot and 1-3% below 5-shot. In all cases Llama 70B was called with no system prompt.

The one area where Reflection performed better was math, where it scored 3% higher.
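For anyone who wants to replicate these numbers locally rather than through an API, here is a rough starting point with lm-evaluation-harness (`pip install lm_eval`). Task names and arguments vary by harness version, so treat this as a sketch, not Kristoph's exact setup:

```python
# Rough sketch of a 0-shot MMLU-Pro run with lm-evaluation-harness.
# The task name "mmlu_pro" is an assumption; list the tasks shipped
# with your installed version before running.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mattshumer/Reflection-Llama-3.1-70B,dtype=bfloat16",
    tasks=["mmlu_pro"],
    num_fewshot=0,        # Kristoph's runs were all 0-shot
    batch_size="auto",
)
print(results["results"])
```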
