lmarena-ai/chatbot-arena-leaderboard · I tried to plot AGI on the same Elo scale by comparing to "both bad" and "tie" votes

(Or, rather, I had an LLM write it for me. (But another LLM checked it and said it was correct, so...))

When a battle is voted a tie, a hypothetical "ideal model" is treated as having tied with both models in that battle. When a battle is voted "both bad", the ideal model is treated as having beaten both. Since the ideal model never loses, its rating acts as an upper bound for Elo scores, and since the judgments come from humans, a model that scores that well all the time would be human-equivalent?
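
To make the idea concrete, here is a minimal sketch of the augmentation step. It is not the code from the gist. The field names (`model_a`, `model_b`, `winner`) and the vote labels (`"tie"`, `"tie (bothbad)"`) are assumptions about the battle data format, and `IDEAL`, `add_ideal_battles`, and `online_elo` are hypothetical names. The simple online Elo update is also just an assumption for illustration; a Bradley-Terry fit would work on the same augmented list. Battles with a clear winner are left alone, since nothing above says what the ideal model should do there.

```python
from collections import defaultdict

IDEAL = "ideal"  # hypothetical name for the virtual reference model


def add_ideal_battles(battles):
    """Return a new list: the original battles plus virtual ideal-model matches."""
    out = []
    for b in battles:
        out.append(b)
        a, m, w = b["model_a"], b["model_b"], b["winner"]
        if w == "tie":
            # Ideal model ties with both participants.
            out.append({"model_a": IDEAL, "model_b": a, "winner": "tie"})
            out.append({"model_a": IDEAL, "model_b": m, "winner": "tie"})
        elif w == "tie (bothbad)":
            # Ideal model beats both participants.
            out.append({"model_a": IDEAL, "model_b": a, "winner": "model_a"})
            out.append({"model_a": IDEAL, "model_b": m, "winner": "model_a"})
        # Clear wins add no virtual matches (assumption, see above).
    return out


def online_elo(battles, k=4, scale=400, base=10, init=1000):
    """Plain sequential Elo; any non-win label counts as a draw (0.5)."""
    ratings = defaultdict(lambda: init)
    for b in battles:
        ra, rb = ratings[b["model_a"]], ratings[b["model_b"]]
        expected_a = 1 / (1 + base ** ((rb - ra) / scale))
        if b["winner"] == "model_a":
            score_a = 1.0
        elif b["winner"] == "model_b":
            score_a = 0.0
        else:
            score_a = 0.5
        ratings[b["model_a"]] = ra + k * (score_a - expected_a)
        ratings[b["model_b"]] = rb - k * (score_a - expected_a)
    return dict(ratings)
```

Usage would be something like `online_elo(add_ideal_battles(battles))`, then reading off the rating stored under `IDEAL` and plotting it next to the real models.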

https://gist.github.com/endolith/e001d8b7811699cf9be822a774e7cb67