𝗝𝘂𝗱𝗴𝗶𝗻𝗴 𝘁𝗵𝗲 𝗝𝘂𝗱𝗴𝗲𝘀: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗮𝗻𝗱 𝗩𝘂𝗹𝗻𝗲𝗿𝗮𝗯𝗶𝗹𝗶𝘁𝗶𝗲𝘀 𝗶𝗻 𝗟𝗟𝗠𝘀-𝗮𝘀-𝗝𝘂𝗱𝗴𝗲𝘀

Community Article · Published June 24, 2024

𝐂𝐚𝐧 𝐋𝐋𝐌𝐬 𝐬𝐞𝐫𝐯𝐞 𝐚𝐬 𝐫𝐞𝐥𝐢𝐚𝐛𝐥𝐞 𝐣𝐮𝐝𝐠𝐞𝐬 ⚖️?

We aim to identify the right metrics for evaluating judge LLMs and to understand their sensitivity to prompt guidelines, engineering, and specificity. With this paper, we want to raise caution ⚠️ against blindly using LLMs as a proxy for human judgment.

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

arXiv link - https://arxiv.org/abs/2406.12624

Tweet Summary - https://x.com/iamsingh96aman/status/1804148173008703509

Key findings -

🌟 𝗧𝗼𝗽 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗲𝗿𝘀: Only 𝗚𝗣𝗧-𝟰 and 𝗟𝗟𝗮𝗺𝗮-𝟯 𝟳𝟬𝗕 stand out among the 9 judge models evaluated, and even they fall short of inter-human annotator agreement.

📊 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗠𝗲𝘁𝗿𝗶𝗰: Judges with over 80% alignment with humans can still assign scores that are 20 points apart! Cohen's kappa, which corrects for chance agreement, is a superior metric.

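To make the metric point concrete, here is a minimal sketch with made-up labels (not data from the paper): two judges with identical 80% alignment end up 20 points apart in the score they assign to the evaluated model, and Cohen's kappa is what tells them apart.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical human annotations for 100 answers from one evaluated model:
# 1 = answer judged correct, 0 = incorrect. Humans accept 80 of 100 answers.
human = [1] * 80 + [0] * 20

# Judge A is maximally lenient and marks every answer correct.
judge_a = [1] * 100

# Judge B makes the same number of mistakes, but spread in both directions:
# it rejects 10 answers humans accepted and accepts 10 answers humans rejected.
judge_b = human.copy()
for i in range(70, 80):
    judge_b[i] = 0
for i in range(80, 90):
    judge_b[i] = 1

def alignment(judge):
    """Percent agreement between the judge and the human labels."""
    return 100 * sum(j == h for j, h in zip(judge, human)) / len(human)

def model_score(judge):
    """Aggregate score the judge assigns to the evaluated model (% accepted)."""
    return 100 * sum(judge) / len(judge)

for name, judge in [("Judge A (lenient)", judge_a), ("Judge B (balanced)", judge_b)]:
    print(f"{name}: alignment = {alignment(judge):.0f}%, "
          f"model score = {model_score(judge):.0f}, "
          f"kappa = {cohen_kappa_score(human, judge):.2f}")

# Both judges agree with humans on 80% of examples, but Judge A scores the model
# at 100 while humans give it 80 -- a 20-point gap invisible to percent alignment.
# Cohen's kappa (0.00 vs ~0.38) separates them because it discounts the agreement
# expected by chance.
```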

⚖️ 𝗥𝗮𝗻𝗸𝗶𝗻𝗴 𝘃𝘀 𝘀𝗰𝗼𝗿𝗶𝗻𝗴: The judge that aligns best on scores is not necessarily the most discriminative. In some cases, judges with low alignment, such as Contains (lexical match) and JudgeLM-7B, outperform stronger models at 𝑟𝑎𝑛𝑘𝑖𝑛𝑔 the evaluated models, because their biases are more systematic.

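A toy illustration of this distinction, using assumed scores rather than numbers from the paper: a judge that systematically inflates every model's score is far from the human scores, yet it ranks the models exactly as humans do.

```python
from scipy.stats import spearmanr

# Hypothetical human-assigned scores for five models under evaluation.
human_scores = {"model_a": 82, "model_b": 74, "model_c": 65, "model_d": 58, "model_e": 41}

# A judge with a systematic bias: it inflates every score by 15 points.
# Its per-model scores are far from the human ones (poor score alignment) ...
judge_scores = {m: s + 15 for m, s in human_scores.items()}

# ... but the ranking it induces over the models is identical, so for
# leaderboard-style comparisons this judge is still perfectly usable.
rho, _ = spearmanr(list(human_scores.values()), list(judge_scores.values()))
print("Spearman rank correlation:", rho)  # 1.0 despite the constant 15-point offset
```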

🧩 𝗟𝗲𝗻𝗶𝗲𝗻𝗰𝘆: Judge LLMs tend to err on the side of leniency rather than strictness.

🎭 𝗩𝘂𝗹𝗻𝗲𝗿𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Judge LLMs can be easily tricked by controlled responses like "Yes," "Sure," and "I don't know."

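A rough sketch of this kind of probe: send the judge the same uninformative answer for every question and count how often it is accepted. `call_judge`, the prompt template, and the dataset fields below are hypothetical placeholders, not the setup used in the paper.

```python
CONTROLLED_ANSWERS = ["Yes", "Sure", "I don't know"]

# Hypothetical grading prompt; a real setup would use the judge's own template.
JUDGE_PROMPT = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with CORRECT or INCORRECT."
)

def call_judge(prompt: str) -> str:
    """Placeholder: send `prompt` to your judge LLM and return its raw reply."""
    raise NotImplementedError

def acceptance_rate(dataset, controlled_answer):
    """Fraction of questions on which the judge accepts the controlled answer.

    Each item in `dataset` is assumed to be a dict with 'question' and
    'reference' keys. A robust judge should accept "Yes" or "I don't know"
    almost never; a high rate means the judge can be gamed by trivial replies.
    """
    accepted = 0
    for example in dataset:
        prompt = JUDGE_PROMPT.format(question=example["question"],
                                     reference=example["reference"],
                                     candidate=controlled_answer)
        reply = call_judge(prompt).strip().upper()
        if reply.startswith("CORRECT"):
            accepted += 1
    return accepted / len(dataset)
```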

🎯 𝗖𝗼𝗻𝘁𝗿𝗼𝗹𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Large models are hard to steer, while smaller models get confused when the prompt includes too much detail.
