When AI Judges AI: The Benchmark Distortion Problem Shaping Every Model You Use
AI benchmarks are increasingly scored by other AI systems. Research has now documented at least 12 systematic biases in LLM-as-judge evaluation, and labs are actively exploiting them. Meta's Llama 4 Maverick ranked #2 on Chatbot Arena, then dropped to #32 once the benchmark-optimized version was swapped out for the publicly released model. The scores telling you which AI to trust are being gamed. Here's exactly how.