
The Bench

Perspectives on AI, human alignment, and the forces shaping the age of autonomous systems.

AI Evaluation · LLM-as-Judge · AI Benchmarks · Benchmark Gaming · AI Alignment · AI Research · Reward Hacking · Human Judgment

When AI Judges AI: The Benchmark Distortion Problem Shaping Every Model You Use

AI benchmarks are increasingly scored by other AI systems. Research has documented at least 12 systematic biases in LLM-as-judge evaluation, and labs are actively exploiting them. Meta's Llama 4 ranked #2 on Chatbot Arena, then fell to #32 when the benchmark-tuned variant was replaced with the publicly released model. The scores telling you which AI to trust are being gamed. Here's exactly how.

April 18, 2026 | 8 min read
AI Evaluation · Frontier Models · Human Judgment · Alignment · Reasoning Models · Collective Intelligence

The Newest Models Are More Capable. Are They More Human?

Claude Sonnet 4.6, Opus 4.6, o3, Codex, and GPT-5.3 represent a step change in AI reasoning. But raw capability isn't the same as alignment. As these systems take on more judgment-heavy tasks (code review, ethical dilemmas, hiring decisions), the question isn't whether they're smarter. It's whether they're evaluating the way humans do.

March 3, 2026 | 9 min read