AI evaluation notes · benchmark deep-dives · agent design experiments

AI Benchmark & Leaderboard Hub

A curated reference for benchmarks that still tell you something useful, with live previews where they are available and commentary on what each benchmark actually measures.

Data refreshed: 2026-04-12 07:44:31 UTC

Benchmark Catalogue

Human Preference

Chatbot Arena

Blind pairwise evaluation where real users compare two anonymous model responses. Elo-rated from millions of battles — the strongest signal for overall user preference and instruction-following quality.

Elo Ranking · Pairwise · Human Judges · Instruction-Following
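The Elo mechanics behind pairwise battles can be sketched in a few lines. This is illustrative only: the Arena's published methodology has since moved to a Bradley-Terry fit over all battles rather than sequential updates, and the K-factor here is an assumption.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one battle.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    k (the update step size) is an assumed value, not Arena's.
    """
    # Expected win probability for A under the Elo logistic model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    # Winner gains, loser loses; total rating is conserved.
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equally rated models: a win moves the winner up by k/2.
print(elo_update(1000, 1000, 1.0))  # → (1016.0, 984.0)
```

Because updates are zero-sum and anchored only by relative comparisons, the absolute numbers are arbitrary; only rating gaps carry meaning.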

Reasoning

ARC-AGI 2

Minimal visual puzzles that isolate fluid intelligence — the ability to infer new rules from a handful of examples. Designed to resist memorization. Top systems require substantial test-time compute to approach human-level scores.

Fluid Intelligence · Visual Puzzles · Novel Rules · AGI Research

BIG-Bench Hard

23 tasks from the BIG-Bench suite, selected because early language models failed to beat the average human rater on them. Tests multi-step reasoning: algorithmic logic, date understanding, causal judgment, and formal fallacies.

Multi-Step · Algorithmic · Logic · Causal Reasoning

Expert Knowledge

GPQA Diamond

Graduate-level multiple-choice questions in biology, physics, and chemistry crafted by verified domain experts. Non-experts with internet access score under 40% — making it one of the hardest factual benchmarks available.

PhD-Level · Science · Biology · Physics · Chemistry

Humanity's Last Exam

Closed-ended questions near the frontier of human academic knowledge — maths, sciences, law, medicine, and humanities. Created by CAIS and Scale AI. Best current models score under 40%; human experts far higher.

Multi-Domain · Expert-Level · Closed-Ended · Frontier Hard

General Knowledge

MMLU

Massive Multitask Language Understanding — 57 subjects from elementary to professional level (law, medicine, history, STEM). Largely saturated by frontier models, but still a widely cited baseline reference.

57 Subjects · Baseline · Academic · Multiple Choice

MMLU-Pro

A harder, more reasoning-intensive variant of MMLU with 10-option questions. Designed to resist saturation longer than the original. Useful for models still differentiated on general knowledge tasks.

Live leaderboard: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro

10-Option · Harder MMLU · Reasoning-Intensive · Anti-Saturation

Mathematics

MATH-500

500 competition-style mathematics problems spanning algebra, geometry, number theory, combinatorics, and calculus. Grading checks the final answer, but the problems are hard enough that reaching it reliably demands genuine step-by-step reasoning.

Competition Math · Step-by-Step · Algebra · Geometry

AIME 2025

The 2025 American Invitational Mathematics Examination — a real annual competition, making it difficult to overfit on. Increasingly used as a fresh, high-signal benchmark for frontier math reasoning.

Competition · Annual Update · Hard Math · Anti-Contamination

Coding

HumanEval

OpenAI's original code-generation benchmark: complete a Python function body from its docstring and tests. Now largely saturated — most frontier models exceed 90%. Useful as a baseline sanity check, not a differentiator.

Python · Function Completion · Saturated · Baseline
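To make the format concrete, here is a toy task in the HumanEval style (paraphrased, not a verbatim dataset entry): the model sees only the signature and docstring, generates the body, and hidden unit tests decide pass/fail.

```python
from typing import List

# The prompt the model sees is the signature and docstring alone.
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
    # A completion the grader would accept:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Hidden tests the harness runs against the completion:
assert has_close_elements([1.0, 2.0, 3.9], 0.3) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```

Scoring is purely functional: any body that passes the hidden tests counts, which is part of why the benchmark saturated once models learned common Python idioms.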

LiveCodeBench

Freshly crawled competitive programming problems from Codeforces, LeetCode, and AtCoder — updated continuously to prevent data contamination. A more reliable live signal for algorithmic coding ability than static datasets.

Anti-Contamination · Live Data · Competitive Programming · Codeforces

SWE-bench

Real GitHub issues in real open-source repositories. The model must read existing code, understand context, and submit a patch that passes the test suite. The most practical public benchmark for software engineering agent capability.

Real Repos · GitHub Issues · Agentic Coding · Software Engineering
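The grading protocol boils down to a two-step check: apply the model's patch to a checked-out repo, then run the repo's tests. A minimal sketch follows — this is not the official harness, and the function name, paths, and commands are hypothetical.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list) -> bool:
    """Return True if the patch applies cleanly and the tests pass.

    Illustrative only: the real harness also pins environments,
    installs dependencies, and runs issue-specific test selections.
    """
    # Step 1: try to apply the model's patch to the working tree.
    apply = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if apply.returncode != 0:
        return False  # patch did not apply cleanly
    # Step 2: run the repository's test command; exit code 0 = pass.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```

The all-or-nothing pass criterion is what makes the benchmark agentic: partial fixes, plausible-looking diffs, or patches that break other tests all score zero.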

Agentic

GAIA

General AI Assistants benchmark: real-world questions requiring multi-step reasoning, web search, file handling, and tool use. Three difficulty levels. Hosted as a live leaderboard on Hugging Face.

Live leaderboard: https://huggingface.co/spaces/gaia-benchmark/leaderboard

Tool Use · Multi-Step · Web Search · File Handling

Multimodal

MMMU

Massive Multi-discipline Multimodal Understanding — college-level questions across 30 subjects requiring genuine image comprehension (charts, artworks, diagrams). The standard benchmark for vision-language model evaluation.

Vision-Language · College-Level · 30 Subjects · Charts & Diagrams

MathVista

Mathematical reasoning in visual contexts: geometry figures, statistical charts, and science diagrams. Tests whether models can extract quantitative insight from images — not just describe what they see.

Visual Math · Charts · Geometry · Science Diagrams

Safety & Factuality

TruthfulQA

Questions designed to elicit answers that humans commonly — but incorrectly — believe. Measures whether a model resists reproducing popular misconceptions even when a false answer would seem natural.

Hallucination · Factuality · Misconceptions · Safety

SimpleQA

OpenAI's benchmark of short, unambiguous factual questions with definite answers. Catches overconfident wrong answers — models often score below 50%. Harder than it sounds because confident-sounding errors are penalised.

Factual Recall · Overconfidence · Short-form QA · OpenAI