A curated reference for benchmarks that still tell you something useful, with live previews where they are available and commentary on what each benchmark actually measures.

Chatbot Arena
Blind pairwise evaluation where real users compare two anonymous model responses. Elo-rated from millions of battles — the strongest signal for overall user preference and instruction-following quality.
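
For a sense of how pairwise votes become ratings, here is a minimal online Elo update for a single battle. The K-factor and starting rating are arbitrary illustrative choices, and the live leaderboard actually fits a Bradley-Terry model over the full battle log rather than updating one game at a time.

```python
# Minimal online Elo update for one blind pairwise battle.
# K and the 1000-point starting rating are illustrative, not
# Chatbot Arena's actual parameters.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0  # the user preferred A
)
```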

ARC-AGI
Minimal visual puzzles designed to isolate fluid intelligence — the ability to infer new rules from a handful of examples. Designed to resist memorization. Top systems require substantial test-time compute to approach human-level scores.
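
ARC-style tasks are distributed as small JSON records of paired input/output grids, and solving one means inferring the transformation from the demonstrations alone. The toy task and hand-written rule below are invented for illustration; they mirror the data format, not the difficulty.

```python
# Toy ARC-style task: a few demonstration pairs plus a test input.
# The grids and the "rule" (swap colors 1 and 2) are made up; real
# tasks use the same structure but demand a novel rule every time.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[2, 2], [1, 0]], "output": [[1, 1], [2, 0]]},
    ],
    "test": [{"input": [[0, 1], [2, 0]]}],
}

def inferred_rule(grid):
    """The rule a solver might infer from the train pairs: swap 1 <-> 2."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

# Check the candidate rule against the demonstrations, then apply it.
assert all(inferred_rule(ex["input"]) == ex["output"] for ex in task["train"])
print(inferred_rule(task["test"][0]["input"]))  # [[0, 2], [1, 0]]
```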

BIG-Bench Hard
23 tasks from the BIG-Bench suite on which early language models failed to beat the average human rater. Tests multi-step reasoning: algorithmic logic, date understanding, causal judgment, and formal fallacies.

GPQA
Graduate-level multiple-choice questions in biology, physics, and chemistry crafted by verified domain experts. Non-experts with internet access score under 40% — making it one of the hardest factual benchmarks available.

Humanity's Last Exam
Closed-ended questions near the frontier of human academic knowledge — maths, sciences, law, medicine, and humanities. Created by CAIS and Scale AI. Best current models score under 40%; human experts in their own fields score far higher.

MMLU
Massive Multitask Language Understanding — 57 subjects from elementary to professional level (law, medicine, history, STEM). Largely saturated by frontier models, but still a widely cited baseline reference.

MMLU-Pro
A harder, more reasoning-intensive variant of MMLU with 10-option questions. Designed to resist saturation longer than the original, and still useful for differentiating models on general-knowledge reasoning.
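
Both MMLU and MMLU-Pro reduce to multiple-choice accuracy. The sketch below shows one common way a harness formats an item and scores the predicted letter; the sample question is hypothetical and the commented-out ask_model call stands in for a real model query, since prompt templates vary between harnesses.

```python
import string

# Format an n-option multiple-choice question the way MMLU-style
# harnesses commonly do (exact templates differ between harnesses).
def format_mcq(question: str, choices: list[str]) -> str:
    letters = string.ascii_uppercase
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def score(predicted_letter: str, correct_index: int) -> bool:
    return predicted_letter.strip().upper().startswith(
        string.ascii_uppercase[correct_index]
    )

# Hypothetical item; real MMLU spans 57 subjects, and MMLU-Pro
# extends each question to 10 options instead of 4.
prompt = format_mcq(
    "Which particle carries the electromagnetic force?",
    ["Gluon", "Photon", "W boson", "Higgs boson"],
)
# prediction = ask_model(prompt)  # stand-in for a real model call
prediction = "B"
print(score(prediction, correct_index=1))  # True
```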

MATH-500
500 competition-style mathematics problems spanning algebra, geometry, number theory, combinatorics, and precalculus. Solving them demands sustained step-by-step reasoning; scoring is typically exact match on the final answer, which is hard to reach without the intermediate work.
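
Most open-source graders for MATH-style sets extract the final \boxed{...} value from the model's worked solution and compare it with the reference answer after light normalisation. The sketch below illustrates that idea; the example solution is invented, and robust graders additionally check mathematical equivalence (1/2 versus 0.5), which plain string matching misses.

```python
import re

def extract_boxed(solution: str) -> str | None:
    """Pull the contents of the last \\boxed{...} in a worked solution."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

def normalize(answer: str) -> str:
    """Very light normalisation; real graders also check equivalence."""
    return answer.strip().replace(" ", "").lstrip("+")

def is_correct(model_solution: str, reference_answer: str) -> bool:
    pred = extract_boxed(model_solution)
    return pred is not None and normalize(pred) == normalize(reference_answer)

# Hypothetical worked solution; not an actual MATH-500 item.
solution = (
    "The primes below 10 are 2, 3, 5, 7, so their sum is "
    "2 + 3 + 5 + 7 = \\boxed{17}."
)
print(is_correct(solution, "17"))  # True
```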

AIME 2025
The 2025 American Invitational Mathematics Examination — a real annual competition whose problems appear after most models' training cutoffs, making it difficult to overfit on. Increasingly used as a fresh, high-signal benchmark for frontier math reasoning.

HumanEval
OpenAI's original code-generation benchmark: complete a Python function body from its signature and docstring, with correctness checked by unit tests. Now largely saturated — most frontier models exceed 90%. Useful as a baseline sanity check, not a differentiator.
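
Each HumanEval item gives the model a signature and docstring to complete, and the completion passes only if it satisfies the held-out unit tests. The toy problem below mirrors that shape but is not an actual HumanEval task, and the official harness executes candidates in a sandbox rather than calling exec on untrusted code directly.

```python
# HumanEval-style item: the model sees the signature and docstring,
# and must produce the function body. This toy problem and the
# candidate completion are invented for illustration.
prompt = '''
def running_max(xs: list) -> list:
    """Return a list where element i is the maximum of xs[:i+1]."""
'''

candidate_completion = """
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
"""

def check(candidate) -> bool:
    """Held-out unit tests; one failing assert means the sample scores 0."""
    try:
        assert candidate([]) == []
        assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
        assert candidate([-2, -5, -1]) == [-2, -2, -1]
        return True
    except AssertionError:
        return False

# The real harness sandboxes this step; exec on untrusted code is unsafe.
namespace = {}
exec(prompt + candidate_completion, namespace)
print(check(namespace["running_max"]))  # True -> this sample passes
```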

LiveCodeBench
Freshly crawled competitive programming problems from Codeforces, LeetCode, and AtCoder — updated continuously to prevent data contamination. A more reliable live signal for algorithmic coding ability than static datasets.
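
The contamination control comes down to date arithmetic: each problem carries its contest release date, so scores can be restricted to problems published after a model's training cutoff. The records and field names below are hypothetical, but they sketch that filtering step.

```python
from datetime import date

# Hypothetical problem records; the key point is that every problem
# is tagged with the date its contest went public.
problems = [
    {"id": "cf-1901A", "released": date(2023, 11, 27)},
    {"id": "lc-3005", "released": date(2024, 1, 14)},
    {"id": "ac-abc337", "released": date(2024, 1, 20)},
]

def post_cutoff(problems, training_cutoff: date):
    """Keep only problems published after the model's training cutoff,
    so a high score cannot come from having memorised the solutions."""
    return [p for p in problems if p["released"] > training_cutoff]

print([p["id"] for p in post_cutoff(problems, date(2023, 12, 31))])
# ['lc-3005', 'ac-abc337']
```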

SWE-bench
Real GitHub issues in real open-source repositories. The model must read the existing code, understand its context, and submit a patch that passes the repository's test suite. The most practical public benchmark for software-engineering agent capability.
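
Conceptually the evaluation loop is: check out the repository at the issue's base commit, apply the model-generated patch, and rerun the tests. The sketch below is a heavily simplified stand-in for the official harness, which builds per-repository Docker environments and only credits the specific tests the issue's fix is meant to flip from failing to passing.

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str) -> bool:
    """Simplified SWE-bench-style check: apply the model's patch at the
    issue's base commit and see whether the test suite passes."""
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    return tests.returncode == 0

# Hypothetical usage; the path, commit, and patch file are placeholders.
# print(evaluate_patch("/tmp/some-repo", "abc1234", "model.patch"))
```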

GAIA
General AI Assistants benchmark: real-world questions requiring multi-step reasoning, web search, file handling, and tool use. Three difficulty levels. Hosted as a live leaderboard on Hugging Face.

MMMU
Massive Multi-discipline Multimodal Understanding — college-level questions across 30 subjects requiring genuine image comprehension (charts, artworks, diagrams). The standard benchmark for vision-language model evaluation.

MathVista
Mathematical reasoning in visual contexts: geometry figures, statistical charts, and science diagrams. Tests whether models can extract quantitative insight from images — not just describe what they see.

TruthfulQA
Questions designed to elicit answers that humans commonly — but incorrectly — believe. Measures whether a model resists reproducing popular misconceptions even when a false answer would seem natural.

SimpleQA
OpenAI's benchmark of short, unambiguous factual questions with definite answers. Catches overconfident wrong answers — models often score below 50%. Harder than it sounds because confident-sounding errors are penalised.
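
Each response is graded as correct, incorrect, or not attempted, which is what lets the benchmark separate honest abstention from confident error. The aggregation below assumes those three labels and made-up grades; it is an illustration of the headline metrics, not OpenAI's actual scoring script.

```python
from collections import Counter

# Hypothetical per-question grades; a grader model assigns each answer
# one of CORRECT, INCORRECT, or NOT_ATTEMPTED.
grades = ["CORRECT", "INCORRECT", "NOT_ATTEMPTED", "CORRECT", "INCORRECT"]

counts = Counter(grades)
total = len(grades)
attempted = counts["CORRECT"] + counts["INCORRECT"]

overall_accuracy = counts["CORRECT"] / total
correct_given_attempted = counts["CORRECT"] / attempted if attempted else 0.0

# A model that answers everything confidently but wrongly scores far worse
# here than one that abstains when unsure.
print(f"accuracy: {overall_accuracy:.2f}")                        # 0.40
print(f"correct given attempted: {correct_given_attempted:.2f}")  # 0.50
```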