Humanity's Last Exam: The Benchmark That Still Makes Top Models Struggle


The Problem With Easy Exams

Many headline-grabbing AI exam results are less informative than they first appear. If a model has been trained on huge amounts of internet text, textbooks, solution manuals, and forum discussions, then a strong score on a familiar-style benchmark does not automatically mean deep understanding.

That is the gap Humanity's Last Exam (HLE) tries to close.

The core idea is straightforward: stop asking models polished, widely repeated questions and start asking questions that sit much closer to the edge of real expert knowledge.


What is Humanity's Last Exam?

HLE was introduced by the Center for AI Safety (CAIS) and Scale AI as a benchmark for measuring expert-level academic capability. The project was created with help from a large network of researchers and specialists across many institutions.

The benchmark covers a very wide range of subjects while still being hard enough to separate frontier models. The public materials describe it as a broad, closed-ended academic benchmark: the questions have definite answers, can be graded reliably, and are meant to be difficult even for highly educated humans outside their own specialty.

One important detail: HLE is not just a pile of trivia. The challenge comes from depth, precision, and domain-specific reasoning.

The subjects span the full breadth of human expertise:

  • Mathematics: algebraic topology, analytic number theory, combinatorial geometry
  • Sciences: quantum field theory, organic synthesis, molecular biology
  • Medicine: rare disease diagnosis, pharmacokinetics edge cases
  • Law: international humanitarian law, contract edge cases
  • Humanities: medieval economic history, classical philology, Sanskrit literature
  • Engineering: chip microarchitecture, structural failure analysis

The questions include multiple-choice and short-answer formats suitable for automated grading. That matters because it keeps the benchmark measurable while still asking questions that are difficult to answer by shallow pattern-matching or quick web retrieval.
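Automated grading of closed-ended answers like this is typically a normalized exact match against a gold answer. A minimal sketch of the idea; the normalization rules below are illustrative, not HLE's actual grader:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, trim, drop stray punctuation, and collapse whitespace."""
    answer = answer.strip().lower()
    answer = re.sub(r"[^\w\s%-]", "", answer)  # keep word chars, spaces, %, -
    return re.sub(r"\s+", " ", answer)

def grade(prediction: str, gold: str) -> bool:
    """Closed-ended grading: normalized exact match against the gold answer."""
    return normalize(prediction) == normalize(gold)

print(grade("  The Euler characteristic is 2. ", "the euler characteristic is 2"))  # True
print(grade("B", "C"))  # False
```

The payoff of this design is that grading stays cheap and reproducible even when the questions themselves require deep expertise to answer.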


What Do the Questions Actually Look Like?

The easiest way to understand HLE is to compare it with a normal benchmark question.

A Normal Benchmark Question:

"What is the powerhouse of the cell?"

Almost any modern model will answer that instantly.

An HLE-style Question:

"A researcher observes that a specific mitochondrial membrane protein causes abnormal cristae morphology when overexpressed in yeast. Describe the likely mechanism by which this affects ATP synthase dimerization and what downstream consequence this has on the inner membrane potential gradient under hypoxic conditions."

To answer that well, a model has to do more than recall a biology fact. It needs to:

  1. Know what cristae morphology is and why it matters.
  2. Know how ATP synthase dimers are organized along cristae ridges.
  3. Reason about what happens to the electrochemical gradient when that structure is disrupted.
  4. Apply that reasoning under hypoxic conditions, where the system is already under stress.

That is the basic shape of HLE. The benchmark forces the model to connect specialized concepts, not just retrieve a memorized sentence.

In practice, it feels less like a school quiz and more like being dropped into the middle of a graduate seminar.


The Scoreboard Still Looks Tough

The official HLE results make the point quickly: this benchmark is still genuinely difficult.

At launch (January 2025), frontier models barely got off the ground:

  • GPT-4o: ~2.7%
  • o1: ~8.0%
  • DeepSeek-R1: ~8.5%
  • Human experts: far higher

That is the important part: the gap was not marginal. It was large.

More recent public results show clear improvement, but not saturation:

  • Gemini 3 Pro: ~38.3%
  • GPT-5: ~25.3%
  • Grok 4: ~24.5%
  • Gemini 2.5 Pro: ~21.6%
  • GPT-5-mini: ~19.4%
  • Claude 4.5 Sonnet: ~13.7%

Those numbers are much better than the original launch scores, but they still tell a clear story: even strong models do not look comfortably expert on HLE.

What matters most: HLE is one of the clearest reminders that benchmark progress now depends heavily on reasoning quality, not just a bigger base model. If I were assessing a model for specialist work, I would treat a large gap between easy exam performance and HLE performance as a warning sign. A model scoring 95% on MMLU can still struggle badly once the questions require domain depth, precision, and multi-step reasoning under uncertainty.
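That warning sign can be made concrete as a crude ratio: how much of a model's easy-benchmark accuracy survives on HLE. A toy sketch; the 0.90 easy-exam score is an assumed placeholder, and only the HLE number comes from the table above:

```python
def depth_gap(easy_score: float, hle_score: float) -> float:
    """Fraction of easy-benchmark accuracy that survives on HLE."""
    if easy_score <= 0:
        raise ValueError("easy_score must be positive")
    return hle_score / easy_score

# HLE ~25.3% for GPT-5 per the public results above; the 0.90 is assumed.
ratio = depth_gap(easy_score=0.90, hle_score=0.253)
print(f"{ratio:.0%} of easy-exam accuracy survives on HLE")  # 28%
```

A ratio well below 1.0 is exactly the "scores 95% on MMLU but struggles on HLE" pattern described above.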


Why "Thinking Time" Changed Everything

One practical lesson stands out from HLE: harder questions reward models that are given more room to think.

Suppose you are working through a difficult question about a rare metabolic disorder.

Option A (Instant answer): Give the first answer that comes to mind based on pattern recognition.

Option B (Extended reasoning): Work through it step by step—first ruling out common conditions, then cross-referencing the specific enzyme pathway, then checking what happens to downstream metabolites, then constructing a consistent explanation.

For easy prompts, both approaches can look fine. For HLE-style questions, the difference is substantial.

That is why HLE is useful in practice. It exposes a common failure mode: answers that sound crisp and confident but fall apart under deeper scrutiny. If you are evaluating models for legal analysis, scientific workflows, or technical research support, that is exactly the behavior you want to catch.
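The two options above can be sketched as prompt scaffolds. Everything here is a hypothetical template, not any provider's actual reasoning mode:

```python
def instant_prompt(question: str) -> str:
    """Option A: ask for the final answer with no intermediate work."""
    return f"{question}\nAnswer with the final answer only."

def extended_prompt(question: str) -> str:
    """Option B: walk the model through the step-by-step structure above."""
    steps = [
        "1. Rule out the common explanations first.",
        "2. Trace the specific pathway or mechanism involved.",
        "3. Check the downstream consequences for consistency.",
        "4. Only then state a final answer consistent with every step.",
    ]
    return f"{question}\nWork through the following before answering:\n" + "\n".join(steps)

q = "Which enzyme deficiency best explains this metabolite profile?"
print(instant_prompt(q))
print(extended_prompt(q))
```

On easy prompts the two templates behave similarly; on HLE-style questions, the scaffolded version gives the model room to catch its own first-guess errors.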


The Name: Why "Last Exam"?

The name is intentionally provocative. The argument behind it is that we may be approaching a point where standard academic tests stop being useful as capability checks, because frontier models eventually absorb and solve them too easily.

That does not mean HLE settles the broader AGI question, and it does not mean current models are close to replacing domain experts wholesale.

What it does mean is this:

  • AI is approaching expert-level performance on closed-ended academic questions in some narrow domains.
  • It still struggles on questions that require precise reasoning under uncertainty.
  • It still has clear limitations once you move from textbook familiarity to frontier-style knowledge.

So the benchmark is best understood as a stress test for genuine depth, not as a magical one-number measure of intelligence.


What This Means in Practice

For developers building AI-powered products:

  • Treat current frontier models as strong assistants, not final authorities, in medicine, law, finance, and advanced technical work.
  • Use higher-effort reasoning modes for difficult tasks, especially if the work involves multi-step analysis or specialist terminology.
  • Build review loops around the model output. HLE is a reminder that polished answers can still be wrong.

For evaluating AI tools:

  • Be careful with benchmark claims based on familiar public exams. They are useful, but they are no longer enough.
  • Prefer evaluations that include novelty, domain depth, and questions the model is unlikely to have effectively memorized.
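One cheap novelty check along these lines is paraphrase consistency: a model that only memorized the surface form of a question tends to change its answer when the question is reworded. A sketch with a stubbed model standing in for a real API call; the canned answers are invented for illustration:

```python
def consistent_under_paraphrase(model, question: str, paraphrases: list[str]) -> bool:
    """True if the model gives the same answer to every rewording of the question."""
    baseline = model(question)
    return all(model(p) == baseline for p in paraphrases)

# Stub model keyed on surface form, standing in for a real model call:
canned = {
    "What is 7 * 8?": "56",
    "Compute the product of 7 and 8.": "56",
    "Multiply seven by eight.": "49",  # inconsistent answer -> suspicious
}
model = canned.get

print(consistent_under_paraphrase(model, "What is 7 * 8?",
                                  ["Compute the product of 7 and 8.",
                                   "Multiply seven by eight."]))  # False
```

A False here does not prove memorization, but it is a red flag worth investigating before trusting a headline benchmark score.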

For the broader picture:

  • Human expertise is uneven too. A great biologist is not automatically strong in medieval history or analytic number theory.
  • That makes HLE useful not just for comparing models to humans, but for showing how domain-specific real expertise actually is.

The Takeaway

ARC-AGI tells us whether AI can learn a brand-new visual rule from scratch. MMLU tells us whether AI has absorbed the world's textbooks. LMSYS Chatbot Arena tells us which model's answers humans prefer.

Humanity's Last Exam asks a harder question: can a model handle closed-ended problems near the frontier of expert human knowledge?

Right now, the answer is: sometimes, and not yet reliably.

That is why HLE matters. It is one of the few prominent benchmarks where strong scores still have real informational value.

If a model improves here, it is a better signal that something meaningful changed.

If I were extending this piece, I would add one short worked example showing how a plausible but shallow answer differs from a slower, higher-quality chain of domain reasoning. That comparison would make the benchmark's value more visible than the percentages by themselves.