AI evaluation notes · benchmark deep-dives · agent design experiments

All Articles

Every benchmark teardown, evaluation note, and agent-design article published on Eval Lab.

ARC-AGI-3: What Interactive Reasoning Benchmarks Change for Agent Design

ARC-AGI · AI Agents · Agentic Engineering · LangGraph · Benchmark Deep-Dive · Production AI

ARC-AGI-3 matters because it turns evaluation into an environment-navigation problem instead of a static puzzle. In my early experiments, the interesting part was not whether an agent could eventually solve a task, but how it observed state, formed hypotheses, recovered from bad actions, and kept its action count under control. This article breaks down what ARC-AGI-3 is testing, why action efficiency changes the engineering problem, and where stateful agent frameworks help or hurt.

Humanity's Last Exam: The Benchmark That Still Makes Top Models Struggle

Stress Test Benchmarks · Expert Knowledge · HLE · Frontier Models · Reasoning · Evaluation Stack · LLMs

Most AI benchmarks are no longer hard enough. Frontier models already score extremely well on many standard exams, which makes it difficult to tell whether they are truly reasoning or simply recognizing familiar patterns. Humanity's Last Exam (HLE) was built to fix that: a broad, expert-written benchmark designed to sit near the frontier of human knowledge. This article explains what HLE measures, what its questions feel like in practice, and why even the best current models still struggle on it.

The State of AI Evaluations: March 2026 Report

Standardized Shootouts · Benchmarks · Leaderboards · Evaluation Stack

AI evaluation in 2026 is less about who tops one leaderboard and more about which benchmarks still tell us something useful. Static academic tests are saturating, while newer evaluations such as Chatbot Arena, LiveCodeBench, SWE-bench, and ARC-AGI are doing a better job of exposing differences in preference, coding ability, agentic work, and adaptive reasoning. The common thread is simple: the most meaningful gains now come from better reasoning and better test-time use of compute, not just bigger base models.

The Wall Between Mimicry and Mind: What ARC-AGI-2 Tests and Why It Matters

Lab Notes · Benchmarks · ARC-AGI · Adaptive Reasoning

ARC-AGI-2 is one of the clearest attempts to test whether a model can adapt to a genuinely new problem instead of leaning on memorized patterns. The tasks look simple at first glance, but they are designed to punish shallow pattern-matching and reward flexible reasoning. This article explains what ARC-AGI-2 is actually testing, why it was needed, and what recent progress does and does not tell us.