Benchmark Spotlight

Elo (text arena)
Claude Opus 4.6 (T): 1502
Claude Opus 4.6: 1501
Gemini 3.1 Pro: 1493
Grok 4.20: 1492
Gemini 3 Pro: 1486
Approx. reference data; verify at source.
Human Preference

Chatbot Arena

Elo Ranking · Pairwise · Human Judges
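To make the small rating gaps on this card easier to read, here is a minimal sketch that turns two Elo ratings into an expected head-to-head win rate. It assumes the classic logistic Elo formula with a 400-point scale; the leaderboard's exact rating method is not stated here, so the function name `elo_expected_score` and the `scale` parameter are illustrative assumptions, not the arena's implementation.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the classic logistic Elo model.

    Assumption: a 400-point scale, as in standard Elo. The leaderboard may
    fit its ratings differently.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the card above (approximate reference data).
top_two = elo_expected_score(1502, 1501)      # first place vs. second place
first_vs_fifth = elo_expected_score(1502, 1486)  # first place vs. fifth place

print(f"1502 vs 1501: {top_two:.3f}")
print(f"1502 vs 1486: {first_vs_fifth:.3f}")
```

Under these assumptions, a 1-point gap is close to a coin flip and a 16-point gap gives the higher-rated model only a slight edge, so the ordering among the top entries should be read loosely.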
% correct (ARC-AGI-2)
Gemini 3 DeepThink: 84.6%
GPT-5.4 Pro: 83.3%
Gemini 3.1 Pro: 77.1%
GPT-5.4: 74%
Claude Opus 4.6: 69.2%
Approx. reference data; verify at source.
Reasoning

ARC-AGI 2

Fluid Intelligence · Visual Puzzles · Novel Rules
% correct
Gemini 3 Pro: 38.3%
GPT-5: 25.3%
Grok 4: 24.5%
Gemini 2.5 Pro: 21.6%
Claude 4.5: 13.7%
Approx. reference data; verify at source.
Expert Knowledge

Humanity's Last Exam

Multi-Domain · Expert-Level · Closed-Ended
% resolved
Claude 4.5 Opus: 76.8%
Gemini 3 Flash: 75.8%
MiniMax M2.5: 75.8%
Claude Opus 4.6: 75.6%
GPT-5.2 Codex: 72.8%
Approx. reference data; verify at source.
Real Repos · GitHub Issues · Agentic Coding
% pass@1 (v6, through Apr 2025)
O4-Mini (High): 80.2%
O3 (High): 75.8%
O4-Mini (Med): 74.2%
Gemini-2.5-Pro-0605: 73.6%
DeepSeek-R1-0528: 73.1%
Approx. reference data; verify at source.
Coding

LiveCodeBench

Anti-Contamination · Live Data · Competitive Programming
% correct (avg, test)
OPS-Agentic Search: 92.4%
openJiuwen (GPT5): 91.7%
Lemon (GPT5+o3): 91.4%
Mimir v0.9 (GPT): 84.7%
HALO V1126: 82.1%
Approx. reference data; verify at source.
Agentic

GAIA

Tool Use · Multi-Step · Web Search