Leaderboard

Top 10 AI models by quality.

Quality score uses HLE, SWE-Bench Pro, OpenRouter operational capability, and responsiveness from the published roster. For exact weights and coverage rules, see the Methodology page.

View

Quality

Leaderboard

Top 10 AI models by quality.

Quality score uses HLE, SWE-Bench Pro, OpenRouter operational capability, and responsiveness from the published roster. For exact weights and coverage rules, see the Methodology page.

View

Frontier watchlist

New frontier releases are under review

Claude Mythos Preview (release highly anticipated) and Gemini 3.5 Pro (release highly anticipated) are not yet included in the published roster.

Claude Mythos Preview and Gemini 3.5 Pro is on the watchlist because its release is highly anticipated; it will remain unranked until official availability and comparable benchmark evidence are published.

Before any model is ranked, we require comparable benchmark coverage from multiple accepted public sources, not just vendor-reported scores. We cross-check results across independent evaluation platforms and verify that each benchmark was run under consistent conditions before a score enters our methodology.

These models will be added to the leaderboard once that bar is met.

Claude Mythos PreviewGemini 3.5 Pro+4 more under review

Claude Mythos Preview: Release anticipated / limited-access only: Claude Mythos Preview is not generally available to the public, so it is excluded from the published roster.

Gemini 3.5 Pro: Frontier watchlist item only: official public Gemini 3.5 Pro release and comparable benchmark rows have not been verified yet.

Awaiting benchmarksPublication rules

Buyer-facing table

Ranked by quality score

TTFT and ARC-AGI-2 are withheld from this primary table until accepted coverage reaches 60% of the published roster. Available rows remain on model detail pages.

Published models ranked by quality score.
Rank	Model	Quality score	GPQA Diamond	HLE	Conversation	Context	Verdict
01	Claude Opus 4.8 Anthropic United States Compare with #2	100.0	n/a	49.8%	~392 chats 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict Claude Opus 4.8 Curated editorial verdict Full verdict Claude Opus 4.8 is under active editorial review. Current public ranking data is limited to accepted source/fact evidence for benchmarks, pricing, and context rather than AI-generated score changes.
02	Qwen3.7 Max Qwen China Compare with #1	69.2	92.4%	41.4%	Free tier 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict Qwen3.7 Max Curated editorial verdict Full verdict Qwen3.7 Max: A Specialist, Not a Generalist Released in May 2026, Alibaba’s Qwen3.7 Max is a formidable push into the proprietary frontier, trading casual versatility for elite performance in scientific reasoning, competitive math, and complex coding. Backed by a 1M-token context, blistering 206 t/s inference, and a highly competitive $2.50/M input price, it offers unmatched scale for heavy-lift pipelines. However, it demands careful architectural handling. Its notorious 22.9% "hallucination" rate is largely an artifact of epistemic humility—a 48% refusal rate on broad factual queries where the model simply says "I don't know." Furthermore, its deep-reasoning architecture makes it highly verbose, effectively tripling real-world token costs. Lacking vision capabilities and open weights, it still trails GPT-5.5 in raw reasoning headroom and Claude Opus 4.8 in coding ergonomics. The Bottom Line: Qwen3.7 Max is not a general-purpose chatbot. It is a high-octane reasoning engine built specifically for cost-constrained, multi-step agentic workflows. Route broad facts to lighter models, tame its verbosity with strict system prompting, and it will deliver frontier-class logic at a fraction of the cost.
03	GPT-5.5 OpenAI United States Compare with #1	66.1	93.6%	41.4%	~333 chats 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict GPT-5.5 Curated editorial verdict Full verdict OpenAI’s GPT-5.5 is the best all-round AI subscription for buyers who want one powerful assistant that can do far more than answer questions. It combines polished writing, sharp research, coding help, data analysis, document creation, tool use, and superb built-in image generation inside one familiar workspace. Its biggest upgrade over GPT-5.4 is follow-through: it is better at taking messy, multi-step requests and turning them into finished work with less hand-holding. For non-technical users who want the most complete, capable, and easy-to-live-with AI experience, GPT-5.5 is the safest premium choice. GPT-5.5 sits inside the best complete AI workspace: writing, research, coding help, data analysis, documents, spreadsheets, tool use, and now very strong built-in image generation. OpenAI’s own release notes describe GPT-5.5 as built to understand complex goals, use tools, check its work, and carry multi-step tasks through to completion. ChatGPT Images 2.0 also adds stronger image generation, better text rendering, multilingual support, flexible visual styles, and more advanced creative control.
04	Gemini 3.1 Pro Google United States Compare with #1	65.8	94.3%	44.4%	API pricing not published 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict Gemini 3.1 Pro Curated editorial verdictAI-assisted, editorially reviewed Full verdict Gemini 3.1 Pro is the ultimate all-in-one creative partner. It does more than chat; it builds. From generating cinematic video and studio-quality music to managing your life through seamless Google Workspace integration, it turns complex tasks into instant results. It is the fastest, most versatile tool for turning ideas into reality without needing a technical degree. True multimodality means it can create stunning video, professional images, and high-fidelity music in seconds. Its massive context window lets it remember entire books or long documents, so you do not have to repeat yourself. It works inside Gmail, Docs, and Drive to automate daily chores. It also delivers high-level reasoning and instant answers without the lag of older models. If you want an AI that acts as a creative studio, personal assistant, and expert researcher all in one subscription, Gemini 3.1 Pro is the gold standard.
05	Gemini 3.5 Flash Google United States Compare with #1	58.3	n/a	40.2%	~278 chats 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict Gemini 3.5 Flash Curated editorial verdict Full verdict Released May 19, 2026, Gemini 3.5 Flash is Google DeepMind's most capable speed-optimized model to date and its clearest push toward action-oriented intelligence. It is best suited to high-throughput multimodal pipelines, long-horizon agentic workflows, parallel tool orchestration, and rapid iterative coding cycles where latency matters. The real upgrade over Gemini 3.1 Pro is where the gains land: terminal-style coding on TerminalBench climbed nearly 6 points, agent orchestration on MCP Atlas jumped over 5 points, and mathematical evaluation on GDPval-AA soared similarly — all areas that matter for autonomous software agents, not casual chat. Its 1M token context window and 280 tokens-per-second inference speed are genuine differentiators at this tier, and its low base price offers frontier-level execution at a fraction of the cost of heavy flagship models. The caveats are worth knowing. The model's new "thought preservation" architecture carries intermediate reasoning context across multi-turn sessions by default, meaning it can become unusually verbose and run up real-world costs significantly higher than the baseline rate card implies. When pushed to "high" thinking effort for complex reasoning, it can rapidly drain token budgets. It also shows a tendency to prioritize raw speed over absolute precision under pressure; reviewers note that it frequently glides past fine-grained prompt constraints and introduces minor breaking bugs that require multiple iterative loops to debug. Furthermore, Computer Use is entirely unsupported at launch, and casual writing or basic conversational tasks are not meaningfully improved. OpenAI's GPT-5.5 still leads on raw reasoning headroom and Claude 4.7 Opus consistently demonstrates superior first-pass accuracy in real-world software engineering sessions. Bottom line: Gemini 3.5 Flash earns its place when your work involves rapid agentic loops, heavy multimodal data ingestion, or multi-step tool execution at scale. It is the wrong pick for strict, budget-capped legacy pipelines that cannot absorb the token overhead of persistent internal reasoning, or any high-stakes deployment where absolute precision on the first try is paramount.Released May 19, 2026, Gemini 3.5 Flash is Google DeepMind's most capable speed-optimized model to date and its clearest push toward action-oriented intelligence. It is best suited to high-throughput multimodal pipelines, long-horizon agentic workflows, parallel tool orchestration, and rapid iterative coding cycles where latency matters. The real upgrade over Gemini 3.1 Pro is where the gains land: terminal-style coding on TerminalBench climbed nearly 6 points, agent orchestration on MCP Atlas jumped over 5 points, and mathematical evaluation on GDPval-AA soared similarly — all areas that matter for autonomous software agents, not casual chat. Its 1M token context window and 280 tokens-per-second inference speed are genuine differentiators at this tier, and its low base price offers frontier-level execution at a fraction of the cost of heavy flagship models. The caveats are worth knowing. The model's new "thought preservation" architecture carries intermediate reasoning context across multi-turn sessions by default, meaning it can become unusually verbose and run up real-world costs significantly higher than the baseline rate card implies. When pushed to "high" thinking effort for complex reasoning, it can rapidly drain token budgets. It also shows a tendency to prioritize raw speed over absolute precision under pressure; reviewers note that it frequently glides past fine-grained prompt constraints and introduces minor breaking bugs that require multiple iterative loops to debug. Furthermore, Computer Use is entirely unsupported at launch, and casual writing or basic conversational tasks are not meaningfully improved. OpenAI's GPT-5.5 still leads on raw reasoning headroom and Claude 4.7 Opus consistently demonstrates superior first-pass accuracy in real-world software engineering sessions. Bottom line: Gemini 3.5 Flash earns its place when your work involves rapid agentic loops, heavy multimodal data ingestion, or multi-step tool execution at scale. It is the wrong pick for strict, budget-capped legacy pipelines that cannot absorb the token overhead of persistent internal reasoning, or any high-stakes deployment where absolute precision on the first try is paramount.
06	Claude Sonnet 4.6 Anthropic United States Compare with #1	56.2	89.9%	33.2%	~654 chats 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict Claude Sonnet 4.6 Curated editorial verdictAI-assisted, editorially reviewed Full verdict Claude Sonnet 4.6 is Anthropic's everyday AI model, released in February 2026, and the default for all free and standard subscribers. It approaches Opus-level intelligence at a price point that makes it practical for far more tasks Anthropic - making it the best value option in the Claude lineup. It handles writing, research, document analysis, and everyday questions with impressive accuracy and speed. It can hold entire codebases, lengthy contracts, or dozens of research papers in a single session Eesel AI, and reasons effectively across all of it. Early users report near human-level capability in tasks like navigating complex spreadsheets or filling out multi-step web forms. Anthropic Best suited for users who want a fast, reliable, and highly capable AI assistant for daily personal or professional use without needing the deepest reasoning that Opus offers
07	DeepSeek V4 Pro (Max) DeepSeek China Compare with #1	44.6	90.1%	33.5%	Free tier 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict DeepSeek V4 Pro (Max) Curated editorial verdictAI-assisted, editorially reviewed Full verdict Released April 2026, DeepSeek V4 Pro (Max) is a serious cost-efficiency challenger for buyers who care about frontier intelligence without frontier infrastructure costs. It competes with leading Western frontier models on complex reasoning, document analysis, and sustained multi-step work, while appearing to require far fewer processing resources for the level of capability delivered. Its strengths are broad versatility: long-context work that stays coherent, useful creative writing, strong coding benchmark evidence, and interactions that feel more thoughtful than formulaic. The caveats are still real: Western models may retain an edge on some narrow coding benchmarks, deeper web-search integration, and enterprise ecosystem maturity, and the low unit cost can encourage enough usage that teams should still watch total volume. Bottom line: DeepSeek V4 Pro (Max) is frontier-level capability at unusually aggressive economics. If you want one of the smartest models your money can buy, it belongs high on the shortlist.
08	Grok 4.3 xAI United States Compare with #1	35.4	90.1%	35.0%	~6,667 chats 3K tokens/chat	1M	Open leaderboard verdict Leaderboard verdict Grok 4.3 Curated editorial verdict Full verdict Grok 4.3 is xAI’s cost-optimized reasoning model released around early 2026. It delivers solid performance on complex logic, math, agentic workflows, and long-context tasks (1M tokens), with strong tool use and factual focus. Strengths: Significantly cheaper and more efficient than Grok 4, improved readability/formatting, and practical for high-volume or office-style automation. It prioritizes utility over raw benchmark dominance. Weaknesses: Trails frontier leaders like top Claude or GPT variants in peak precision coding, deep creativity, or the hardest reasoning benchmarks. Occasional inconsistency remains. Verdict: A pragmatic, affordable workhorse rather than the undisputed smartest model. Excellent value for everyday power users who need speed and scale over absolute cutting-edge performance. Solid evolution.
09	MiniMax M2.7 MiniMax China Compare with #1	31.3	n/a	28.1%	Free tier 3K tokens/chat	204.8K	Open leaderboard verdict Leaderboard verdict MiniMax M2.7 Curated editorial verdictAI-assisted, editorially reviewed Full verdict MiniMax M2.7 is included in the coding leaderboard on the strength of sourced SWE-Bench Pro evidence. The row should be read as coding evidence first; buyer claims beyond pricing, access, and benchmark provenance should stay conservative until refreshed from first-party sources.
10	Kimi K2.6 Moonshot AI China Compare with #1	17.2	n/a	18.2%	~333 chats 3K tokens/chat	262.1K	Open leaderboard verdict Leaderboard verdict Kimi K2.6 Curated editorial verdict Full verdict Released April 20, 2026, Kimi K2.6 is an open-source Moonshot AI model built for coding and autonomous task execution rather than general-purpose chat. Its best fit is teams that want near-flagship coding performance without flagship pricing. At $0.95 per million uncached input tokens and $4.00 per million output tokens, with cheaper cached input available, it gives cost-sensitive engineering teams a serious alternative to proprietary coding models. The tradeoff is polish: creative writing trails Claude and ChatGPT, English and Chinese are stronger than other languages, and response speed is slow compared with the fastest frontier options. It is also operated by a Chinese company under local data regulations, so government, defense, and heavily regulated teams should review compliance before sending sensitive work. Bottom line: Kimi K2.6 is a compelling Claude or GPT alternative for development work when cost efficiency matters more than raw polish.