Code Kit v5.2 is out, optimized for Claude Opus 4.7.

Claude Opus 4.7 vs GPT-5.4: Agentic Coding Head-to-Head

Opus 4.7 beats GPT-5.4 on SWE-bench Pro by 6.6 points. Full comparison: coding, tools, vision, pricing, agentic reliability.

Stop configuring. Start shipping. Everything you're reading about and more.
Agentic Orchestration Kit for Claude Code.

Claude Opus 4.7 vs GPT-5.4 is the comparison that matters for anyone wiring agents into production code in April 2026. On SWE-bench Pro, the benchmark that best predicts real-world agentic coding performance, Opus 4.7 scores 64.3% to GPT-5.4's 57.7%. That 6.6-point gap translates into measurably fewer failed PRs, fewer tool errors, and fewer agents that quit halfway through a multi-file refactor. GPT-5.4 still wins a few categories, particularly agentic web research. Everything else tilts toward Opus 4.7.

TL;DR: Who Wins What

| Category | Winner | Margin |
|---|---|---|
| Agentic coding | Opus 4.7 | +6.6 points on SWE-bench Pro |
| Tool use at scale | Opus 4.7 | +9.2 points on MCP-Atlas |
| Computer use | Opus 4.7 | +3.0 points on OSWorld-Verified |
| Vision resolution | Opus 4.7 | 2,576px vs lower GPT-5.4 defaults |
| Raw reasoning | Opus 4.7 | +4.2 points on HLE (no tools) |
| Pricing transparency | Opus 4.7 | Flat $5/$25 vs tiered GPT-5.4 |
| Agentic web research | GPT-5.4 Pro | +10 points on BrowseComp |
| Tool-assisted search | GPT-5.4 Pro | +4.0 points on HLE (with tools) |
| Graduate reasoning | Effective tie | 94.2% vs 94.4% on GPQA Diamond |

The short version: if your workload is agentic coding, multi-tool orchestration, or computer use, Opus 4.7 is the correct default. If you are building a research agent that lives inside a browser, GPT-5.4 Pro still has the edge.

Release Context and the Mythos Caveat

Anthropic released Claude Opus 4.7 on April 16, 2026, positioning it as the most capable generally available model in their lineup. There is a caveat worth stating upfront. Opus 4.7 is not Anthropic's most capable model internally. That is Mythos Preview, which Anthropic is holding back for further alignment work and using as a testing ground for stricter safety frameworks. Mythos Preview posts a 93.9% on SWE-bench Verified and 77.8% on SWE-bench Pro per Anthropic's system card. Those numbers are not competing in the market yet.

GPT-5.4 shipped earlier in the year as OpenAI's frontier coding and reasoning model. VentureBeat framed the Opus 4.7 launch as "narrowly retaking the lead for most powerful generally available LLM," which matches what the benchmarks show. Opus 4.7 does not dominate. It wins more categories than it loses, and it wins them by meaningful margins in the workloads most developers care about.

Both models target different audiences in practice. Opus 4.7 is the default in Claude Code, Cursor, and a long list of developer tooling. GPT-5.4 anchors OpenAI's agent stack and the Microsoft ecosystem through Azure and Copilot products. The comparison below reflects how each model performs when you put them side by side on the same tasks.

Agentic Coding: SWE-bench Pro and Friends

Agentic coding is the workload that separates frontier models from lab curiosities. It tests whether a model can plan, execute, self-correct, and ship working code across multi-file codebases.

Opus 4.7 leads the field on every coding benchmark where direct comparison is possible:

| Benchmark | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro | Opus 4.6 |
|---|---|---|---|---|
| SWE-bench Pro | 64.3% | 57.7% | 54.2% | 53.4% |
| SWE-bench Verified | 87.6% | not published | 80.6% | 80.8% |
| Terminal-Bench 2.0 | 69.4% | 75.1%* | 68.5% | 65.4% |
| CursorBench (per Cursor) | 70% | not disclosed | -- | 58% |

All scores from Anthropic's Opus 4.7 system card unless noted. The Terminal-Bench 2.0 asterisk matters: GPT-5.4's 75.1% uses OpenAI's self-reported harness, which Vellum's benchmark breakdown flags as "not directly comparable" to Anthropic's Terminus-2 harness. That makes Terminal-Bench a directional signal, not a definitive one.

SWE-bench Pro is the number that matters most. It uses held-out repositories and memorization screens to catch models that have seen the test cases during training. Opus 4.7's 64.3% is the highest score any generally available model has posted on it. The +10.9-point jump from Opus 4.6 is the largest single-release coding gain in Anthropic's history.

The customer data backs this up. Cursor measured CursorBench moving from 58% on Opus 4.6 to 70% on Opus 4.7. Linear reported a 13% resolution lift on a 93-task internal coding benchmark. Factory saw "10% to 15% lift in task success" with fewer tool errors. Cognition, which builds Devin, described the model as working "coherently for hours, pushes through hard problems." Rakuten reported 3x more production tasks resolved compared to Opus 4.6 with double-digit gains in both code quality and test quality.

What these numbers translate to operationally: fewer aborted sessions, fewer PRs that fail review, fewer instances of agents giving up halfway through complex refactors. If your workload involves multi-file code changes, Opus 4.7 is the correct choice.

For GPT-5.4, the honest read is that it is a strong coding model that Opus 4.7 is now clearly ahead of on the hardest benchmarks. OpenAI's own Terminal-Bench score suggests GPT-5.4 still has advantages in command-line tasks, but the harness caveat prevents a clean head-to-head.

Tool Use: MCP-Atlas and Real Agent Work

Tool use is where agents live or die. Every production agent makes dozens of tool calls per task, and a model that picks the wrong tool or formats arguments badly burns tokens on recovery attempts.

| Benchmark | Opus 4.7 | GPT-5.4 | Opus 4.6 |
|---|---|---|---|
| MCP-Atlas | 77.3% | 68.1% | 62.7% |
| Finance Agent v1.1 | 64.4% | 61.5% | 60.1% |

The MCP-Atlas margin is the single largest gap in the entire Opus 4.7 vs GPT-5.4 comparison. 9.2 points on a benchmark built to stress multi-tool orchestration across MCP servers means Opus 4.7 picks the right tool more often, formats arguments more reliably, and recovers from tool errors with less churn.

Notion confirmed this pattern on their own infrastructure, reporting that Opus 4.7 achieved "+14% over Opus 4.6 at fewer tokens and a third of the tool errors." A third of the tool errors is not a small improvement. In a 50-step agentic loop, reducing tool errors by two-thirds compounds dramatically, which is why customers running long agentic traces see outsized gains.
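The compounding effect is easy to quantify. A minimal sketch using illustrative per-step error rates (not measured figures) shows why cutting tool errors to a third transforms long loops:

```python
# Probability that an agent completes a multi-step loop with zero tool errors,
# comparing a baseline per-step error rate against one third of that rate.
# The 6% baseline is an illustrative assumption, not a published figure.

def clean_run_probability(per_step_error_rate: float, steps: int = 50) -> float:
    """Chance that all `steps` tool calls succeed on the first try."""
    return (1.0 - per_step_error_rate) ** steps

baseline = clean_run_probability(0.06)        # ~4.5% of 50-step runs are error-free
reduced = clean_run_probability(0.06 / 3)     # ~36% of 50-step runs are error-free
print(f"baseline: {baseline:.3f}, reduced: {reduced:.3f}")
```

A per-step improvement that looks modest in isolation multiplies across every call in the trace, which is the mechanism behind the outsized customer-reported gains.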

The behavior change under the hood is that Opus 4.7 makes fewer tool calls by default, relying on reasoning instead. Raising the effort level increases tool usage. This is the opposite of the Opus 4.6 default, which leaned tool-heavy. For developers migrating prompts, the fix is usually to push effort to xhigh for tool-heavy workloads and let the model reason before acting.

GPT-5.4 remains strong at tool use but is a full 9.2 points behind on the benchmark built specifically to measure it. For teams wiring MCP servers into production agents, this gap is the single most important number to evaluate.

Computer Use: Vision Meets Action

Computer use is where the 2,576px vision upgrade translates into measurable task success.

| Benchmark | Opus 4.7 | GPT-5.4 | Opus 4.6 |
|---|---|---|---|
| OSWorld-Verified | 78.0% | 75.0% | 72.7% |

A 3-point lead on OSWorld-Verified is meaningful because the benchmark combines vision, planning, and tool execution in ways that mirror real desktop automation. The reason Opus 4.7 pulls ahead is mechanical. The model now accepts images at 2,576 pixels on the long edge (roughly 3.75 megapixels), more than three times what prior Claude models supported. Coordinates map 1:1 to pixels, so there is no scale-factor math required when directing clicks.
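The difference in click targeting can be sketched in a few lines. With 1:1 mapping, the coordinate the model reads off the screenshot is the coordinate you click; with a downscaled input, every coordinate must be rescaled back to native resolution, and rounding drift can land clicks off-target. The function names here are illustrative, not part of any SDK:

```python
def click_target_1to1(x: int, y: int) -> tuple[int, int]:
    # Opus 4.7-style: the screenshot is processed at native resolution,
    # so model coordinates map directly to screen pixels.
    return (x, y)

def click_target_scaled(x: int, y: int, native_width: int, model_width: int) -> tuple[int, int]:
    # Prior-generation style: the image was downscaled before the model saw it,
    # so coordinates must be scaled back up, introducing rounding error.
    scale = native_width / model_width
    return (round(x * scale), round(y * scale))

# A 2560px-wide screenshot downscaled to a 1568px model input:
print(click_target_scaled(600, 400, 2560, 1568))  # rescaled, with rounding drift
print(click_target_1to1(600, 400))                # exact, no conversion step
```

On dense UIs where a button is a dozen pixels wide, removing that conversion step is the difference between a click that lands and one that misses.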

What this means in practice: a screenshot of a dense dashboard now renders with enough fidelity that the model can read small labels and dropdown values it would have hallucinated on Opus 4.6. XBOW's visual-acuity benchmark, which tests fine detail reading, jumped from 54.5% on Opus 4.6 to 98.5% on Opus 4.7. That is not incremental. That is the first Claude model where computer-use agents can reliably read what they are looking at.

GPT-5.4 supports image inputs but does not publish directly comparable resolution limits in the same format. For teams building browser automation, UI testing harnesses, or any agent that needs to read dense visual content, Opus 4.7 is the model to start with.

Reasoning: The Benchmark That's Nearly Tied

Graduate-level reasoning is effectively a wash at the frontier.

| Benchmark | Opus 4.7 | GPT-5.4 Pro | Gemini 3.1 Pro | Mythos Preview |
|---|---|---|---|---|
| GPQA Diamond | 94.2% | 94.4% | 94.3% | 94.6% |
| HLE (no tools) | 46.9% | 42.7% | 44.4% | 56.8% |
| HLE (with tools) | 54.7% | 58.7% | 51.4% | 64.7% |

On GPQA Diamond, all three models sit within 0.4 percentage points. Vellum's analysis calls differences at this level "within noise." The story gets more interesting on HLE, Humanity's Last Exam. Without tools, Opus 4.7 leads GPT-5.4 Pro by 4.2 points, indicating stronger internal knowledge and reasoning before any tool access. With tools enabled, GPT-5.4 Pro takes the lead by 4.0 points, suggesting OpenAI's tool integration currently routes research queries more effectively.

The practical read: for pure reasoning tasks where the model works from context, Opus 4.7 has a small edge. For research workloads where tool access is standard, GPT-5.4 Pro currently executes the tool choreography better. The gap is small enough that either model works for most reasoning tasks. The shape of your agent architecture matters more than the model choice here.

Opus 4.7 also ships with adaptive thinking as the only thinking-on mode. Fixed thinking budgets are gone. The model decides how long to reason based on task complexity. This is closer to how GPT-5.4 Pro has handled reasoning for a while, so migrating prompts across models has become marginally easier.

Agentic Search and Browsing: Where GPT-5.4 Wins

| Benchmark | Opus 4.7 | GPT-5.4 Pro | Opus 4.6 |
|---|---|---|---|
| BrowseComp | 79.3% | 89.3% | 83.7% |

This is the one benchmark where Opus 4.7 not only loses but regresses from the prior Claude generation. BrowseComp tests agentic web research: the model has to navigate, read, cross-reference, and synthesize information across multiple pages. GPT-5.4 Pro's 89.3% is a 10-point lead over Opus 4.7 and a meaningful gap over every other model in the comparison.

If your workload is deep web research, competitor monitoring, or any agent that spends most of its time reading and synthesizing pages, GPT-5.4 Pro is currently the stronger choice. Anthropic has not publicly commented on why Opus 4.7 regressed on BrowseComp specifically. Possibilities include the stricter instruction-following behavior reducing exploratory browsing, or the updated tokenizer changing how long pages get processed. Either way, the measured gap is real.

For everything else that involves tool use and synthesis, Opus 4.7 pulls ahead. The BrowseComp result is the main data point that prevents the comparison from being a clean sweep.

Vision: The Generational Leap

Opus 4.7's vision upgrade is the most pronounced single-release capability jump in the comparison.

| Vision Spec | Opus 4.7 | Opus 4.6 |
|---|---|---|
| Max resolution | 2,576px long edge | 1,568px long edge |
| Max megapixels | 3.75 MP | 1.15 MP |
| XBOW Visual Acuity | 98.5% | 54.5% |
| CharXiv (with tools) | 91.0% | 84.7% |
| Coordinate mapping | 1:1 with pixels | scale-factor math |

Reading dense technical diagrams, chemical structures, architectural blueprints, or high-resolution code screenshots is now reliable. Solve Intelligence confirmed Opus 4.7 reads chemical structures correctly that prior models garbled. Databricks reported 21% fewer errors than Opus 4.6 on OfficeQA Pro, which benchmarks document understanding.

GPT-5.4 supports images, and OpenAI has not published resolution specs in a directly comparable format. For vision-heavy workloads, particularly computer use and document extraction, Opus 4.7 is the stronger default. The 1:1 coordinate mapping also removes a common failure mode where models with scale-factor conversion point their clicks at the wrong pixel.

The practical caveat: higher-resolution images consume more tokens. If fine detail is not necessary for your use case, downsample before sending. For workloads where detail matters (medical images, engineering diagrams, dense dashboards), the token cost is worth it.
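A small helper makes that downsampling decision concrete: compute the target dimensions that fit an image inside a given long-edge limit while preserving aspect ratio. The 2,576px figure is the published Opus 4.7 limit; the helper itself is just a sketch you would run before encoding the image:

```python
def fit_long_edge(width: int, height: int, max_long_edge: int = 2576) -> tuple[int, int]:
    """Return dimensions scaled so the longest edge is at most max_long_edge."""
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return (width, height)  # already within limits, no resampling needed
    scale = max_long_edge / long_edge
    return (round(width * scale), round(height * scale))

print(fit_long_edge(3840, 2160))  # 4K dashboard screenshot -> (2576, 1449)
print(fit_long_edge(1280, 720))   # already small enough -> (1280, 720)
```

Resize to these dimensions with your image library of choice before sending; anything larger just burns tokens without adding detail the model can use.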

Pricing and Availability

Pricing is where the comparison gets structurally different. Anthropic keeps it simple:

| Pricing | Opus 4.7 |
|---|---|
| Input | $5 per 1M tokens |
| Output | $25 per 1M tokens |
| Prompt caching | Up to 90% savings |
| Batch processing | 50% savings |
| US-only inference | 1.1x standard |

Pricing is flat across all Opus 4.7 deployments. The 1M token context is available at standard API pricing with no long-context premium.

GPT-5.4 Pro pricing is tiered with structural variation depending on tier and deployment. Anthropic publishes one price sheet. OpenAI publishes several. For procurement, this means you can estimate Opus 4.7 costs directly from token counts. GPT-5.4 estimates require cross-referencing tier documentation. Teams doing serious cost modeling tend to benchmark both on their actual workloads before committing.

The tokenizer asterisk. Opus 4.7 uses a new tokenizer that may map identical input to roughly 1.0 to 1.35x more tokens than Opus 4.6. For most content types the overhead is small. For multilingual content and heavily structured text, it can approach the 1.35x upper bound. If you are cost-sensitive, run /v1/messages/count_tokens on representative traffic before switching model defaults. This is the main reason existing Opus 4.6 users see higher bills on Opus 4.7 even though the per-token price is unchanged.
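A quick way to bound the budget impact before migrating: apply the flat $5/$25 rates and the 1.0 to 1.35x tokenizer range to token counts measured on your current traffic. The multiplier endpoints come from the range stated above; your real ratio should come from count_tokens on representative requests. A sketch:

```python
def monthly_cost_usd(input_tokens: int, output_tokens: int,
                     tokenizer_multiplier: float = 1.0) -> float:
    # Flat Opus 4.7 list pricing: $5 per 1M input tokens, $25 per 1M output.
    # tokenizer_multiplier models the 1.0-1.35x new-tokenizer overhead.
    in_cost = input_tokens * tokenizer_multiplier * 5 / 1_000_000
    out_cost = output_tokens * tokenizer_multiplier * 25 / 1_000_000
    return in_cost + out_cost

# Example: 200M input + 20M output tokens/month, measured on Opus 4.6:
best_case = monthly_cost_usd(200_000_000, 20_000_000, 1.0)    # $1,500
worst_case = monthly_cost_usd(200_000_000, 20_000_000, 1.35)  # $2,025
print(f"${best_case:,.0f} to ${worst_case:,.0f} per month")
```

The spread between the two endpoints is the budget risk the tokenizer change introduces; multilingual and heavily structured workloads should plan toward the upper bound.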

Platform availability favors Opus 4.7 for teams that want to standardize on a single deployment story:

| Platform | Opus 4.7 | GPT-5.4 |
|---|---|---|
| Native API | Yes | Yes |
| Amazon Bedrock | Yes | Limited |
| Google Vertex AI | Yes | No |
| Microsoft Foundry | Yes | Yes |
| GitHub Copilot | Yes (7.5x premium multiplier promo, April 30) | Yes |
| Claude Code | Default | No |
| Cursor | Default | Yes |
Opus 4.7's cross-cloud availability on Bedrock, Vertex, and Foundry on day one is meaningful for enterprise procurement. GitHub Copilot's 7.5x premium multiplier promo on Opus 4.7 runs until April 30, 2026, which makes it a cheap way to test the model inside existing Copilot subscriptions before committing to API spend.

Which Should You Use?

The decision matrix below reflects what each model does best. Pick based on your dominant workload.

| Use Case | Recommended Model | Why |
|---|---|---|
| Agentic coding in Claude Code or Cursor | Opus 4.7 | +6.6 SWE-bench Pro, +12 CursorBench lift |
| Multi-tool agents (MCP, function calling) | Opus 4.7 | +9.2 MCP-Atlas, 1/3 the tool errors of 4.6 |
| Computer use and UI automation | Opus 4.7 | +3.0 OSWorld, 2,576px vision, 1:1 coordinates |
| Document extraction and OCR-heavy pipelines | Opus 4.7 | 98.5% visual acuity, 3.75 MP image support |
| Legal and financial document review | Opus 4.7 | 90.9% BigLaw Bench, 64.4% Finance Agent v1.1 |
| Long-running autonomous sessions | Opus 4.7 | Works "coherently for hours" per Cognition |
| Agentic web research with heavy browsing | GPT-5.4 Pro | +10 points BrowseComp, material lead |
| Tool-assisted analyst or research queries | GPT-5.4 Pro | +4.0 HLE with tools, narrow but real |
| Graduate-level reasoning from context | Either | GPQA Diamond tied at 94.2-94.4% |
| Cost-sensitive high-volume generation | Neither (use Haiku or smaller tier) | Opus pricing doesn't fit |

If you are already running a GPT-5.4 pipeline, do not flip traffic wholesale. Run a structured pilot on your top three workloads and measure on your real data. Public benchmarks are directional. Your specific prompt shape and tool chain may weight the differences up or down.

If you are running Opus 4.6, the upgrade to 4.7 is straightforward for coding and agentic work. Plan for the tokenizer change by running a cost baseline on representative traffic, then update your default. For model selection strategy across Claude's full lineup, the same rules apply: Sonnet for daily coding, Opus for complex work, switch models mid-session when the task demands it.

For anyone starting fresh in April 2026 on greenfield agent work, Opus 4.7 is the correct default. Most developers working through Claude Code are already on it. The Opus 4.7 best practices guide covers the operational shifts: detailed plans, explicit agent specifications, and the xhigh effort default that the Code Kit ships with. If you are running a multi-agent pipeline and comparing Claude Code vs Cursor, both IDEs now default to Opus 4.7, so the model choice is less of a differentiator than it was with Opus 4.6.

For teams that want the full Opus 4.7 operational stack without building it from scratch, the Code Kit ships with the /team-plan to /build pipeline pre-tuned for 4.7's literal instruction following and reduced subagent spawning. The 18 specialist agents and mandatory plan-reading pattern were designed for exactly the kind of model that punishes vague prompts and rewards explicit scope.

FAQ

Is Opus 4.7 better than GPT-5.4 for coding?

Yes, on the benchmarks that measure real-world coding. Opus 4.7 leads SWE-bench Pro by 6.6 points (64.3% to 57.7%) and posts strong customer wins across Cursor, Linear, Factory, and Cognition. GPT-5.4 still wins Terminal-Bench by a reported 5.7 points, but that benchmark uses OpenAI's self-reported harness, which is not directly comparable to Anthropic's. For agentic coding workloads inside Claude Code, Cursor, or custom pipelines, Opus 4.7 is the correct default.

How does Opus 4.7 pricing compare to GPT-5.4?

Opus 4.7 runs $5 per million input tokens and $25 per million output tokens, flat, with 50% batch discounts and up to 90% prompt caching savings. GPT-5.4 Pro pricing is tiered and varies by deployment. The updated tokenizer in Opus 4.7 may produce up to 35% more tokens than Opus 4.6 on identical input, so real costs can creep up even though the per-token rate is unchanged. Run token counts on your actual traffic before committing to budgets.

What is the xhigh effort level?

Opus 4.7 introduced a new effort tier called xhigh that sits between high and max. Claude Code defaults to xhigh across all plans. It gives most of the reasoning depth of max without the full token cost. The effort ladder is now low, medium, high, xhigh, max.
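As a sketch, a request that pins the effort level explicitly. The `effort` field name and placement in the payload are assumptions based on the ladder described above, not a confirmed API shape; check the current Messages API reference before relying on it:

```python
# Hypothetical request builder; the "effort" field is an assumption,
# only the ladder values (low -> max) come from the article.
EFFORT_LADDER = ["low", "medium", "high", "xhigh", "max"]

def build_request(prompt: str, effort: str = "xhigh") -> dict:
    """Build a Messages-style payload with an explicit effort level."""
    if effort not in EFFORT_LADDER:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "model": "claude-opus-4-7",
        "effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor the payment module.", effort="xhigh")
print(req["effort"])
```

Validating the effort string client-side is cheap insurance against typos silently falling back to a default tier.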

Does GPT-5.4 beat Opus 4.7 on anything important?

Yes. GPT-5.4 Pro leads BrowseComp (agentic web research) by 10 points, 89.3% to 79.3%. It also has a narrow 4-point lead on HLE with tools enabled. For browsing-heavy research agents, GPT-5.4 is the stronger default. Everything else in the developer stack (coding, multi-tool use, computer use, vision) tilts to Opus 4.7.

How is Opus 4.7 vs Gemini 3.1 Pro for coding?

Opus 4.7 wins SWE-bench Pro by 10.1 points (64.3% to 54.2%) and SWE-bench Verified by 7.0 points (87.6% to 80.6%). Gemini 3.1 Pro offers a 2M token context window versus Opus 4.7's 1M, and undercuts on pricing at $2 input and $12 output per million tokens. For coding workloads, Opus 4.7 is ahead. For long-context retrieval or cost-sensitive classification, Gemini 3.1 Pro deserves evaluation.

What breaking changes ship with Opus 4.7?

Three API changes require migration on the Messages API (Claude Managed Agents users are unaffected). Sampling parameters (temperature, top_p, top_k) return a 400 error if set to non-default values. Extended thinking budgets (thinking.budget_tokens) return 400; adaptive thinking is the only thinking-on mode. Thinking display defaults to omitted, so thinking content is not returned unless you set display: "summarized". Task budgets are available in public beta via the task-budgets-2026-03-13 beta header, with a 20k token minimum.
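A defensive migration shim can strip the now-rejected parameters before sending a request. The parameter names below match the changes listed above; the helper itself is hypothetical, and the placement of the `display` field inside the thinking config is an assumption:

```python
# Parameters the article says Opus 4.7's Messages API rejects with a 400
# when set to non-default values.
REMOVED_PARAMS = ("temperature", "top_p", "top_k")

def migrate_request(payload: dict) -> dict:
    """Drop rejected sampling params and fixed thinking budgets (sketch)."""
    migrated = {k: v for k, v in payload.items() if k not in REMOVED_PARAMS}
    if isinstance(migrated.get("thinking"), dict):
        thinking = dict(migrated["thinking"])
        # Fixed budgets return 400; adaptive thinking is the only thinking-on mode.
        thinking.pop("budget_tokens", None)
        # Thinking content is omitted by default; opt in to summaries explicitly.
        thinking.setdefault("display", "summarized")
        migrated["thinking"] = thinking
    return migrated

old = {"model": "claude-opus-4-7", "temperature": 0.2,
       "thinking": {"budget_tokens": 8000}, "messages": []}
print(migrate_request(old))
```

Running requests through a shim like this during the pilot phase surfaces every 400-triggering parameter in your codebase before you cut traffic over.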
