Model Reviews · May 2026

Claude Sonnet 4.6 vs GPT-5: I ran 200 prompts. Here's what won.

An apples-to-apples comparison across coding, summarization, JSON output, and reasoning. Verdicts by task type, not by vibes. Same prompts, same temperature, same eval rubric.

TL;DR. Claude Sonnet 4.6 wins coding (better structure, fewer subtle bugs) and long-document analysis (1M context, stable recall). GPT-5 wins JSON-output reliability, raw speed, and cost (about 50% cheaper at a typical mix). Pick by task, not by team affinity.

The methodology

200 prompts across four categories (50 each): code generation, document summarization, structured output (JSON schema generation), and multi-step reasoning. Same prompt to both models, temperature=0.2, seed=42, single API call per prompt (no agentic loops). Outputs were graded against a strict rubric: factual accuracy, schema compliance, structural quality, and runtime correctness for code.

The two models compared:

  • Claude Sonnet 4.6. Anthropic's flagship tier. $3 input / $15 output per million tokens. 1,000,000-token context window. Excellent reasoning.
  • GPT-5. OpenAI's flagship. $1.25 input / $10 output per million tokens. 400,000-token context window. Excellent reasoning.

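For reference, here is roughly what each per-prompt call looked like. A minimal sketch, assuming the openai and @anthropic-ai/sdk Node packages; the model ID strings are placeholders, and note that seeding only applies on the OpenAI side (Anthropic's Messages API has no seed parameter).

```ts
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI();       // reads OPENAI_API_KEY
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

// One prompt, one call per model, temperature pinned low. Model IDs are
// placeholders -- substitute whatever identifiers your account exposes.
async function runPrompt(prompt: string): Promise<{ gpt5: string; sonnet: string }> {
  const gptRes = await openai.chat.completions.create({
    model: "gpt-5",                 // placeholder model ID
    messages: [{ role: "user", content: prompt }],
    temperature: 0.2,
    seed: 42,                       // best-effort determinism (OpenAI only)
  });

  const claudeRes = await anthropic.messages.create({
    model: "claude-sonnet-4-6",     // placeholder model ID
    max_tokens: 4096,
    temperature: 0.2,               // no seed parameter on this API
    messages: [{ role: "user", content: prompt }],
  });

  // Concatenate the text blocks from Claude's response.
  const sonnetText = claudeRes.content
    .filter((block) => block.type === "text")
    .map((block) => ("text" in block ? block.text : ""))
    .join("");

  return {
    gpt5: gptRes.choices[0].message.content ?? "",
    sonnet: sonnetText,
  };
}
```
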
§1. Coding (50 prompts)

Winner: Claude Sonnet 4.6. 86% pass rate on first run versus GPT-5's 78%. Both produce code that compiles. The difference shows up in structural decisions and subtle bugs.

Where Claude pulled ahead:

  • Edge cases. Asked to write a parseDate function, Claude handled the empty-string case unprompted. GPT-5 threw on empty input until I asked.
  • Type signatures. Claude defaulted to stricter TypeScript types: readonly arrays, const assertions, narrower union types. GPT-5 tended toward looser string | number defaults that compiled but allowed misuse (see the sketch after this list).
  • File-level structure. When the prompt asked for a multi-file solution, Claude organized imports and exports more cleanly. GPT-5 occasionally introduced circular imports and needed a second prompt to fix them.

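To make the type-signature point concrete, here is a stylized reconstruction (not verbatim model output) of the stricter shape Claude tended to produce versus the looser shape GPT-5 defaulted to:

```ts
// Stricter shape (typical of Claude's output): a const-asserted source of
// truth, a narrow union derived from it, and readonly inputs.
const LEVELS = ["debug", "info", "warn", "error"] as const;
type LogLevel = (typeof LEVELS)[number]; // "debug" | "info" | "warn" | "error"

function formatLine(level: LogLevel, parts: readonly string[]): string {
  return `[${level}] ${parts.join(" ")}`;
}

// Looser shape (typical of GPT-5's output): compiles, but allows misuse --
// formatLineLoose("verbose", [42]) passes the type checker here.
function formatLineLoose(level: string, parts: (string | number)[]): string {
  return `[${level}] ${parts.join(" ")}`;
}
```
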
Where GPT-5 pulled ahead: idiomatic Python (PEP 8 compliance was a hair tighter) and short scripts under roughly 30 lines, where its faster output and lower cost compounded across the test runs.

§2. Document summarization (50 prompts)

Winner: Claude Sonnet 4.6, narrowly. Both models produce factually accurate summaries. The gap is in structural quality and recall on long documents.

For documents under 50K tokens, the two were a wash. For documents in the 100K to 500K range, Claude's 1,000,000-token context held recall noticeably better than GPT-5's 400,000-token context. On a 350K-token legal document set, Claude correctly cited 18 of 20 buried facts. GPT-5 caught 14.

One important caveat. Both models hit lost-in-the-middle effects past around 100K tokens. The serious answer for very long documents is RAG against a smaller-context model, not raw large-context summarization. Use either model only when cross-document reasoning genuinely requires it.

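If you do go the RAG route, the core loop is small. A minimal sketch, assuming OpenAI's embeddings endpoint and a naive in-memory cosine-similarity search; the chunk size, model ID, and top-k value are illustrative defaults, not something I benchmarked:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Split the document into fixed-size character chunks. Naive on purpose:
// real pipelines split on headings/paragraphs and track source offsets.
function chunk(text: string, size = 4000): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += size) out.push(text.slice(i, i + size));
  return out;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Embed the question and every chunk, keep the top-k chunks by similarity,
// and send only those to the summarization model. For very large documents,
// batch the embedding calls -- the endpoint caps input size per request.
async function topChunks(doc: string, question: string, k = 8): Promise<string[]> {
  const chunks = chunk(doc);
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [question, ...chunks],
  });
  const [q, ...vecs] = res.data.map((d) => d.embedding);
  return vecs
    .map((v, i) => ({ i, score: cosine(q, v) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ i }) => chunks[i]);
}
```
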
§3. Structured output (50 prompts)

Winner: GPT-5. 96% schema compliance versus Claude's 88%. This was the most clear-cut category.

GPT-5's structured-output mode (the response_format: json_schema parameter) hard-constrains the output to match a supplied JSON Schema (or a Zod schema via the SDK helper). Claude's tool-use-as-output approach achieves the same intent, but in my runs it occasionally produced hallucinated keys, missing nullable fields, and string/number type ambiguities.

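Here is roughly what the two mechanisms look like side by side. A minimal sketch with a hypothetical invoice schema; the field names and model IDs are illustrative:

```ts
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical extraction schema shared by both calls.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    total_cents: { type: "integer" },
    due_date: { type: ["string", "null"] }, // nullable field, spelled out
  },
  required: ["vendor", "total_cents", "due_date"],
  additionalProperties: false,
} as const;

// GPT-5: constrained decoding against the schema itself.
async function extractWithOpenAI(text: string) {
  const openai = new OpenAI();
  const res = await openai.chat.completions.create({
    model: "gpt-5", // placeholder model ID
    messages: [{ role: "user", content: `Extract the invoice fields:\n${text}` }],
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", strict: true, schema: invoiceSchema },
    },
  });
  return JSON.parse(res.choices[0].message.content ?? "{}");
}

// Claude: force a single tool call whose input_schema is the same schema,
// then read the structured arguments off the tool_use block.
async function extractWithClaude(text: string) {
  const anthropic = new Anthropic();
  const res = await anthropic.messages.create({
    model: "claude-sonnet-4-6", // placeholder model ID
    max_tokens: 1024,
    tools: [{
      name: "record_invoice",
      description: "Record the extracted invoice fields.",
      input_schema: invoiceSchema,
    }],
    tool_choice: { type: "tool", name: "record_invoice" },
    messages: [{ role: "user", content: `Extract the invoice fields:\n${text}` }],
  });
  const call = res.content.find((b) => b.type === "tool_use");
  return call && "input" in call ? call.input : {};
}
```
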
For production data extraction, classification, or any pipeline that hands off to typed downstream code, GPT-5 is the safer default. The eight-point gap translates to meaningfully fewer downstream parse errors.


§4. Multi-step reasoning (50 prompts)

Winner: Claude Sonnet 4.6, by a hair. 82% correct versus GPT-5's 78%. Both flagships sit at the Excellent reasoning tier in our rubric, and the gap is well within the margin of error.

Where the gap shows up: Claude was more willing to walk through reasoning steps before answering, even without a chain-of-thought prompt. GPT-5 frequently jumped to an answer and was slightly more likely to commit to a wrong conclusion confidently. Adding "think step by step" to the GPT-5 prompts closed the gap.

For genuinely hard reasoning (competitive math, multi-hop logic, complex code review), neither flagship is the right default. Step up to Claude Opus 4.7, GPT-5.5, or OpenAI's o3 or o3-pro reasoning models. The cost premium is worth it on tasks where accuracy carries downstream consequences.

Cost: GPT-5 wins by a meaningful margin

GPT-5 lists at $1.25/$10 per million tokens. Claude Sonnet 4.6 lists at $3/$15. Both offer prompt caching at roughly 10% of input price.

On a typical workload (1,000 input plus 500 output tokens per request, 1 million requests per month):

  • GPT-5: $1.25 per million times 1B input tokens equals $1,250, plus $10 per million times 500M output tokens equals $5,000, so $6,250 a month.
  • Claude Sonnet 4.6: $3.00 per million times 1B input tokens equals $3,000, plus $15 per million times 500M output tokens equals $7,500, so $10,500 a month. Even at this output-heavy mix GPT-5 comes out roughly 40% cheaper; at the more typical 100-token-output mix, the gap widens to roughly 50%.

For most production AI features, output is shorter than this example, which only widens GPT-5's cost advantage. Run your specific token mix through the cost calculator (or the sketch below) to see exactly where your workload lands.

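If you want to sanity-check your own mix without the calculator, the arithmetic fits in a few lines. A sketch using the list prices quoted above:

```ts
// $ per million tokens, list prices quoted in this article (no caching discount).
const PRICES = {
  "gpt-5":             { input: 1.25, output: 10 },
  "claude-sonnet-4.6": { input: 3.0,  output: 15 },
} as const;

type Model = keyof typeof PRICES;

function monthlyCost(model: Model, inputTok: number, outputTok: number, requests: number): number {
  const p = PRICES[model];
  return (requests * (inputTok * p.input + outputTok * p.output)) / 1_000_000;
}

// The article's example mix: 1,000 input + 500 output tokens, 1M requests/month.
console.log(monthlyCost("gpt-5", 1_000, 500, 1_000_000));             // 6250
console.log(monthlyCost("claude-sonnet-4.6", 1_000, 500, 1_000_000)); // 10500

// The shorter, more typical 100-token-output mix.
console.log(monthlyCost("gpt-5", 1_000, 100, 1_000_000));             // 2250
console.log(monthlyCost("claude-sonnet-4.6", 1_000, 100, 1_000_000)); // 4500
```
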
The decision tree

Walk this list. Take the first "yes":

  1. Production code generation? Pick Claude Sonnet 4.6.
  2. Strict JSON output or typed pipelines? Pick GPT-5.
  3. Long documents (100K+ tokens)? Pick Claude Sonnet 4.6, or Gemini 2.5 Pro for 1M+.
  4. Cost-sensitive at an output-heavy mix? Reconsider both. They're expensive. Look at GPT-5 Mini or Claude Haiku 4.5.
  5. Hard reasoning that matters? Step up to Claude Opus 4.7, GPT-5.5, or o3-pro.
  6. Default for everything else? GPT-5 (cost) or Claude Sonnet 4.6 (quality), depending on your priority.

What this comparison can't tell you

Two warnings before applying this to your workload. First, a 200-prompt eval is small. For mission-critical decisions, build an eval set of 1,000+ examples drawn from your actual production traffic and run it weekly as new model versions ship. Second, my rubric weighted code quality heavily. If your workload is creative writing, customer support, or content moderation, the rankings will shift.

The point isn't the specific verdict. It's the discipline: pick the model per task, not per team affinity. The teams I see overpaying for AI in 2026 are the ones who locked in on a single provider relationship in 2024 and never revisited.
