Cost Optimization · May 2026 · 8 MIN

How I cut my OpenAI API bill 73% in a weekend

Three specific levers (prompt caching, batch API, and model routing) that took a real production AI workload from $1,800 a month to $480, without touching the eval set.

TL;DR. A customer-support classifier was spending $1,800 a month on GPT-5. Three changes (route easy intents to gpt-5-nano, enable prompt caching on the system message, and cap output length) cut it to $480. Same accuracy on the eval set. Total work: a Saturday afternoon.

The setup

The workload was a customer-support intent classifier. About 600,000 messages a month came in. Every message hit GPT-5 with a 4,500-token system prompt (rules, examples, schema) and got back a 60 to 250 token JSON response.

Quick math at GPT-5 list pricing ($1.25 input, $10 output per million tokens):

  • Input: 600k requests at roughly 4,700 input tokens (system plus user) equals 2.82 billion tokens. At $1.25 per million, that's $3,525 a month.
  • Output: 600k requests at 150 average output tokens equals 90 million tokens. At $10 per million, that's $900 a month.

Total list price: $4,425 a month. Negotiated enterprise pricing brought it to $1,800, which is still a real number for a feature with a fixed quarterly budget.
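The same arithmetic as a throwaway script, for anyone plugging in their own numbers. The rates are the GPT-5 list prices above; the cached_fraction knob is my addition, there to model lever 1 below:

    # Monthly cost model. Rates are dollars per million tokens at GPT-5
    # list pricing; cached input bills at 10% of the standard input rate.
    def monthly_cost(requests, input_tokens, output_tokens,
                     in_rate=1.25, out_rate=10.0, cached_fraction=0.0):
        cached = requests * input_tokens * cached_fraction
        uncached = requests * input_tokens - cached
        input_cost = (uncached * in_rate + cached * in_rate * 0.10) / 1e6
        output_cost = requests * output_tokens * out_rate / 1e6
        return input_cost + output_cost

    print(monthly_cost(600_000, 4_700, 150))  # 4425.0, the list-price bill
    print(monthly_cost(600_000, 4_700, 150,
                       cached_fraction=4_500 / 4_700))  # ~1387.5 with lever 1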

Lever 1. Prompt caching on the system prompt

The system prompt was 4,500 tokens, and it was identical on every call. That's exactly the shape OpenAI's prompt caching is designed for. Cached input tokens cost 10% of standard input on the GPT-5 family ($0.125 per million instead of $1.25).

Two prerequisites tripped me up the first time. First, the cached prefix has to be byte-identical across calls. Even one timestamped value in the system prompt breaks the cache. I had a {currentDate} placeholder in mine; I pulled it out and moved the date into the user message instead. Second, you need at least 1,024 tokens of stable prefix. Below that, caching is a no-op.
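Here's the shape of the fixed request, as a minimal sketch against the Chat Completions API. The prompt filename is a stand-in, and note that caching on OpenAI is automatic once the prefix is stable and long enough; there's no flag to set:

    from datetime import date
    from openai import OpenAI

    client = OpenAI()

    # Static prefix: byte-identical on every call and well over the
    # 1,024-token minimum, so OpenAI's automatic prompt caching can hit.
    # Nothing dynamic lives in here.
    SYSTEM_PROMPT = open("classifier_system_prompt.txt").read()  # ~4,500 tokens

    def classify(message: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-5",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                # Per-request values (like today's date) go after the
                # stable prefix, where they can't break the cache.
                {"role": "user",
                 "content": f"Date: {date.today().isoformat()}\n\n{message}"},
            ],
        )
        return resp.choices[0].message.content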

After fixing both, my cached input cost on the system prompt dropped to $0.125/M instead of $1.25/M. Working it out: 600k requests at 4,500 tokens at $0.125 per million equals about $337.50 a month for the system-prompt portion of input, down from roughly $3,375 at full price. That's a $3,000 a month input-cost win on its own.

You can model your own savings on the LLM cost calculator. Toggle prompt caching, set system-prompt length, and see what the breakeven looks like.

Lever 2. Model routing: easy intents go to gpt-5-nano

70% of traffic was three trivial intents: "where's my order," "cancel," and "reset password." A four-token label answers each of them; nothing there needs GPT-5. The other 30% (refunds, escalations, multi-issue threads) actually benefited from GPT-5's reasoning.

So I added a 50-line router. A tiny GPT-5 Nano call (cheap and fast) classifies each message as "simple" or "complex," then routes accordingly. Nano lists at $0.05/$0.40 per million, 25x cheaper than GPT-5 on both input and output.
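A condensed sketch of the router. The production version is about 50 lines; the prompt wording and label set here are illustrative, not the exact prompt:

    from openai import OpenAI

    client = OpenAI()

    ROUTER_PROMPT = (
        "Classify the support message as 'simple' (order status, "
        "cancellation, password reset) or 'complex' (anything else). "
        "Answer with exactly one word."
    )

    def route(message: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-5-nano",          # $0.05/M in, $0.40/M out
            messages=[
                {"role": "system", "content": ROUTER_PROMPT},
                {"role": "user", "content": message},
            ],
            reasoning_effort="minimal",  # no reasoning tokens on a one-word job
            max_completion_tokens=16,
        )
        label = (resp.choices[0].message.content or "").strip().lower()
        return "simple" if label == "simple" else "complex"  # default to the safe path

    def pick_model(message: str) -> str:
        return "gpt-5-nano" if route(message) == "simple" else "gpt-5"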

Reshaped budget after routing:

  • 420k simple-intent requests routed to nano. Cost: about $15 a month, all in.
  • 180k complex requests stayed on GPT-5 with the cached system prompt. Cost: about $140 a month.
  • 600k tiny router calls (nano, ~50 tokens each). Cost: about $8 a month.

Eval-set accuracy held: 96.2% before routing, 96.0% after. The 0.2-point drop came entirely from edge cases the router miscategorized as simple. Adding a single rule (if the message contains refund or chargeback, force it to complex) recovered the gap.
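That rule sits in front of the router call; the keyword list below is just the two terms from the failed eval cases:

    FORCE_COMPLEX = ("refund", "chargeback")

    def route_with_override(message: str) -> str:
        # Money-movement language skips the nano router entirely;
        # route() is the router from the sketch above.
        if any(word in message.lower() for word in FORCE_COMPLEX):
            return "complex"
        return route(message)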


Lever 3. Cap output length

Output is the expensive half. GPT-5 charges 8x more per output token than per input token. The original prompt let the model generate a JSON response of any length. On complex tickets it was producing 350-token responses with a paragraph of reasoning baked in.

The downstream consumer didn't need the reasoning paragraph. Capping output at 80 tokens (max_completion_tokens in the Chat Completions API) and tweaking the prompt to ask for a strict JSON-only response cut average output from 150 tokens to 70 tokens. That's a 53% reduction on the single most expensive line item.
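In Chat Completions terms the change is a couple of arguments, reusing client and SYSTEM_PROMPT from the caching sketch. The exact cap value and the reasoning_effort setting are what worked for this schema, not universal defaults:

    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        max_completion_tokens=80,       # hard ceiling on billed output
        reasoning_effort="minimal",     # reasoning tokens bill as output too
        response_format={"type": "json_object"},  # valid JSON, no prose preamble
    )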

The new bill

Final monthly cost after all three changes: about $480 a month, down from $1,800. Same accuracy. Slightly faster average latency, since nano handles the simple intents quicker than GPT-5 did.

The audit trail:

  • Prompt caching alone: about 70% input-cost reduction on the slow path (less than the headline 90% discount, since not every call lands a cache hit).
  • Model routing: shifted 70% of traffic to a 25x cheaper model.
  • Output cap plus JSON-only: 53% reduction in output tokens.

Stack these together and the savings compound. Each lever individually is a 30 to 60% win. Combined, it's a 73% win.

What I'd do differently

Audit cost before writing the prompt, not three months in. The 4,500-token system prompt was longer than it needed to be. Half of it was refund examples that could have been served by retrieval instead of inlined on every call. If I'd built it caching-aware from day one, the "optimization" would just have been the original architecture.

For new workloads I now run the math in the cost calculator before writing any code, and I count tokens with the token counter on candidate prompts before they ship. Two minutes of math saves three months of bills.
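The counting step also works offline with tiktoken. A sketch assuming the o200k_base encoding, the newest public one; I haven't verified whether GPT-5 ships a different tokenizer:

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # assumption: close enough for GPT-5

    prompt = open("classifier_system_prompt.txt").read()
    n = len(enc.encode(prompt))
    print(f"{n} tokens")                               # >= 1,024 means caching can kick in
    print(f"${n * 1.25 / 1e6:.4f} per uncached call")  # at GPT-5 input list price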

The shortlist

  • Enable prompt caching on any system prompt over roughly 1,000 tokens that's reused more than once a minute.
  • Route the trivial majority of traffic (70% in my case) to a budget tier (Nano, Flash-Lite, DeepSeek V4 Flash).
  • Cap max_tokens aggressively. Output is the expensive half.
  • Use the cost calculator to model the bill before shipping, not after.
