A client came to us with a problem that is becoming increasingly common. They had adopted AI-assisted development across their engineering team — tools like GitHub Copilot, ChatGPT, Claude, and various API integrations — and the monthly bill had ballooned to over $8,000. Their developers were running every task through the most expensive model available, whether it was architecting a new microservice or adding a CSS class.
We restructured their AI workflow and cut that bill to under $1,600. Same output. Same quality. Here is how.
The Problem: One Model for Everything
Most AI coding tools default to sending your entire conversation history with every message. Every file the AI has read, every code block it has written, every back-and-forth exchange. A long session working on a feature can easily hit hundreds of thousands of tokens.
When you run all of that through a premium model, the costs add up fast. Here is what the major providers charge across model tiers in 2026:
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 |
| OpenAI | o1 | $15.00 | $60.00 |
| Anthropic | Claude Opus | $15.00 | $75.00 |
| Anthropic | Claude Sonnet | $3.00 | $15.00 |
| Google | Gemini 2.5 Pro | $1.25 - $2.50 | $10.00 - $15.00 |
The price difference between tiers is 3x to 10x depending on the provider. Our client's team was running everything through GPT-4o and Claude Opus because those were the defaults. Nobody had asked whether every task actually needed the most capable model.
The answer, it turns out, is no. Not even close.
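To make the tier gap concrete, here is a quick cost sketch using the Anthropic rates from the table above. The session sizes are hypothetical and the helper function is illustrative, not part of any SDK:

```python
# Cost of one hypothetical session at each tier, using the per-1M-token
# rates from the table above. Rates change; check current pricing.

def session_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of a session given per-1M-token input/output rates."""
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# A heavy session: 400k input tokens (context resent across messages),
# 20k output tokens of generated code.
opus_cost = session_cost(400_000, 20_000, 15.00, 75.00)   # Claude Opus
sonnet_cost = session_cost(400_000, 20_000, 3.00, 15.00)  # Claude Sonnet

print(f"Opus: ${opus_cost:.2f}, Sonnet: ${sonnet_cost:.2f}")  # $7.50 vs $1.50
```

At these rates the same session is five times cheaper on the mid-tier model, before any caching or context trimming.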
The Insight: 80% of AI Coding Work Is Execution, Not Reasoning
We audited a month of their AI usage and categorized every interaction. The breakdown was clear:
- ~20% was complex reasoning — architecture decisions, debugging subtle issues, designing data models, evaluating tradeoffs between approaches
- ~80% was straightforward execution — file edits, code generation from clear requirements, writing tests, refactoring, boilerplate, git operations, formatting
The execution work does not need a $75-per-million-output-token model. A mid-tier model follows detailed instructions just as well as a premium one. It just cannot originate those instructions as effectively from a vague conversation. That distinction is the key to the entire optimization.
The Fix: Two Models, Two Jobs
We restructured their workflow around a simple principle: use the expensive model to think, use the cheap model to build.
Every AI provider has a model tier that is excellent at following detailed instructions but costs a fraction of the premium tier. OpenAI has GPT-4o Mini. Anthropic has Claude Sonnet and Haiku. Google has Gemini Flash. These models are not inferior — they are optimized for different workloads.
The restructured workflow looks like this:
Step 1: Default to the Mid-Tier Model
Every coding session starts on the cheaper model. If the task is straightforward — add a field, fix a CSS issue, write a unit test, refactor a function — it stays on the cheaper model the entire time. Most tasks are straightforward.
Step 2: Escalate to Premium for Strategy
When the team needs to think through something complex — designing a new API, debugging a race condition, planning a database migration — they switch to the premium model. They go back and forth until the approach is solid. This might take 5 to 10 messages.
Step 3: Have the Premium Model Write a Detailed Spec
This is the critical step. Before switching back to the cheaper model, the premium model writes a detailed implementation specification. Exact files to create or modify, data structures, function signatures, edge cases to handle, validation steps. Every decision is made. Nothing is left ambiguous.
Step 4: Hand the Spec to the Mid-Tier Model
The cheaper model reads the spec and executes it. It creates files, writes code, runs tests. Because the spec is detailed and unambiguous, the mid-tier model executes it cleanly — often producing output identical to what the premium model would have generated, at roughly a fifth of the cost.
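The four steps can be sketched as a small routing function. Everything here is a placeholder: `call_model` stands in for a real SDK call, and the model names are illustrative rather than actual API identifiers.

```python
# Sketch of the two-model workflow. Replace call_model with a real
# API call (OpenAI, Anthropic, Gemini, ...); names are placeholders.

STRATEGY_MODEL = "premium-model"    # e.g. Claude Opus, o1
EXECUTION_MODEL = "mid-tier-model"  # e.g. GPT-4o Mini, Haiku, Flash

def call_model(model, prompt):
    # Stand-in that echoes which model handled which prompt.
    return f"[{model}] response to: {prompt[:40]}"

def run_task(task, needs_design=False):
    """Route a task: escalate to the strategy model only when the
    approach itself is unclear; otherwise execute directly."""
    if needs_design:
        # Steps 2-3: think it through, then write the detailed spec.
        spec = call_model(STRATEGY_MODEL,
                          f"Write an implementation spec for: {task}")
        # Step 4: the cheaper model executes the unambiguous spec.
        return call_model(EXECUTION_MODEL,
                          f"Implement this spec exactly:\n{spec}")
    # Step 1: straightforward work stays on the mid-tier model.
    return call_model(EXECUTION_MODEL, task)
```

The `needs_design` flag is the human judgment call from Step 2; everything else is mechanical.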
The Spec Is What Makes It Work
A vague spec produces vague results regardless of the model. The quality of the handoff document determines the quality of the output. We helped the client standardize on a template:
```
# Task: [Short Title]

## Goal
[1-2 sentences. What should be different when done.]

## Context
[Relevant background. Why this change is needed.]

## Files to Modify
- path/to/file.tsx — [what changes and why]

## Files to Create
- path/to/new-file.tsx — [what it does]

## Detailed Spec
[Exact implementation details. Data structures,
function names, return types, error handling.
Every decision made. Nothing left ambiguous.]

## Validation
- [ ] Tests pass
- [ ] Build completes without errors
- [ ] [specific acceptance criteria]
```
The key is the Detailed Spec section. If the premium model does its job, that section contains enough information that any competent developer — or any competent AI model — could implement it without making judgment calls. That is what makes the cheaper model work just as well for execution.
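To keep specs consistent across a team, the template can also be generated programmatically. This `Spec` dataclass is a hypothetical helper for illustration, not a tool the client actually used:

```python
from dataclasses import dataclass

@dataclass
class Spec:
    """In-memory form of the handoff template above (hypothetical helper)."""
    title: str
    goal: str
    context: str
    files_to_modify: list   # (path, what changes and why) pairs
    files_to_create: list   # (path, what it does) pairs
    detailed_spec: str
    validation: list        # acceptance criteria

    def render(self) -> str:
        """Render the spec as the markdown handoff document."""
        lines = [f"# Task: {self.title}", "## Goal", self.goal,
                 "## Context", self.context, "## Files to Modify"]
        lines += [f"- {path} — {why}" for path, why in self.files_to_modify]
        lines.append("## Files to Create")
        lines += [f"- {path} — {what}" for path, what in self.files_to_create]
        lines += ["## Detailed Spec", self.detailed_spec, "## Validation"]
        lines += [f"- [ ] {item}" for item in self.validation]
        return "\n".join(lines)
```

A rendered spec is then pasted (or piped) into the execution model's first message as-is.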
The Math
Here is what the client's monthly AI spend looked like before and after the change:
Before (all tasks on premium models):
| Usage | Tokens (M) | Monthly Cost |
|---|---|---|
| Premium models (all work) | ~150M | ~$8,200 |
After (20/80 split between premium and mid-tier):
| Usage | Tokens (M) | Monthly Cost |
|---|---|---|
| Premium models (strategy only) | ~30M | ~$1,100 |
| Mid-tier models (execution) | ~120M | ~$480 |
| Total | ~150M | ~$1,580 |
Same total token volume. Same development velocity. 80% cost reduction.
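The headline percentage follows directly from the table values:

```python
# Reproducing the before/after totals from the tables above.
before = 8_200          # all ~150M tokens on premium models
after = 1_100 + 480     # premium strategy work + mid-tier execution

savings = 1 - after / before
print(f"${after}/month, {savings:.0%} reduction")  # $1580/month, 81% reduction
```

Just over 80% when computed exactly, which is the rounded figure used throughout.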
This Applies to Every AI Provider
The two-model workflow is provider-agnostic. The specific models do not matter. What matters is the pattern:
| Provider | Strategy Model | Execution Model | Approx. Savings |
|---|---|---|---|
| OpenAI | GPT-4o / o1 | GPT-4o Mini | 60-85% |
| Anthropic | Claude Opus | Claude Sonnet / Haiku | 70-85% |
| Google | Gemini 2.5 Pro | Gemini Flash | 60-80% |
| Mixed | Best-of-breed | Best-of-breed | Varies |
Some teams even mix providers — using Claude Opus for architecture and GPT-4o Mini for bulk code generation, for example. The models do not need to come from the same vendor. They just need to operate on a shared specification that any model can execute.
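In practice the pairing can live in a small configuration map. The identifiers below are illustrative shorthand, not exact API model strings:

```python
# Example strategy/execution pairings per provider. The names mirror the
# table above; verify current model IDs and pricing before adopting.
MODEL_PAIRS = {
    "openai":    {"strategy": "gpt-4o",         "execution": "gpt-4o-mini"},
    "anthropic": {"strategy": "claude-opus",    "execution": "claude-haiku"},
    "google":    {"strategy": "gemini-2.5-pro", "execution": "gemini-flash"},
    # Mixed: strategy and execution need not share a vendor.
    "mixed":     {"strategy": "claude-opus",    "execution": "gpt-4o-mini"},
}

def models_for(provider):
    """Return the (strategy, execution) pair for a provider setup."""
    pair = MODEL_PAIRS[provider]
    return pair["strategy"], pair["execution"]
```

Swapping providers then becomes a one-line config change rather than a workflow change.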
Other Optimizations That Stack
The two-model split was the biggest win, but we implemented several other changes that compounded the savings:
Shorter conversations. AI coding tools resend the full conversation history with every message. A 50-message conversation means the AI re-reads everything 50 times. Five 10-message conversations use far fewer total tokens than one 50-message session. Start a new session for each discrete task.
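The arithmetic behind splitting sessions is worth seeing once. Under a simple model where each message adds a fixed amount of context and the full history is resent every time, total tokens grow quadratically with conversation length:

```python
def total_tokens(messages, tokens_per_message=1_000):
    """Tokens processed across a session when the full history is
    resent: the k-th message carries k * tokens_per_message of context."""
    return sum(k * tokens_per_message for k in range(1, messages + 1))

one_long = total_tokens(50)        # one 50-message session
five_short = 5 * total_tokens(10)  # five 10-message sessions

print(one_long, five_short)  # 1275000 275000
```

Same 50 messages either way, but the split sessions process under a quarter of the tokens.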
Context compression. Most AI coding tools have a way to compress or summarize the conversation mid-session. Use it aggressively once the early context is no longer needed. This reduces the token payload on every subsequent message.
Selective file reading. When you ask the AI to read a 500-line file, those 500 lines enter the context window and get resent with every subsequent message for the rest of the session. Read only what you need. Specify line ranges when possible. Do not read entire files out of habit.
Prompt caching. Both OpenAI and Anthropic offer prompt caching that gives 50-90% discounts on repeated context. If your tool or API integration supports it, enable it. System prompts, large file contents, and specification documents that persist across messages benefit significantly from caching.
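A back-of-envelope for caching, assuming a 90% discount on cached input tokens. The discount and the $3/1M rate below are illustrative; check your provider's current pricing and cache terms:

```python
# Effect of prompt caching on repeated input context (illustrative rates).

def cost_with_cache(cached_tokens, fresh_tokens, rate_per_m, cache_discount):
    """Input cost when cached_tokens are re-read at a discounted rate."""
    cached_cost = (cached_tokens / 1e6) * rate_per_m * (1 - cache_discount)
    fresh_cost = (fresh_tokens / 1e6) * rate_per_m
    return cached_cost + fresh_cost

# 200k tokens of stable context (system prompt + spec), 5k fresh tokens,
# at $3 per 1M input tokens.
no_cache = cost_with_cache(200_000, 5_000, 3.00, 0.0)    # ~$0.615/message
with_cache = cost_with_cache(200_000, 5_000, 3.00, 0.9)  # ~$0.075/message
```

Stable context like system prompts and spec documents is exactly what persists across messages, so it benefits most.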
The Real Benefit Is Better Code
Here is the thing nobody expects: the two-model workflow actually produces better results than running everything on the premium model. The reason is that writing a clear specification forces you to think through the design before any code gets written. The spec catches ambiguities, missing requirements, and architectural problems that would otherwise surface mid-implementation.
Without the spec, developers were having long, meandering conversations with the AI, course-correcting as they went. The AI would write code, the developer would say "no, not like that," the AI would rewrite, and the cycle would repeat. Every revision burned tokens and produced inconsistent results.
With the spec, the design phase is explicit and the implementation phase is clean. Fewer revisions. More consistent output. Better documentation as a side effect, since the spec itself becomes a record of what was built and why.
Getting Started
If your team is spending more than you expected on AI development tools, start with these steps:
- Audit your usage. Most API providers have dashboards showing token consumption by model. Find out where the money is going.
- Categorize the work. Look at what your team is actually asking the AI to do. Most of it will be execution, not reasoning.
- Set the default to mid-tier. Change your team's default model configuration to the cheaper option. Make premium the exception, not the rule.
- Standardize the handoff. Create a spec template and train the team to use it when escalating to the premium model. The quality of the spec determines the quality of the execution.
- Measure the results. Compare the next month's bill against the baseline. The savings are usually immediate and dramatic.
AI-assisted development is one of the most significant productivity improvements in software engineering in decades. But like any tool, using it effectively requires understanding what each component is good at and matching the tool to the task. The most expensive model is not always the best choice. More often than not, a clear plan and a fast model will outperform an expensive model working from a vague prompt.
At Agave IS we help engineering teams optimize their AI development workflows — from model selection and prompt engineering to infrastructure and cost management. If your AI bill is growing faster than your output, we should talk.