A client came to us with a problem that is becoming increasingly common. They had adopted AI-assisted development across their engineering team — tools like GitHub Copilot, ChatGPT, Claude, and various API integrations — and the monthly bill had ballooned to over $8,000. Their developers were running every task through the most expensive model available, whether it was architecting a new microservice or adding a CSS class.
We restructured their AI workflow and cut that bill to under $1,600. Same output. Same quality. Here is how.
The Problem: One Model for Everything
Most AI coding tools default to sending your entire conversation history with every message. Every file the AI has read, every code block it has written, every back-and-forth exchange. A long session working on a feature can easily hit hundreds of thousands of tokens.
When you run all of that through a premium model, the costs add up fast. Here is what the major providers charge across model tiers in 2026:
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 |
| OpenAI | o1 | $15.00 | $60.00 |
| Anthropic | Claude Opus | $15.00 | $75.00 |
| Anthropic | Claude Sonnet | $3.00 | $15.00 |
| Google | Gemini 2.5 Pro | $1.25 - $2.50 | $10.00 - $15.00 |
The price difference between tiers is 3x to 10x depending on the provider. Our client's team was running everything through GPT-4o and Claude Opus because those were the defaults. Nobody had asked whether every task actually needed the most capable model.
The answer, it turns out, is no. Not even close.
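To make the tier gap concrete, here is a quick cost sketch using the Anthropic rates from the table above. The session sizes are hypothetical and the helper function is illustrative, not part of any SDK:

```python
# Cost of one hypothetical session at each tier, using the per-1M-token
# rates from the table above. Rates change; check current pricing.

def session_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of a session given per-1M-token input/output rates."""
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# A heavy session: 400k input tokens (context resent across messages),
# 20k output tokens of generated code.
opus_cost = session_cost(400_000, 20_000, 15.00, 75.00)   # Claude Opus
sonnet_cost = session_cost(400_000, 20_000, 3.00, 15.00)  # Claude Sonnet

print(f"Opus: ${opus_cost:.2f}, Sonnet: ${sonnet_cost:.2f}")  # $7.50 vs $1.50
```

At these rates the same session is five times cheaper on the mid-tier model, before any caching or context trimming.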
The Insight: 80% of AI Coding Work Is Execution, Not Reasoning
We audited a month of their AI usage and categorized every interaction. The breakdown was clear:
- ~20% was complex reasoning — architecture decisions, debugging subtle issues, designing data models, evaluating tradeoffs between approaches
- ~80% was straightforward execution — file edits, code generation from clear requirements, writing tests, refactoring, boilerplate, git operations, formatting
The execution work does not need a $75-per-million-output-token model. A mid-tier model follows detailed instructions just as well as a premium one. It just cannot originate those instructions as effectively from a vague conversation. That distinction is the key to the entire optimization.
The Fix: Two Models, Two Jobs
We restructured their workflow around a simple principle: use the expensive model to think, use the cheap model to build.
Every AI provider has a model tier that is excellent at following detailed instructions but costs a fraction of the premium tier. OpenAI has GPT-4o Mini. Anthropic has Claude Sonnet and Haiku. Google has Gemini Flash. These models are not inferior — they are optimized for different workloads.
The restructured workflow looks like this:
Step 1: Default to the Mid-Tier Model
Every coding session starts on the cheaper model. If the task is straightforward — add a field, fix a CSS issue, write a unit test, refactor a function — it stays on the cheaper model the entire time. Most tasks are straightforward.
Step 2: Escalate to Premium for Strategy
When the team needs to think through something complex — designing a new API, debugging a race condition, planning a database migration — they switch to the premium model. They go back and forth until the approach is solid. This might take 5 to 10 messages.
Step 3: Have the Premium Model Write a Detailed Spec
This is the critical step. Before switching back to the cheaper model, the premium model writes a detailed implementation specification. Exact files to create or modify, data structures, function signatures, edge cases to handle, validation steps. Every decision is made. Nothing is left ambiguous.
Step 4: Hand the Spec to the Mid-Tier Model
The cheaper model reads the spec and executes it. It creates files, writes code, runs tests. Because the spec is detailed and unambiguous, the mid-tier model executes it cleanly — often producing output identical to what the premium model would have generated, at roughly a fifth of the cost.
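The four steps can be sketched as a small routing function. Everything here is a placeholder: `call_model` stands in for a real SDK call, and the model names are illustrative rather than actual API identifiers.

```python
# Sketch of the two-model workflow. Replace call_model with a real
# API call (OpenAI, Anthropic, Gemini, ...); names are placeholders.

STRATEGY_MODEL = "premium-model"    # e.g. Claude Opus, o1
EXECUTION_MODEL = "mid-tier-model"  # e.g. GPT-4o Mini, Haiku, Flash

def call_model(model, prompt):
    # Stand-in that echoes which model handled which prompt.
    return f"[{model}] response to: {prompt[:40]}"

def run_task(task, needs_design=False):
    """Route a task: escalate to the strategy model only when the
    approach itself is unclear; otherwise execute directly."""
    if needs_design:
        # Steps 2-3: think it through, then write the detailed spec.
        spec = call_model(STRATEGY_MODEL,
                          f"Write an implementation spec for: {task}")
        # Step 4: the cheaper model executes the unambiguous spec.
        return call_model(EXECUTION_MODEL,
                          f"Implement this spec exactly:\n{spec}")
    # Step 1: straightforward work stays on the mid-tier model.
    return call_model(EXECUTION_MODEL, task)
```

The `needs_design` flag is the human judgment call from Step 2; everything else is mechanical.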
The Spec Is What Makes It Work
A vague spec produces vague results regardless of the model. The quality of the handoff document determines the quality of the output. We helped the client standardize on a template:
```
# Task: [Short Title]

## Goal
[1-2 sentences. What should be different when done.]

## Context
[Relevant background. Why this change is needed.]

## Files to Modify
- path/to/file.tsx — [what changes and why]

## Files to Create
- path/to/new-file.tsx — [what it does]

## Detailed Spec
[Exact implementation details. Data structures,
function names, return types, error handling.
Every decision made. Nothing left ambiguous.]

## Validation
- [ ] Tests pass
- [ ] Build completes without errors
- [ ] [specific acceptance criteria]
```
The key is the Detailed Spec section. If the premium model does its job, that section contains enough information that any competent developer — or any competent AI model — could implement it without making judgment calls. That is what makes the cheaper model work just as well for execution.
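To keep specs consistent across a team, the template can also be generated programmatically. This `Spec` dataclass is a hypothetical helper for illustration, not a tool the client actually used:

```python
from dataclasses import dataclass

@dataclass
class Spec:
    """In-memory form of the handoff template above (hypothetical helper)."""
    title: str
    goal: str
    context: str
    files_to_modify: list   # (path, what changes and why) pairs
    files_to_create: list   # (path, what it does) pairs
    detailed_spec: str
    validation: list        # acceptance criteria

    def render(self) -> str:
        """Render the spec as the markdown handoff document."""
        lines = [f"# Task: {self.title}", "## Goal", self.goal,
                 "## Context", self.context, "## Files to Modify"]
        lines += [f"- {path} — {why}" for path, why in self.files_to_modify]
        lines.append("## Files to Create")
        lines += [f"- {path} — {what}" for path, what in self.files_to_create]
        lines += ["## Detailed Spec", self.detailed_spec, "## Validation"]
        lines += [f"- [ ] {item}" for item in self.validation]
        return "\n".join(lines)
```

A rendered spec is then pasted (or piped) into the execution model's first message as-is.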
The Math
Here is what the client's monthly AI spend looked like before and after the change:
Before (all tasks on premium models):
| Usage | Tokens (M) | Monthly Cost |
|---|---|---|
| Premium models (all work) | ~150M | ~$8,200 |
After (20/80 split between premium and mid-tier):
| Usage | Tokens (M) | Monthly Cost |
|---|---|---|
| Premium models (strategy only) | ~30M | ~$1,100 |
| Mid-tier models (execution) | ~120M | ~$480 |
| Total | ~150M | ~$1,580 |
Same total token volume. Same development velocity. 80% cost reduction.
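The headline percentage follows directly from the table values:

```python
# Reproducing the before/after totals from the tables above.
before = 8_200          # all ~150M tokens on premium models
after = 1_100 + 480     # premium strategy work + mid-tier execution

savings = 1 - after / before
print(f"${after}/month, {savings:.0%} reduction")  # $1580/month, 81% reduction
```

Just over 80% when computed exactly, which is the rounded figure used throughout.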
This Applies to Every AI Provider
The two-model workflow is provider-agnostic. The specific models do not matter. What matters is the pattern:
| Provider | Strategy Model | Execution Model | Approx. Savings |
|---|---|---|---|
| OpenAI | GPT-4o / o1 | GPT-4o Mini | 60-85% |
| Anthropic | Claude Opus | Claude Sonnet / Haiku | 70-85% |
| Google | Gemini 2.5 Pro | Gemini Flash | 60-80% |
| Mixed | Best-of-breed | Best-of-breed | Varies |
Some teams even mix providers — using Claude Opus for architecture and GPT-4o Mini for bulk code generation, for example. The models do not need to come from the same vendor. They just need to operate on a shared specification that any model can execute.
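In practice the pairing can live in a small configuration map. The identifiers below are illustrative shorthand, not exact API model strings:

```python
# Example strategy/execution pairings per provider. The names mirror the
# table above; verify current model IDs and pricing before adopting.
MODEL_PAIRS = {
    "openai":    {"strategy": "gpt-4o",         "execution": "gpt-4o-mini"},
    "anthropic": {"strategy": "claude-opus",    "execution": "claude-haiku"},
    "google":    {"strategy": "gemini-2.5-pro", "execution": "gemini-flash"},
    # Mixed: strategy and execution need not share a vendor.
    "mixed":     {"strategy": "claude-opus",    "execution": "gpt-4o-mini"},
}

def models_for(provider):
    """Return the (strategy, execution) pair for a provider setup."""
    pair = MODEL_PAIRS[provider]
    return pair["strategy"], pair["execution"]
```

Swapping providers then becomes a one-line config change rather than a workflow change.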
Other Optimizations That Stack
The two-model split was the biggest win, but we implemented several other changes that compounded the savings:
Shorter conversations. AI coding tools resend the full conversation history with every message. A 50-message conversation means the AI re-reads everything 50 times. Five 10-message conversations use far fewer total tokens than one 50-message session. Start a new session for each discrete task.
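The arithmetic behind splitting sessions is worth seeing once. Under a simple model where each message adds a fixed amount of context and the full history is resent every time, total tokens grow quadratically with conversation length:

```python
def total_tokens(messages, tokens_per_message=1_000):
    """Tokens processed across a session when the full history is
    resent: the k-th message carries k * tokens_per_message of context."""
    return sum(k * tokens_per_message for k in range(1, messages + 1))

one_long = total_tokens(50)        # one 50-message session
five_short = 5 * total_tokens(10)  # five 10-message sessions

print(one_long, five_short)  # 1275000 275000
```

Same 50 messages either way, but the split sessions process under a quarter of the tokens.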
Context compression. Most AI coding tools have a way to compress or summarize the conversation mid-session. Use it aggressively once the early context is no longer needed. This reduces the token payload on every subsequent message.
Selective file reading. When you ask the AI to read a 500-line file, those 500 lines enter the context window and get resent with every subsequent message for the rest of the session. Read only what you need. Specify line ranges when possible. Do not read entire files out of habit.
Prompt caching. Both OpenAI and Anthropic offer prompt caching that gives 50-90% discounts on repeated context. If your tool or API integration supports it, enable it. System prompts, large file contents, and specification documents that persist across messages benefit significantly from caching.
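A back-of-envelope for caching, assuming a 90% discount on cached input tokens. The discount and the $3/1M rate below are illustrative; check your provider's current pricing and cache terms:

```python
# Effect of prompt caching on repeated input context (illustrative rates).

def cost_with_cache(cached_tokens, fresh_tokens, rate_per_m, cache_discount):
    """Input cost when cached_tokens are re-read at a discounted rate."""
    cached_cost = (cached_tokens / 1e6) * rate_per_m * (1 - cache_discount)
    fresh_cost = (fresh_tokens / 1e6) * rate_per_m
    return cached_cost + fresh_cost

# 200k tokens of stable context (system prompt + spec), 5k fresh tokens,
# at $3 per 1M input tokens.
no_cache = cost_with_cache(200_000, 5_000, 3.00, 0.0)    # ~$0.615/message
with_cache = cost_with_cache(200_000, 5_000, 3.00, 0.9)  # ~$0.075/message
```

Stable context like system prompts and spec documents is exactly what persists across messages, so it benefits most.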
The Real Benefit Is Better Code
Here is the thing nobody expects: the two-model workflow actually produces better results than running everything on the premium model. The reason is that writing a clear specification forces you to think through the design before any code gets written. The spec catches ambiguities, missing requirements, and architectural problems that would otherwise surface mid-implementation.
Without the spec, developers were having long, meandering conversations with the AI, course-correcting as they went. The AI would write code, the developer would say "no, not like that," the AI would rewrite, and the cycle would repeat. Every revision burned tokens and produced inconsistent results.
With the spec, the design phase is explicit and the implementation phase is clean. Fewer revisions. More consistent output. Better documentation as a side effect, since the spec itself becomes a record of what was built and why.
Getting Started
If your team is spending more than you expected on AI development tools, start with these steps:
- Audit your usage. Most API providers have dashboards showing token consumption by model. Find out where the money is going.
- Categorize the work. Look at what your team is actually asking the AI to do. Most of it will be execution, not reasoning.
- Set the default to mid-tier. Change your team's default model configuration to the cheaper option. Make premium the exception, not the rule.
- Standardize the handoff. Create a spec template and train the team to use it when escalating to the premium model. The quality of the spec determines the quality of the execution.
- Measure the results. Compare the next month's bill against the baseline. The savings are usually immediate and dramatic.
AI-assisted development is one of the most significant productivity improvements in software engineering in decades. But like any tool, using it effectively requires understanding what each component is good at and matching the tool to the task. The most expensive model is not always the best choice. More often than not, a clear plan and a fast model will outperform an expensive model working from a vague prompt.
At Agave IS we help engineering teams optimize their AI development workflows — from model selection and prompt engineering to infrastructure and cost management. If your AI bill is growing faster than your output, we should talk.