Role-Specialized Agentic Coding
How a two-agent loop — SDM (expensive model, minimal output) and SDE (cheap model, heavy implementation) — achieves 80–90% cost reduction over single-model agentic coding.
The Waste in Single-Model Agentic Coding
The standard approach to agentic code generation is simple: give one capable model a task, let it plan, implement, and self-correct. This works, but it is economically backwards. You are paying frontier-model prices for every token the model generates — including the 80% of output that is repetitive scaffolding, boilerplate, and verbose self-correction.
Consider GPT-5.3 Codex generating a feature. The model might output 2,000 tokens of code. But before that, it outputs 1,500 tokens of reasoning. Then it spots an error and outputs 800 more tokens of self-correction. All of this is billed at the expensive output token rate. In this example only 2,000 of the 4,300 output tokens (about 47%) are the code itself, and with further correction passes the ratio of valuable output to total output is often below 30%.
This is the core insight behind role specialization: not all tokens are created equal. Planning tokens require frontier-model intelligence. Implementation tokens do not — they are largely mechanical translation of a specification into code. Review tokens require frontier-model judgement again, but review output is naturally short.
The SDM/SDE Architecture
We designed a two-agent code generation loop modelled on the classic engineering management structure: a manager (the SDM) plans and reviews, and an engineer (the SDE) implements.
The Loop
| Step | Agent | Action | Output |
|---|---|---|---|
| 1 | SDM | Analyse requirements, create architecture plan | Plan document (structured, ~300 tokens) |
| 2 | SDE | Read plan, generate code, write to review doc | Code files + review document (~2000 tokens) |
| 3 | SDM | Read review document, perform code review | Feedback document (~200 tokens) |
| 4 | SDE | Improve code based on feedback | Updated code + review document |
| 5–6 | SDM, SDE | Repeat steps 3–4 until SDM approves or max 3 rounds | Final approved code |
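The control flow in the table can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not our production code: `run_loop`, `Review`, `Result`, and the three callables `plan`, `implement`, and `review` are hypothetical names that stand in for calls to the SDM and SDE models.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ROUNDS = 3  # matches the "max 3 rounds" limit in the table


@dataclass
class Review:
    approved: bool
    feedback: str  # short bullets from the SDM; empty when approved


@dataclass
class Result:
    review_doc: str  # last review document produced by the SDE
    approved: bool   # False means "escalate to a human"
    rounds: int      # number of SDM reviews performed


def run_loop(
    requirements: str,
    plan: Callable[[str], str],                      # SDM: requirements -> plan document
    implement: Callable[[str, Optional[str]], str],  # SDE: (plan, feedback) -> review document
    review: Callable[[str, str], Review],            # SDM: (plan, review document) -> verdict
) -> Result:
    plan_doc = plan(requirements)            # step 1
    review_doc = implement(plan_doc, None)   # step 2
    for round_no in range(1, MAX_ROUNDS + 1):
        verdict = review(plan_doc, review_doc)  # step 3
        if verdict.approved:
            return Result(review_doc, True, round_no)
        if round_no < MAX_ROUNDS:
            # step 4: the SDE revises using the SDM's feedback
            review_doc = implement(plan_doc, verdict.feedback)
    # Round limit reached without approval: hand over to a human.
    return Result(review_doc, False, MAX_ROUNDS)
```

Passing the models in as plain callables keeps the loop independent of any particular provider SDK, so either side can be replaced by a stub in tests.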
How the Cost Saving Works
The economics hinge on one property of LLM pricing: input tokens are much cheaper than output tokens, and the gap widens at the frontier. For GPT-5.3 class models, output tokens cost ~4–6x more than input tokens. For DeepSeek 4 Flash, both input and output are an order of magnitude cheaper.
The SDM model (expensive) is instructed to produce extremely concise output. Its job is to think, plan, and critique. Most of the tokens it handles are input (requirements, codebase context, and review documents), which are billed at the cheaper input rate. Its external output is minimised to a structured plan and short review bullets. It never writes code directly.
The SDE model (cheap) does the heavy lifting. It reads the plan, generates all the implementation code, writes it to the review document, and iterates based on feedback. Its output tokens are billed at the cheap rate — typically 10–20x cheaper per token than the frontier model.
Cost Model: Single-Model vs Two-Agent
Consider a real feature: "Add a paginated user listing endpoint with search and filter." This touches 3–5 files (controller, service, repository, tests) and produces ~300 lines of code. Here's the cost breakdown with realistic token counts.
| Metric | Single Model (GPT-5.3 only) | Two-Agent (SDM + SDE) |
|---|---|---|
| Expensive model input tokens | 25,000 (requirements + codebase context) | 40,000 (plan + 2 × reviews) |
| Expensive model output tokens | 12,000 (reasoning + 300 lines of code) | 1,200 (plan + feedback, concise) |
| Cheap model tokens | — | 30,000 input + 12,000 output |
| Total cost per task | $1.86 | $0.30 |
| Cost reduction | — | ~84% |
At 2,000 tasks per month (a team automating feature development, bug fixes, and test generation):
- Single-model approach: ~$3,720/month
- Two-agent approach: ~$600/month
- Annual savings: ~$37,440
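The arithmetic behind these figures can be checked with a few lines of Python. This is a sketch: `task_cost` and `savings` are illustrative helper names, and the per-million-token prices passed to `task_cost` are parameters you supply from your own provider's price list, since no per-token prices are stated here.

```python
def task_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD of one model's share of a task, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def savings(single_cost: float, two_agent_cost: float, tasks_per_month: int) -> dict:
    """Cost reduction and monthly/annual totals from per-task costs in USD."""
    monthly_single = single_cost * tasks_per_month
    monthly_two_agent = two_agent_cost * tasks_per_month
    return {
        "reduction_pct": 100 * (1 - two_agent_cost / single_cost),
        "monthly_single": monthly_single,
        "monthly_two_agent": monthly_two_agent,
        "annual_savings": 12 * (monthly_single - monthly_two_agent),
    }


# Per-task costs taken from the table above.
summary = savings(single_cost=1.86, two_agent_cost=0.30, tasks_per_month=2000)
print({name: round(value, 1) for name, value in summary.items()})
```

For a two-agent estimate, call `task_cost` once for the SDM and once for the SDE with their respective prices and add the results.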
These figures align with our production measurements, where we consistently observe 80–90% cost reduction depending on task complexity and the number of review rounds required.
Why Quality Doesn't Degrade
The natural concern is that using a cheaper model for implementation reduces code quality. In practice, the opposite occurs:
- The SDM's plan constrains the SDE's output space. A well-structured plan with explicit interfaces, data models, and acceptance criteria leaves little room for the SDE to diverge. The SDE becomes a mechanical translator of specification to code — a task at which smaller models excel.
- The review loop catches SDE errors. The SDM, reading the review document, spots deviations from the plan, suboptimal patterns, and missing edge cases. The SDE then fixes them. This is arguably better than a single model self-correcting, because the reviewer and implementer are independent systems with different failure modes.
- The cheap model can be pushed harder. Because SDE costs are low, we can afford multiple generation attempts, more thorough test generation, and longer output. The economic constraint that makes single-model systems produce minimal output doesn't apply to the SDE.
Implementation Details
The Plan Document
The SDM maintains a structured plan document with these sections:
- Architecture: Component breakdown, data flow, module boundaries
- Interfaces: Function signatures, type definitions, API contracts
- Data model: Schemas, validation rules, persistence strategy
- Acceptance criteria: Exact behaviours the implementation must satisfy
- Rejected approaches: What not to do, known pitfalls
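One way to represent these sections is a small dataclass. This is a hypothetical sketch, not our actual schema: the class name `PlanDocument`, its field names, and the `missing_sections` and `render` helpers are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class PlanDocument:
    # One list of short entries per section of the plan.
    architecture: list[str] = field(default_factory=list)
    interfaces: list[str] = field(default_factory=list)
    data_model: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    rejected_approaches: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that were left empty."""
        return [name for name, items in vars(self).items() if not items]

    def render(self) -> str:
        """Plain-text form of the plan, one heading per section."""
        parts = []
        for name, items in vars(self).items():
            parts.append(name.replace("_", " ").capitalize() + ":")
            parts.extend(f"- {item}" for item in items)
        return "\n".join(parts)
```

A check such as `missing_sections()` could be run before the plan reaches the SDE, so that an incomplete plan is rejected cheaply instead of surfacing later as a failed review round.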
The Review Document
The SDE writes each file it creates or modifies into a review document, alongside a summary of what was done and why. This document becomes the SDM's input for review. The SDM never reads raw code files — it reads the structured review document, which provides context and intent alongside the code.
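A review document of this shape might be assembled as below. The names `FileChange` and `ReviewDocument` and the rendered layout are assumptions made for illustration. The only properties taken from the description above are that each file travels with a summary of what was done and why, and that the SDM reads the rendered document instead of the raw files.

```python
from dataclasses import dataclass, field


@dataclass
class FileChange:
    path: str
    summary: str  # what was done and why
    content: str  # full text of the file after the change


@dataclass
class ReviewDocument:
    task: str
    changes: list[FileChange] = field(default_factory=list)

    def add(self, path: str, summary: str, content: str) -> None:
        """Record one created or modified file."""
        self.changes.append(FileChange(path, summary, content))

    def render(self) -> str:
        """Text handed to the SDM: intent first, then the code for each file."""
        lines = [f"Task: {self.task}", f"Files changed: {len(self.changes)}"]
        for change in self.changes:
            lines += ["", f"File: {change.path}",
                      f"Summary: {change.summary}", "Code:", change.content]
        return "\n".join(lines)
```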
Loop Termination
The loop runs a maximum of 3 rounds. In practice, ~70% of tasks pass review in round 1, ~25% pass after one revision, and ~5% need a second revision. Round 3 is a safety net: after it, the SDM either approves the task (~3% of all tasks) or flags it for human review (~2%).
We track the pass rate per round as a health metric. A declining round-1 pass rate signals that the plan document is insufficiently detailed, and the SDM's prompting should be adjusted.
| Round | Pass Rate (of tasks reaching the round) | Cumulative % of Tasks Approved |
|---|---|---|
| 1 (initial implementation) | 70% | 70% |
| 2 (after first feedback) | 83% | 95% |
| 3 (after second feedback) | 60% | 98% |
| Escalated to human | — | 2% |
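The cumulative column follows from the per-round pass rates, because each rate applies only to the tasks that were not approved in an earlier round. A short sketch of that calculation (the function name `cumulative_pass` is illustrative):

```python
def cumulative_pass(pass_rates: list[float]) -> list[float]:
    """Cumulative share of all tasks approved after each round,
    given the pass rate among the tasks that reach that round."""
    remaining = 1.0  # share of tasks still unapproved
    cumulative = []
    for rate in pass_rates:
        remaining *= 1.0 - rate
        cumulative.append(1.0 - remaining)
    return cumulative


shares = cumulative_pass([0.70, 0.83, 0.60])
print([f"{share:.0%}" for share in shares])  # ['70%', '95%', '98%']
print(f"escalated: {1 - shares[-1]:.0%}")    # escalated: 2%
```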
Beyond Cost: Additional Benefits
The role-specialized architecture produces advantages beyond the direct cost savings:
- Audit trail. The plan document, review document, and feedback are all persisted. Every decision is traceable. A human can review the entire conversation at any point.
- Independent iteration. The SDM and SDE can operate asynchronously. The SDM can plan the next task while the SDE implements the current one. With parallel task queues, throughput nearly doubles.
- Model independence. The SDM and SDE are decoupled by the document interface. Either model can be swapped independently. We have run experiments with Claude Opus 4.5 as SDM and Qwen 3 Coder as SDE, with similar cost profiles.
- Incremental improvement. Because the SDE's output is always reviewed by a stronger model, quality tends to improve over the course of a task. Earlier feedback stays in the SDE's context within a session, so corrected mistakes are less likely to recur.
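The asynchronous operation described under "Independent iteration" can be sketched with `asyncio`. This is a simplified illustration under stated assumptions: `sdm_plan` and `sde_implement` are hypothetical stubs that sleep instead of calling a model, and the review step is left out to keep the example short.

```python
import asyncio


async def sdm_plan(task: str) -> str:
    await asyncio.sleep(0.01)  # stands in for a frontier-model call
    return f"plan for {task}"


async def sde_implement(plan: str) -> str:
    await asyncio.sleep(0.01)  # stands in for a cheap-model call
    return f"code from {plan}"


async def run_pipeline(tasks: list[str]) -> list[str]:
    # maxsize=1 lets the SDM stay at most one plan ahead of the SDE.
    plans: asyncio.Queue = asyncio.Queue(maxsize=1)
    results: list[str] = []

    async def planner() -> None:
        for task in tasks:
            await plans.put(await sdm_plan(task))
        await plans.put(None)  # sentinel: no more plans

    async def implementer() -> None:
        while (plan := await plans.get()) is not None:
            results.append(await sde_implement(plan))

    # Both coroutines run concurrently: the SDM plans the next task
    # while the SDE is still implementing the current one.
    await asyncio.gather(planner(), implementer())
    return results


print(asyncio.run(run_pipeline(["task-a", "task-b"])))
```

The bounded queue is a design choice for the sketch: it stops the planner from running far ahead and producing plans against a codebase that the implementer is about to change.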
Conclusion
The prevailing assumption in agentic coding is that you need the most capable model for every part of the workflow. This is economically wrong. By splitting the cognitive labour, with planning and reviewing assigned to the frontier model (minimal output) and implementation to a cheap model (heavy output), we achieve the same or better quality at 80–90% lower cost.
The SDM/SDE pattern is not a compromise. It is a principled application of comparative advantage to AI agents. Each model does what it does best. The expensive model thinks. The cheap model builds. The loop connects them.
This architecture is now the default for all code generation tasks at Northorp. We recommend it to any team generating more than 5,000 lines of agent-written code per month — the savings pay for the integration effort within the first week.