Your AI Coding Agent Doesn't Need a Better Prompt

The first time an AI agent one-shots a working app from a paragraph of description, it feels like magic. You type a wish, you get software. So you do it again, and again, and for a while the magic holds.

Then you point the same agent at the codebase you actually get paid to work on. Two hundred thousand lines. Five years of decisions, half of them load-bearing and undocumented. A test suite that takes nine minutes to run. And the magic curdles. The agent confidently edits the wrong module, reinvents a utility you already have three of, breaks an invariant nobody wrote down, and hands you a diff that looks plausible and is quietly wrong.

The instinct at this point is to write a better prompt. Add more detail. Be more specific. Try again. That instinct is wrong, or at least it quickly stops scaling. The problem was never the wording of your request. The problem is everything the model could not see.

The bottleneck is context, not intelligence

It is tempting to treat a struggling agent as a model problem, something the next release will fix. That's typically not the case. The models are already strong enough to write the code you need. What they lack is a faithful picture of your system at the moment they act.

A language model is stateless. Every request starts from nothing, and everything it knows about your task has to fit inside a single finite context window: your instructions, the relevant files, the conventions, the prior steps, the error output. That window is a budget, and it is smaller than your codebase. Fill it with the wrong things and the right things never make it in. Stuff it to the brim and quality tends to degrade anyway, because models attend unevenly to long inputs and important details in the middle can get lost.
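To make the budget concrete, here is a minimal sketch of the arithmetic. The four-characters-per-token ratio is a rough assumption, not a real tokenizer, and the function names and window sizes are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about four characters per token.
    A real tokenizer will give different numbers."""
    return max(1, len(text) // 4)


def remaining_budget(window: int, reserved_for_output: int, parts: dict[str, str]) -> int:
    """Tokens left in the window after the listed parts and the reply reserve."""
    used = sum(estimate_tokens(text) for text in parts.values())
    return window - reserved_for_output - used


# Hypothetical request: every part competes for the same window.
parts = {
    "instructions": "x" * 400,    # about 100 tokens under the assumption above
    "relevant_files": "x" * 2000, # about 500 tokens
    "error_output": "x" * 800,    # about 200 tokens
}
left = remaining_budget(window=1000, reserved_for_output=150, parts=parts)
print(left)  # a negative value would mean something has to be cut
```

The useful part is not the numbers, which are made up, but the habit of treating every item in the request as a cost against one shared limit.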

So the real question stops being “what should I ask for?” and becomes “what should the model see when I ask?” That shift, from prompting to managing the context window as a scarce resource, is the whole game.

Context engineering: curating what the model sees

Once you accept that context is finite and precious, a set of moves falls out of it.

You give the agent durable, always-loaded ground truth: the conventions, the architectural rules, the things that are true on every task and should never have to be rediscovered. You pull in the specific files that matter for this change, and leave out the thousands that don't, which is a retrieval problem, not a prompting one. You prune aggressively, because every token spent on something irrelevant is a token stolen from something that matters. And you keep the working context clean as a task runs, so the model isn't reasoning over its own stale dead ends.
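The moves above can be sketched as a small selection routine. This is an illustration under stated assumptions, not how any particular agent works: `Snippet`, `build_context`, the relevance scores, and the `MIN_RELEVANCE` cutoff are all hypothetical, and the token estimate is the same rough four-characters-per-token guess.

```python
from dataclasses import dataclass

MIN_RELEVANCE = 0.3  # hypothetical cutoff: anything below this is pruned


@dataclass
class Snippet:
    path: str
    text: str
    relevance: float  # assumed to come from whatever retrieval step you use


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


def build_context(rules: str, candidates: list[Snippet], budget: int) -> list[str]:
    """Always load the project rules, then add the most relevant snippets that fit."""
    selected = [rules]
    remaining = budget - estimate_tokens(rules)
    for snippet in sorted(candidates, key=lambda s: s.relevance, reverse=True):
        if snippet.relevance < MIN_RELEVANCE:
            break  # everything after this point is even less relevant
        block = f"# {snippet.path}\n{snippet.text}"
        cost = estimate_tokens(block)
        if cost <= remaining:
            selected.append(block)
            remaining -= cost
    return selected


context = build_context(
    rules="r" * 40,
    candidates=[
        Snippet("a.py", "a" * 80, 0.9),
        Snippet("b.py", "b" * 200, 0.8),  # relevant, but too large for what is left
        Snippet("c.py", "c" * 40, 0.5),
        Snippet("d.py", "d" * 40, 0.1),   # below the cutoff, pruned
    ],
    budget=50,
)
```

The rules go in first because they are true on every task. Everything else has to earn its place by relevance and by fitting in what remains.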

This is what context engineering actually is. Not clever phrasing. The deliberate, ongoing work of deciding what enters the window and what stays out, so that the model is always operating on a true and relevant picture of the system.

A prompt is a wish. A spec is a contract.

Even a perfectly informed agent can build the wrong thing, beautifully, if the only target it has is a vague sentence. “Add saved searches with alerts” can be satisfied a hundred different ways, and the agent will pick one, commit to it, and present it as finished. You discover the gap in review, or worse, in production.

The fix is to stop handing agents wishes and start handing them specifications. A spec states, concretely and verifiably, what the feature must do to be correct: the behavior, the edge cases, the acceptance criteria, what is explicitly out of scope. It is the difference between “I'd like a deck” and a blueprint. The spec becomes the source of truth that the code is measured against, not a suggestion the model interprets.
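One way to picture a spec is as structured data rather than free text. The sketch below is hypothetical: the `Spec` class, its field names, and the saved-searches details are invented for illustration and are not a standard format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Spec:
    feature: str
    behavior: list[str]
    edge_cases: list[str]
    acceptance_criteria: list[str]
    out_of_scope: list[str] = field(default_factory=list)

    def is_reviewable(self) -> bool:
        """A spec with no stated behavior or acceptance criteria is still a wish."""
        return bool(self.behavior and self.acceptance_criteria)


# Hypothetical spec for the "saved searches with alerts" request.
saved_searches = Spec(
    feature="Saved searches with alerts",
    behavior=[
        "A user can save the current search under a name",
        "A saved search sends one alert per new matching item",
    ],
    edge_cases=[
        "Saving a duplicate name is rejected with a clear error",
        "An item that already triggered an alert never triggers a second one",
    ],
    acceptance_criteria=[
        "Given a saved search and one new matching item, exactly one alert is sent",
        "Given no new matching items, no alert is sent",
    ],
    out_of_scope=["Sharing saved searches between users"],
)

vague = Spec(feature="Add saved searches with alerts", behavior=[], edge_cases=[], acceptance_criteria=[])
```

Here `saved_searches.is_reviewable()` is true and `vague.is_reviewable()` is false, which is the gap between a blueprint and "I'd like a deck" in miniature.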

This matters more, not less, as agents get more capable. A more powerful agent given an ambiguous target just produces a more convincing version of the wrong thing.

The loop that makes it repeatable

A spec on its own is a document. What turns it into delivered software is a loop you run on purpose, the same way every time.

You start from durable project rules. You write a spec for the feature. You let the agent turn that spec into a plan, and the plan into small, reviewable tasks. The agent builds one task at a time, against the spec, and each step is small enough that you can actually check it before moving on. Then you verify the result against the acceptance criteria you wrote at the start.
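Reduced to its control flow, the loop might look like the sketch below. It assumes the agent and the reviewer can be modeled as plain functions. `run_loop`, `fake_build`, and `fake_review` are hypothetical stand-ins, not a real agent API.

```python
from typing import Callable, Optional


def run_loop(
    tasks: list[str],
    build: Callable[[str], str],
    review: Callable[[str, str], bool],
) -> tuple[list[str], Optional[str]]:
    """Build one task at a time and stop at the first change that fails review."""
    accepted: list[str] = []
    for task in tasks:
        change = build(task)
        if not review(task, change):
            return accepted, task  # stop here instead of piling more work on top
        accepted.append(change)
    return accepted, None


def fake_build(task: str) -> str:
    """Stand-in for the agent producing a small diff for one task."""
    return f"diff for: {task}"


def fake_review(task: str, change: str) -> bool:
    """Stand-in for checking a change against the spec; rejects one task on purpose."""
    return "alerts" not in task


tasks = ["add saved_search table", "add alerts job", "add settings page"]
accepted, stopped_at = run_loop(tasks, fake_build, fake_review)
```

With these stubs the loop accepts the first change and stops at "add alerts job". The third task is never started, so a rejected step cannot be buried under later work.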

The point of the loop is not ceremony. It is that each step is bounded and inspectable. You are never staring at a thousand-line diff trying to reverse-engineer what the agent decided. You are reviewing a small change against a target you defined. That is the difference between supervising an agent and cleaning up after one.

Verification is the other half of the job

Agents are fluent, and fluency reads as competence even when it isn't. An agent will tell you the change is done, the tests pass, the edge cases are handled, with exactly the same confidence whether or not any of it is true. If your verification strategy is reading the summary and nodding, you are vibe-checking, not verifying.

The way out is to make correctness checkable by something other than your own optimism. Acceptance criteria become tests. Tests become the target the agent has to satisfy. Evaluations tell you whether the behavior actually holds, repeatably, instead of once by luck. You move from trusting the agent's account of its work to having an independent signal about it.
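As a small illustration, here is how two acceptance criteria for a saved-search alert might become tests. The `new_matches` function is a hypothetical unit under test, invented for this sketch. In a real project it would be the code the agent wrote, and the tests would be run by a test runner.

```python
def new_matches(query: str, items: list[str], already_alerted: set[str]) -> list[str]:
    """Hypothetical unit under test: items that match the query
    and have not triggered an alert yet."""
    needle = query.lower()
    return [item for item in items if needle in item.lower() and item not in already_alerted]


def test_alerts_only_on_new_matches() -> None:
    # Criterion: an item that already triggered an alert never triggers another.
    items = ["Disk full", "disk full on db1"]
    assert new_matches("disk", items, {"Disk full"}) == ["disk full on db1"]


def test_no_alert_when_nothing_matches() -> None:
    # Criterion: with no new matching items, no alert is sent.
    assert new_matches("disk", ["cpu high"], set()) == []


if __name__ == "__main__":
    test_alerts_only_on_new_matches()
    test_no_alert_when_nothing_matches()
    print("acceptance checks passed")
```

Each test is a criterion written before the code existed, so "the tests pass" becomes a claim you can rerun rather than one you take on trust.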

The hard part is the codebase you already have

Most of the writing about AI coding assumes a blank page. Real work rarely starts there. It starts in a brownfield codebase with existing patterns, existing debt, and existing reasons things are the way they are.

In that setting, the discipline pays off most. You anchor the agent to the patterns that already exist so it extends your system instead of bolting a foreign one onto the side. You point it at the services and modules to reuse, by name, so it stops reinventing them. And once more than one person is doing this, it becomes a team practice: shared conventions, shared specs, shared standards for what “verified” means, so the method is something an organization runs, not a trick one developer keeps in their head.

This is a method, not a tool you buy

None of this is tied to a particular vendor or a particular model. It is a way of working: treat context as a finite resource, engineer what the agent sees, write specs instead of wishes, run a bounded loop, and verify against a real target. The tools will keep changing. The discipline is what carries across them.

I spent a long time figuring out how to put all of this together in a way that holds up on production code, and then wrote it down. Beyond the Prompt: Spec-Driven Development and Context Engineering for AI Coding Agents is the full version of this argument, in much more depth than a blog post can hold. It walks through the context window as a resource, the core context-engineering moves, specs and the spec-driven loop, verification and evals, and the realities of brownfield work and team governance, with two worked examples you can clone and follow end to end: a brownfield analytics dashboard and a greenfield issue tracker, both taken from spec to verified feature.

If the wall I described at the top is one you have hit, the book is the map I wish I'd had. It's available now on Amazon in paperback and Kindle.

Beyond the Prompt by Chris Tagliaferro: brilliant in demos, lost in your real codebase.
