Skip to main content
← Blog
AIAgentsClaude CodeBuild

Why does my AI ship features I can break in two minutes?

A founder asleep at his desk while three hooded robot critics argue over a blueprint at night

I've been building a lot lately. Like, a lot. And one of my favourite things right now is handing Claude the night shift. I'll genuinely tell it have a good night shift, go to bed, and wake up to work that shipped while I slept — some days I feel like a rig crew running back-to-back shifts. It is so cool, getting to build whilst I sleep.

But as I ramp up development I keep getting into a series of bottlenecks. One is just my time to review everything, both the plans and the deployed projects. I needed to push Claude onto larger end-to-end tasks so I wasn't constantly hanging around waiting for it to ship a single feature. I needed it to take on a sprint, execute against it, and monitor itself. Early on, doing that, I felt like I had a very smart, hyper-energised fresh grad — producing a ton, but not thinking about what complete really means, especially for a startup. It's not signing off a ticket; it's asking whether you shipped anything of value. So I began to ask myself:

Why does my AI ship features I can break in two minutes?

That's the question I've been sitting with since April. Opus 4.7 writes the code, runs the tests, marks the ticket done. I open the feature, click around for two minutes, and I find a unit error in the aggregate, or a single-company view that's missing half its fields, or a button that works on the happy path and silently fails the moment a real user takes a slightly different route. The model isn't wrong. It built what I asked for. It just didn't build something a user could use.

You could say this is all my fault. I should be a better project manager. Tighter tickets, sharper acceptance criteria, less ambiguity in the prompt. Where I push back: if I really want to decouple development velocity from me, the way you do when you hire founding engineers and senior developers, I have to be able to hand off ambiguous tasks and trust people to take on the mantle and deploy value. It's a failure of process if I can find issues in minutes. My AI should be better. And this is, as I keep finding, ridiculously similar to growing a junior team. I saw it in new grads: given the nature of university, the desire to tick off the work and submit it was the objective function. Often it would be really well put together, complete on the surface — and then you present in an executive review, and within minutes an exec says that number "doesn't make sense." A stray calculation, no sanity-checking, never walking through the product as a user, an overconfidence in your own execution that kept you from going deep and asking whether you were right.

This was annoying me to no end, so over a weekend I built what follows. Honestly, I half-hope someone else ships it as a product first — Anthropic's dynamic workflows could even be that.

Two gripes, restated as one

Opus behaves like an over-eager intern. Smart, fast, ambitious to close the ticket, structurally incapable of asking what does the user actually do with this once it ships? When the instructions have a gap, it doesn't augment them — it simplifies them. Its objective function is getting back to me with a "done." It doesn't deeply review its own work, and it isn't critical enough of what it builds.

The model optimises for closing the ticket, not for landing the value.

Planning is the highest-leverage hour of the week

The /lc-chain review pipeline: router, tier decision, planner, council critics, author response, second round, wrap-up, and output docs

In any real product org you have QBRs, product design reviews, planning sessions. The best team members take those high-level directions and augment them, detailing them precisely into what gets executed. Or at least that's how it should be — you cherish the people who can do that. The good sessions are a debate that arrives at consensus on direction.

With my agents, reviewing plans has become the most valuable thing I do. The hard part is pushing the agents to make great plans in the first place. There are augmented agent↔human loop techniques I use that I'll cover separately; the part I want to talk about here is the adversarial review process.

Inspired by Karpathy's llm-council, I built an adversarial council for each plan, especially the largest features. Three frontier models (GPT-5, DeepSeek-V4-Pro, and originally Gemini) read the plan, attack it from three angles, and write their critiques into a review document on the Linear ticket. Up to two rounds. Then the developer-agent integrates everything and writes the final plan.

They read it anonymized — not for the models' sake (they never see each other), but for mine. I caught myself doing exactly what I'd warn an analyst against: discounting a sharp point because "oh, that's just the cheap model," and over-weighting the one with the brand-name reputation. Anonymizing the source forces me to judge the argument, not the logo.

I built two commands that sit in front of this. /lc-router reads a ticket, looks at the Linear tags, and proposes a tier: Tier 1 (single planner), Tier 2 (one round of critique), Tier 3 (two rounds plus a QA plan and a structured review.md that escalates any unresolved disagreement to me). The router deliberately biases toward escalation — a false Tier 3 costs about a dollar; a false Tier 1 is how real failure modes slip through. It posts a recommended tier with the signals that fired, then stops, so I can confirm or override before any spend.

/lc-chain runs the council. Every review-side call is a typed Pydantic AI agent with an output_type, so structured outputs come back typed and validated, with a retry on schema failure, and Langfuse traces emit automatically. The planning and synthesis reasoning stays in the Claude Code conversation on my Max subscription; only the critics go out to OpenRouter. The chain runs critics in parallel, with a quorum of two valid returns and a hard cap of two rounds. Round three is forbidden inside the council. Round three escalates to me.

Separately, the council reviews a QA plan the developer-agent writes — a YAML file spelling out the exact tasks and tests it should run to prove the feature works, including Chrome browser testing and user walkthroughs of single entities.

The anonymized council: a central planner and synthesizer surrounded by three critic members, with provider identities hashed and a quorum requiring two valid responses

The part I didn't expect to like as much as I do is the disagreement handling. Where the council splits, I get a table of exactly what the members disagree on, so my attention goes straight to the judgment calls instead of re-reading the whole plan. And because the critics grind through plans in the background, agents can be tearing into next week's features on one worktree while I'm shipping this week's on another.

Does it actually work? The clearest signal came from a data-foundation ticket where every aggregate metric looked green: join coverage reported as fine across the whole dataset. The chain forces a single-example trace — pick one entity, walk it end to end, the way a user would. That one entity surfaced 43% missing coverage that every aggregate number had smoothed over. That's the exec test, encoded: the hole I'd have found by hand in two minutes, caught before it ever reached me.

The agent-to-human interface

The chain is plumbing. The interface is what matters: the review.md, the escalation rules, the exact format in which an agent surfaces a decision it can't resolve and hands it to me. That's where a founder's judgment gets into the loop, and I think that's the real idea here, more than the council is. I'll write that one up once I've shipped more of it.

One last thing. Engineers and agents share the same itch: to build from scratch what's already solved. So part of the planning cycle now is a check — is there a startup or a framework already doing this? I don't want to own infrastructure that needs perpetually maintained if I don't have to. All of the above, I expect to be collapsed into a product someone else sells. When it is, I'll happily destroy mine and replace it.

The council I designed was three-way. The council that ships is two-way, because the third member couldn't hold the contract reliably.