equityflow: understand the flow and follow it

The full-control path: fullstack vibecoding with your own server and Claude Code

Own the whole stack. The engineering half of vibecoding: a multi-provider workflow, architecture that contains mistakes (modularization, contracts, tests, git), why AI hallucinates code and how to stop it, subagents and orchestration, and going from localhost to a live domain you control.

At some point you want to own the whole thing: your own server, your own database, an agent like Claude Code working across the codebase, and no platform deciding what you can build. This is the full-control path. It is more work and more responsibility (you are now the one who has to think about architecture, infrastructure and security), but it is also where serious products live. This guide is the engineering half of vibecoding: how to structure a project so a mistake does not sink it, how to run more than one AI tool without chaos, how to stop the AI from confidently breaking things, and how to take it from localhost to a live domain you control.

Run more than one tool: the multi-provider strategy (2 chapters · ~8 min)

The multi-provider strategy: do not bet the project on one tool

Here is the uncomfortable truth about every AI coding tool: each one will, at some point, fail you. Not because it is bad, but because it is a product run by a company with capacity limits, a roadmap, a pricing team, and an incident page. Anthropic has outages. Cursor changes its pricing. A model you relied on gets deprecated, or quietly gets worse on the exact kind of code you write. If your project can only be built by one tool, then your project inherits every bad day that tool has.

The fix is not loyalty to the "best" provider. It is the ability to move between providers without rewriting your project. That ability is a strategic asset, and it is the real reason modular architecture matters.

Why one provider is a single point of failure

Lock-in with AI coding is subtle. You are not locked in by file formats; you are locked in by understanding. If one agent built your whole codebase its way, with its conventions and its mental model baked in across every file, then switching tools means a second agent has to relearn a tangle it did not write. The cost of leaving is high, so you stay, and you absorb the rate limits, the 6am outage during your launch, and the price hike.

Concrete failure modes you are insuring against:

  • Outages. Every provider has them. If your only workflow is Claude Code and Anthropic has a bad afternoon, you are blocked. If you can flip the same task to a Gemini-backed agent, you are not.
  • Rate limits. Most tools throttle by usage tier. Splitting work across two providers roughly doubles your effective ceiling on a heavy day.
  • Capability gaps. Models genuinely differ. One is sharper at gnarly type systems and refactors, another is faster and cheaper for boilerplate CRUD and migrations. You want to send each job to the tool that is good at it.
  • Pricing and deprecation. Prices and model lineups shift constantly (this is mid-2026; they will shift again before you finish reading the series). Optionality is your hedge.

How modularity turns this into parallelism

If your project is a single ball of mud, multi-provider is impossible: two agents editing the same files at once will collide and overwrite each other. But if you followed a modular architecture and split the system into modules that meet only at clean interfaces (a typed API contract, a defined function signature, a schema), then two different agents can work at the same time without touching each other's code.

A real example. Say you are building a SaaS app. You define two contracts up front: an auth module that exposes getCurrentUser() and requireRole(role), and a data-access module that exposes repositories with typed methods like orders.findByUser(id). With those interfaces fixed:

  • Claude Code works on the auth module in one branch (sessions, token rotation, role checks). This is fiddly security-shaped logic, which it handles well.
  • Google Antigravity (Google's agent-first IDE built on VS Code, running Gemini 3 Pro, released late 2025 and in public preview as of mid-2026) works on the database layer in parallel (schema, migrations, the repository methods).

Neither agent needs to understand the other's internals. The auth module only needs orders.findByUser to exist with the agreed signature; it does not care how the query is written. They integrate at the interface, not in each other's source. You get genuine parallelism (two streams of work happening at once), and as a bonus, two providers' rate limits and two providers' strengths.
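To make the two contracts concrete, here is a minimal sketch of how they might be written down before any agent starts work. It is an illustration, not the article's actual code: the names are Python renderings of getCurrentUser(), requireRole(role) and orders.findByUser(id), and the User and Order fields, the "customer" role, and the in-memory fakes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class User:
    id: str
    role: str


@dataclass(frozen=True)
class Order:
    id: str
    user_id: str
    total_cents: int


class Auth(Protocol):
    """Contract owned by the auth module (built by one agent)."""

    def get_current_user(self) -> User: ...
    def require_role(self, role: str) -> None: ...


class OrdersRepository(Protocol):
    """Contract owned by the data-access module (built by another agent)."""

    def find_by_user(self, user_id: str) -> list[Order]: ...


def list_my_orders(auth: Auth, orders: OrdersRepository) -> list[Order]:
    """Glue code: depends only on the two contracts, never on internals."""
    auth.require_role("customer")  # hypothetical role name
    return orders.find_by_user(auth.get_current_user().id)


# Throwaway in-memory stand-ins, so either side can be exercised alone.
class FakeAuth:
    def __init__(self, user: User) -> None:
        self._user = user

    def get_current_user(self) -> User:
        return self._user

    def require_role(self, role: str) -> None:
        if self._user.role != role:
            raise PermissionError(f"role {role!r} required")


class FakeOrders:
    def __init__(self, rows: list[Order]) -> None:
        self._rows = rows

    def find_by_user(self, user_id: str) -> list[Order]:
        return [order for order in self._rows if order.user_id == user_id]
```

Because list_my_orders only names the two Protocol types, either fake can be swapped for the real module once its agent delivers it, and the calling code does not change.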

The interface is the contract that lets two strangers build one machine. Spend your effort defining it; the modules behind it become swappable, and so do the tools that write them.

Making it work in practice

This is a discipline, not a button. A few rules keep it from turning into chaos:

  • Write the interface before you assign the work. The contract (types, signatures, the API shape) is human-owned and agreed first. It is the seam everything else hangs off.
  • One module, one branch, one agent at a time. Parallelism lives between modules, never inside one file. Use separate branches or worktrees so the agents never share a working directory.
  • Keep tests at the boundary. Integration tests against the interface catch the moment one side drifts from the contract, which is exactly where multi-agent work breaks.
  • Stay portable. Avoid leaning on one tool's proprietary config or "magic" project format as load-bearing. The more standard your repo (plain git, conventional structure, real tests), the easier any agent or provider can pick it up cold.

The payoff is resilience you can feel. When a provider has an outage mid-sprint, you reassign the task and keep moving. When one model gets noticeably better at frontend, you route frontend to it. You are not married to a vendor; you are running a portfolio. And the thing that makes the portfolio possible is not the tools at all. It is the clean modular architecture underneath, which is why modular architecture comes first.

Parallelizing across providers without chaos

Once you trust one agent, the temptation is obvious: run four. Point Claude Code at the auth refactor, Cursor at the new billing page, a third session at test coverage, a fourth at the docs. In theory you just 4x'd your throughput. In practice, the failure mode is just as predictable: two agents edit the same file, the third assumes a function signature the first one just changed, and you spend the afternoon untangling a merge instead of shipping. Parallelism does not fail because the models are bad. It fails because nobody owns the boundaries.

The good news is that this is a solved problem in software; it is just one that most people skip. Parallel agents need the same discipline parallel humans do, applied a little more strictly because an agent has no peripheral vision. It only sees the context you give it, so it will happily clobber work it never knew existed.

The one rule that prevents most pain

One agent owns one module at a time. Not one feature spread across the codebase, but one bounded surface: a directory, a service, a clear set of files. If two agents could plausibly touch the same file, you have a coordination bug waiting to happen, and you should either serialize them or redraw the boundary. This is the same instinct as Conway's law and good module design, and it is exactly why a clean codebase parallelizes well and a tangled one does not.

  • Assign by module, never by vague task. "Own /billing" is enforceable. "Work on payments" is not.
  • Integrate through interfaces, not internals. Agent B consumes Agent A's module through a typed, tested contract (an API route, an exported function signature), and never reaches into A's files.
  • Freeze shared files. Schemas, route tables, and dependency manifests are high-collision zones. Edit those yourself, or hand them to exactly one agent.

Git worktrees: the missing primitive

The mechanism that makes this clean is git worktree. A worktree gives each agent its own working directory checked out to its own branch, all backed by the same repository. No stashing, no branch switching, no two processes fighting over the same files on disk. As of early 2026 this stopped being a manual trick: Claude Code ships a worktree flag so claude --worktree feature-x spins up an isolated branch and directory, and Cursor built its "Parallel Agents" feature directly on worktrees. Tooling like ParallelCode wraps the same idea for multi-agent runs.

A concrete setup. You are adding Stripe billing to a SaaS app:

  1. git worktree add ../app-billing feature/billing, point Claude Code at it, and tell it: you own src/billing/** and expose createCheckoutSession().
  2. git worktree add ../app-emails feature/emails, point Cursor at it: you own src/email/** and expose sendReceipt().
  3. Each runs on its own port and its own branched database so they never collide at runtime either.
  4. When both pass their own tests, you (the human, or an orchestrator session) merge through an integration branch, run the full suite, and resolve the few conflicts that remain in shared glue.

The two agents never see each other's files. They agree only on the two function signatures, and those become the integration test. That is the whole game: isolate hard so the agents do not have to coordinate, then merge deliberately in one place under a passing test suite.
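One way to enforce that ownership rather than just hope for it is a small check run before the merge. The sketch below is an assumption-laden illustration: the agent labels and the OWNERS map are made up, and the list of changed paths is expected to come from something like git diff --name-only against the integration branch.

```python
from fnmatch import fnmatch

# Hypothetical ownership map: one glob per agent, mirroring the setup above.
# fnmatch's "*" also matches "/", so nested directories are covered.
OWNERS = {
    "billing-agent": "src/billing/*",
    "email-agent": "src/email/*",
}


def stray_paths(agent: str, changed: list[str]) -> list[str]:
    """Return changed paths that fall outside the agent's owned glob."""
    pattern = OWNERS[agent]
    return [path for path in changed if not fnmatch(path, pattern)]
```

An empty result means the branch stayed inside its module. Anything else, such as a touched package.json, is a shared file that should have been frozen or handed to exactly one owner.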

Across providers, not just sessions

Running Claude on one module and Cursor or another provider on the next is fine, and sometimes useful when their strengths differ. The discipline does not change: the boundary is git and the tested interface, not the vendor. A worktree does not care which agent edits it. What matters is that ownership is unambiguous and the contract between modules is written down and verified.

When parallelism is not worth it

Be honest about the overhead. Every parallel stream costs you a worktree to manage, a branch to track, a merge to review, and a context to hold in your own head. Below a certain size that tax is larger than the speedup. A small change, anything where the modules are genuinely entangled, or any work where you cannot cleanly say "this file belongs to exactly one agent" is almost always faster done sequentially.

Parallelize independent modules with clear contracts. Serialize anything that shares state. If you cannot name the owner of every file before you start, you are not ready to run agents in parallel; you are ready to create a merge disaster.

Takeaway: the constraint that makes parallel agents safe is not smarter models; it is hard isolation (one owner per module, a worktree per stream) plus one deliberate merge point gated by tests. Add streams only when the boundaries are clean and the time saved clearly beats the coordination cost.

Architecture for the long run (9 chapters · ~34 min)

Modularization is the whole game

Here is the single most important idea in this entire article, and the one that decides whether AI makes you faster or eventually buries you: how your code is divided up matters more than how any individual piece is written. You can let AI write code you do not fully understand. Founders do it every day and ship real products. What you cannot survive is letting AI write code you do not understand inside a system with no internal walls. The first is delegation. The second is a time bomb.

Watertight compartments

The Titanic analogy is overused, so let me use it correctly. A ship's hull is divided into watertight compartments. The point is not that water never gets in. The point is that when it does, the damage is contained to one or two compartments and the ship stays afloat. The Titanic sank because too many compartments flooded at once and the bulkheads did not run high enough. The failure was architectural, not bad luck.

Your software is the ship. Every feature, every integration, every risky AI-generated change is potential water. The question is never "will something break" (it will) but "when it breaks, how much sinks". A well-modularized codebase has bulkheads: clear boundaries between parts, where each module talks to its neighbors through a small, defined interface and knows nothing about their guts. A tangle has none. One leak, and the whole hull floods.

Blast radius is the metric that matters

Think in terms of blast radius: when a change goes wrong, how far does the damage spread? In a tangled project, the blast radius is "everything", because everything reaches into everything else. You ask the AI to tweak how prices display, and three days later checkout is silently broken, because the same function quietly handled both. You did not know it did both. Neither, really, did the AI.

In a modular project, the blast radius is one module. If your payment logic lives behind a clean boundary, separate from your email logic, then a botched AI edit to email cannot corrupt a charge. Worst case, you delete that module's changes and try again. You do not have to understand the payment code to trust that it is safe, because the seam protects it. That is the real unlock: you trust the boundaries, not your comprehension of every line.

A concrete example

Say you are building a SaaS dashboard. Tangled version: one giant file (or a handful) where the database queries, the business rules, the API routes, and the UI rendering all call each other directly. You prompt Cursor or Claude Code to "add team invitations". It threads invitation logic through the user-loading code that login, billing, and settings all share. A subtle change there now affects all four. You are debugging blind in a system where any thread pulls the whole sweater.

Modular version: auth/, billing/, invitations/, notifications/, each exposing a few named functions and hiding the rest. "Add team invitations" becomes a new invitations/ module plus one new call from the settings page. If it is wrong, you throw away that folder. Login still works. Billing never knew it happened. You shipped a feature without auditing the whole app, and you could because the walls held.

This is what lets non-experts build safely

People hear "you don't need to understand the code" and "modularization is critical" as contradictory. They are the opposite. Modularization is precisely what makes the first statement survivable. You do not need to understand the inside of the engine room if the bulkhead around it is sound. The skill you actually need is not reading every line; it is insisting on the seams, and noticing when the AI starts dissolving them (a function that touches four unrelated things, a module that imports half the project) and stopping it.

The goal is not a codebase you fully understand. It is a codebase where any single mistake stays small enough to throw away.

Practical takeaways:

  • Demand boundaries up front. Tell the AI to keep features in separate modules with small public interfaces, and say so in your project rules so it holds across sessions.
  • Watch the imports. When one file starts reaching into everything, that is a bulkhead failing. Ask the AI to refactor before adding more.
  • Test the seam, not the soul. You do not need to grok a module's internals; you need to trust what goes in and comes out. Pin that with a test or two.
  • If a change feels scary, your modules are too big. Fear is a design signal. Split until a bad edit is a small, deletable mistake.
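"Watch the imports" can be turned into a rough automated check. The following is a sketch for a Python codebase only, under the assumption that all first-party code lives in one top-level package (called app in the usage below). The limit of two sibling modules is an arbitrary threshold for illustration, and relative imports are ignored for brevity.

```python
import ast


def internal_imports(source: str, package: str) -> set[str]:
    """Return the first-level submodules of `package` that `source` imports."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if node.module == package:
                # "from app import billing" names the submodule in the alias.
                names = [f"{package}.{alias.name}" for alias in node.names]
            else:
                names = [node.module]
        else:
            continue  # not an import, or a relative import (skipped here)
        for name in names:
            parts = name.split(".")
            if parts[0] == package and len(parts) > 1:
                found.add(parts[1])
    return found


def is_leaky(source: str, package: str, limit: int = 2) -> bool:
    """Flag a file that reaches into more than `limit` sibling modules."""
    return len(internal_imports(source, package)) > limit
```

A file that trips is_leaky is not necessarily wrong, but it is the kind of file worth asking the AI to refactor before more is added to it.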

Drawing module boundaries an AI can work behind

The single biggest lever you have over how well an AI builds your software is not the model, the prompt, or the IDE. It is how you cut the project into pieces. Get the boundaries right and a model (or a junior, or future-you at 1am) can change one part without understanding the rest. Get them wrong and every change ripples into six files, the AI loses the thread, and you spend your evening untangling a bug it created two prompts ago.

A module is just a chunk of code that exposes a small, clear interface and hides everything else behind it. The interface is the contract: a handful of functions with sensible names, inputs, and outputs. Everything inside (the database calls, the third-party SDK, the retry logic) is private. Nobody outside is allowed to reach in.

The payments example

Say your app needs to take money. The naive version sprinkles Stripe calls everywhere: a bit of stripe.PaymentIntent.create in the signup flow, more in the upgrade flow, error handling copy-pasted in three places. Now Stripe is woven through your whole codebase. Switching to a different processor, or even upgrading the Stripe SDK, means hunting through everything.

The disciplined version puts one module in the middle. The rest of the app only ever sees:

  • charge(user, amount) returns a result (success, or a typed failure like CardDeclined).
  • refund(chargeId).
  • maybe getReceipt(chargeId).

That is the entire contract. Inside the module lives the Stripe API key, the idempotency keys, the webhook parsing, the currency rounding, the retry-on-timeout. None of it leaks. The checkout page calls charge(user, 4900) and is done. If you later move to Adyen or add Apple Pay, you rewrite the inside of the module and the checkout page never changes.
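As a rough sketch of that contract in code (only charge is shown): the Gateway seam, GatewayDeclined, and FakeGateway below are hypothetical names invented for the example and are not Stripe's real API. A real adapter would wrap the processor SDK behind the same small interface.

```python
from dataclasses import dataclass
from typing import Protocol, Union


@dataclass(frozen=True)
class ChargeSucceeded:
    charge_id: str
    amount_cents: int


@dataclass(frozen=True)
class CardDeclined:
    reason: str


# The typed result: callers must handle both outcomes explicitly.
ChargeResult = Union[ChargeSucceeded, CardDeclined]


class Gateway(Protocol):
    """Hypothetical seam around the processor SDK; not a real vendor API."""

    def create_charge(self, customer_ref: str, amount_cents: int) -> str: ...


class GatewayDeclined(Exception):
    """Raised by a gateway adapter when the card is refused."""


class Payments:
    """Public surface of the payments module (only charge is sketched)."""

    def __init__(self, gateway: Gateway) -> None:
        self._gateway = gateway  # private: callers never touch the SDK

    def charge(self, user_id: str, amount_cents: int) -> ChargeResult:
        if amount_cents <= 0:
            raise ValueError("amount_cents must be positive")
        try:
            charge_id = self._gateway.create_charge(user_id, amount_cents)
        except GatewayDeclined as exc:
            return CardDeclined(reason=str(exc))
        return ChargeSucceeded(charge_id=charge_id, amount_cents=amount_cents)


class FakeGateway:
    """Test double: declines any amount above a configured limit."""

    def __init__(self, limit_cents: int) -> None:
        self._limit = limit_cents

    def create_charge(self, customer_ref: str, amount_cents: int) -> str:
        if amount_cents > self._limit:
            raise GatewayDeclined("insufficient funds")
        return f"ch_{customer_ref}_{amount_cents}"
```

Swapping processors then means writing a new class that satisfies Gateway. Payments.charge and every caller stay as they are.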

Why this is what makes vibecoding work

Here is the part that matters for building with AI: to direct work on a module, you only need to understand its contract, not its internals. You do not need to know how Stripe webhooks are verified to tell an AI "build me a payments module that exposes charge and refund, hides Stripe, and returns a typed error on a declined card." That is a complete, checkable spec a non-expert can write and a non-expert can verify, because you test it through the front door: call charge, see if money moves.

This also keeps each AI task inside a context window the model can actually hold. An AI editing the payments module needs the payments code plus the contract for anything it calls. It does not need your email system, your auth, or your analytics. Small, well-fenced modules mean small, focused prompts, which means fewer hallucinated cross-references and far less collateral damage when something goes wrong. When a model only sees one room, it cannot knock down walls in the rest of the house.

Practical rules for where to cut

  • Cut along things that change for the same reason. Everything that has to change when Stripe changes belongs in one module. Everything that changes when your email provider changes belongs in another.
  • Cut at the external world. Every third-party service (payments, email, SMS, file storage, the LLM API itself) gets wrapped behind your own small interface. Never let a vendor's SDK spread past its wrapper.
  • A good interface is narrow and names what, not how. sendWelcomeEmail(user) is a boundary. smtpConnectAndQueue(...) is an implementation detail that escaped.
  • Keep state private. If two modules read and write the same database table directly, they are not really separate. Route it through one owner.
  • Write the contract down. A short comment or type signature at the top of the module ("inputs, outputs, what can fail") is the brief you hand the AI and the thing you check its work against.
  • If you cannot describe a module in one sentence without "and", it is probably two modules.

Takeaway: before you ask an AI to build anything non-trivial, decide the boundaries yourself and write the contracts. Spend ten minutes naming the modules and their public functions, and you turn one impossible 5,000-line prompt into a dozen small jobs you can hand out, check, and swap one at a time. The boundary is the unit of trust: it is exactly the line behind which you let the AI work without watching every keystroke.

Contracts and interfaces are the glue

Software is not one big thing. It is many small things wired together, and the wires are where projects fall apart. The wire between two modules is a contract: a function signature, a typed interface, an API schema, a database table. The contract says "if you give me input shaped like this, I promise to return output shaped like that." Everything else is implementation detail hidden behind it. When you are building with AI, contracts matter more, not less, because a coding agent that cannot see the whole system can still write correct code if the seam it is working against is written down precisely.

Why types are leverage, not bureaucracy

A type is a contract a machine can check. TypeScript, Python type hints, and Pydantic models all turn "I think this returns a user object" into a fact the editor, the compiler, and the AI all verify before anything runs. This is the cheapest bug-catch you will ever get. A mismatch like passing userId as a number where the function expects a string shows up as a red squiggle, not a 2am production incident.

This is also why types make agents dramatically more reliable. An agent renaming a field, refactoring a function, or wiring up a new endpoint gets immediate, local feedback from the type checker. It can iterate against tsc or mypy in a loop until the errors are gone, without ever needing to understand the rest of your codebase. Untyped code gives the agent nothing to push against, so it guesses, and guesses are how hallucinated method names sneak in.

A concrete contract: frontend talking to backend

Say you are building a feature to list a user's portfolio holdings. The frontend (React) and the backend (a Python FastAPI service) are two separate modules, often built by two separate agent sessions. The contract between them is the API schema. Define it first:

  • GET /api/v1/holdings?accountId={uuid}
  • Returns 200 with a JSON array of { ticker: string, shares: number, marketValue: number, currency: string }
  • Returns 404 if the account does not exist, 401 if the caller is not authorized

On the backend you encode that as a Pydantic model, which is the single source of truth:

from pydantic import BaseModel, Field

class Holding(BaseModel):
    ticker: str
    shares: float
    market_value: float = Field(alias="marketValue")
    currency: str

FastAPI generates an OpenAPI document from that model automatically. OpenAPI 3.1 (the version FastAPI emits, aligned with JSON Schema 2020-12) is just a machine-readable description of every endpoint, its parameters, and its response shapes. From that one document, both sides stay honest: the frontend runs a generator (openapi-typescript or similar) to produce TypeScript types that exactly match the server, and any drift becomes a compile error in the React app. Change shares from a number to a string on the server and the frontend build breaks instantly, which is precisely what you want.

The payoff for AI work: once the OpenAPI schema exists, you can hand one agent the frontend with a mock server generated from the spec, and another agent the backend, and they work in parallel, each blind to the other's code. They are not coordinating through you. They are coordinating through the contract.
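The same idea can be shown without any tooling. The sketch below is a standard-library stand-in for what generated types do at compile time: it checks a response body against the holdings shape from the contract above. The HOLDING_CONTRACT table and the function name are assumptions made for the example. In a real project the OpenAPI document, not a hand-written dict, would be the source of truth.

```python
import json

# The agreed response shape from the contract: field name -> accepted types.
HOLDING_CONTRACT = {
    "ticker": str,
    "shares": (int, float),
    "marketValue": (int, float),
    "currency": str,
}


def contract_violations(body: str) -> list[str]:
    """Check a response body against the holdings contract; return problems."""
    payload = json.loads(body)
    if not isinstance(payload, list):
        return ["response must be a JSON array"]
    problems = []
    for index, item in enumerate(payload):
        if not isinstance(item, dict):
            problems.append(f"[{index}] must be an object")
            continue
        for field, expected in HOLDING_CONTRACT.items():
            if field not in item:
                problems.append(f"[{index}] missing {field}")
            # bool is a subclass of int in Python, so reject it explicitly.
            elif isinstance(item[field], bool) or not isinstance(item[field], expected):
                problems.append(f"[{index}] {field} has the wrong type")
    return problems
```

Run against a mock server or the real backend, a check like this surfaces drift (a renamed field, a number that became a string) on whichever side moved away from the contract.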

Design the interface first, implement behind it

The discipline is old and it survives AI intact: agree on the seam before you fill it in. Write the function signature, the Pydantic model, the OpenAPI path, or the database schema first. Get it reviewed (by you, or by a second model acting as critic). Only then ask an agent to implement the body. A stable, explicit interface turns a vague "build me a holdings feature" into two well-scoped, independently testable jobs, and it gives every tool in your stack something concrete to validate against.

  • Write the contract before the code. Signature, schema, or interface first; implementation second.
  • Make it typed and machine-checkable. A contract a tool can enforce (TypeScript, Pydantic, OpenAPI) catches mismatches in the editor, not in production.
  • Keep one source of truth. Generate the other side from it (types from the schema) rather than hand-syncing two copies that will inevitably drift.
  • Let the contract be the coordination layer. When the seam is explicit, agents (and people) can build each side in parallel without reading each other's code.

Separation of concerns in practice

Separation of concerns is the oldest idea in software architecture, and it is also the single most important habit for vibecoding. Not because it makes your code prettier, but because it controls the blast radius of every change an AI makes. When you ask an agent to "add a discount field to checkout," you want it touching one thing, not three. The way you guarantee that is by keeping your concerns physically separate in the first place.

The five concerns of a typical web app

Most products, whether you build them in Cursor, Lovable, or plain VS Code, reduce to five layers. Keep them distinct and swappable:

  • Auth: who is this user, are they logged in, what are they allowed to do. (Clerk, Auth.js, Supabase Auth, or your own.)
  • Data / persistence: how rows are read and written. Schema, migrations, queries. (Postgres via Prisma or Drizzle, Supabase, etc.)
  • Business logic: the rules that make your product yours. "A subscription downgrades at period end, not immediately." Pure functions, no HTTP, no SQL syntax.
  • API layer: the contract between client and server. Route handlers, input validation, serialization. (Next.js route handlers, tRPC, REST endpoints.)
  • UI: components and pages. Renders state, fires events, knows nothing about how data is stored.

Why mixing them makes AI dangerous

The classic vibecoding disaster is a React component that opens a database connection, runs a raw SQL query, checks the user's role inline, applies the pricing rules, and renders a button, all in one 300-line file. A human can hold that mess in their head for a while. An AI agent given the prompt "make the pricing table look nicer" will happily rewrite that file and, somewhere in the diff, drop your role check or change WHERE status = 'active' to something subtly wrong. You asked for a UI change and got a security regression, because in that file the UI was the security check.

Separation fixes this mechanically. If the role check lives in an auth module, the SQL lives in a data module, and the pricing lives in a business-logic module, then "make the pricing table look nicer" can only touch the component file. The agent has nothing else to break, because nothing else is there. Layers are how you turn a vague prompt into a bounded edit.

A simple layered structure

You do not need a framework or a clever pattern. A folder convention is enough, and the AI will follow it if you state it once in your rules file:

src/
  auth/        // session, current user, permission checks
  db/          // schema + query functions, the ONLY place SQL lives
  domain/      // pricing, subscriptions: pure logic, no I/O
  api/         // route handlers: validate input, call domain, return JSON
  ui/          // components and pages: call api, render

The rule that makes it work is the dependency direction: UI calls API, API calls domain and db, domain calls nothing. A component never imports from db/. If you ever see a database query inside a .tsx file, that is the smell to fix before it spreads. Put one line in your CLAUDE.md or .cursorrules: "Database access only in src/db. UI components must not import from src/db or src/auth directly." Agents respect explicit constraints far better than implied ones.
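The dependency direction is simple enough to check mechanically. Here is a minimal sketch that takes (importing file, imported file) pairs from whatever import scanner you use (the scanner itself is not shown) and reports edges that break the rule. Where auth sits in the direction is an assumption, since the rule above only covers UI, API, domain and db.

```python
# Allowed dependency direction, taken from the rule above.
# Assumption: the API layer may call auth, and auth may read from db.
ALLOWED = {
    "ui": {"api"},
    "api": {"domain", "db", "auth"},
    "domain": set(),
    "db": set(),
    "auth": {"db"},
}


def layer_of(path: str) -> str:
    """Map 'src/ui/PricingTable.tsx' to its layer name, 'ui'."""
    parts = path.split("/")
    if len(parts) < 3 or parts[0] != "src" or parts[1] not in ALLOWED:
        raise ValueError(f"not inside a known layer: {path}")
    return parts[1]


def violations(edges: list[tuple[str, str]]) -> list[str]:
    """Return 'importer -> imported' for every edge that breaks the rule."""
    bad = []
    for importer, imported in edges:
        src, dst = layer_of(importer), layer_of(imported)
        # Imports within the same layer are always fine.
        if src != dst and dst not in ALLOWED[src]:
            bad.append(f"{importer} -> {imported}")
    return bad
```

Wired into CI, a non-empty result fails the build, so a component importing from db/ is caught at review time instead of spreading.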

Concrete example: changing the discount rule

Say a customer reports that an expired coupon still applies. With mixed code, you would grep for "discount" and find it in four files, fix three, miss one. With layered code, the rule lives in domain/pricing.ts in a function like applyDiscount(cart, coupon). You point the agent at exactly that function, it edits a few lines, and your existing unit test on that pure function tells you instantly whether it is right. The UI, the API, and the database never move. That is a five-minute fix instead of an afternoon of regression-hunting.
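A sketch of what such a pure function and its fix might look like, written in Python for illustration. The Coupon fields, the use of a subtotal in cents instead of a full cart object, and the rule that a coupon is valid through its expiry date are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Coupon:
    code: str
    percent_off: int   # whole percent, 0-100
    expires_on: date   # last day the coupon is valid (assumed rule)


def apply_discount(subtotal_cents: int, coupon: Optional[Coupon], today: date) -> int:
    """Return the total in cents. Pure: no clock, no database, no HTTP."""
    if coupon is None or today > coupon.expires_on:
        return subtotal_cents  # absent or expired coupons change nothing
    discount = subtotal_cents * coupon.percent_off // 100
    return subtotal_cents - discount
```

Passing today in as an argument, rather than reading the clock inside the function, is what keeps the expiry case testable with fixed dates.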

One concern, one agent

This is where a multi-provider setup pays off. Because the layers are isolated, you can assign them to separate agents, sessions, or even providers without them stepping on each other. Cursor 3.2 (April 2026) ships an Agents Window and a /multitask mode that fans work out to parallel subagents, often using git worktrees so each runs on its own branch. Lovable's Agent Mode lets one person refine the frontend while another extends the backend through GitHub without blocking. The pattern that holds up in practice:

  • One session (or your strongest model) owns the domain layer, where bugs are expensive and context matters most.
  • A separate, cheaper session grinds through UI components against a fixed API contract.
  • A focused session handles migrations in db/, in its own worktree, so a schema change never collides with UI work.

The contract between them is the API layer and the type definitions. As long as those stay stable, the agents can work in parallel and you review one bounded diff per concern. Mixed-up code makes this impossible: every agent touches every file, and you are back to merge conflicts and mystery regressions.

Takeaway: Separation of concerns is not architectural vanity in vibecoding; it is your safety system. Isolated layers shrink each AI edit to a known surface, make changes reviewable, and let you split work across agents without chaos. If you do one thing before scaling up your AI usage, keep your database out of your components.

Vertical slices vs horizontal layers

When you hand work to an AI, you are really deciding how to cut it. There are two basic shapes, and the choice matters more with AI than with a human team, because the AI has no memory of yesterday and no instinct for what "done" means. Pick the wrong shape and you get a pile of half-built layers that compile, demo nothing, and quietly drift apart.

The two shapes

Horizontal layers: you build by tier. First the whole database schema, then the entire API surface, then all the UI. Each pass is wide and shallow. This is how a lot of people instinctively prompt: "design the full data model for the app," then later "now build all the endpoints."

Vertical slices: you build one complete feature end to end, through every tier, before starting the next. "A user can reset their password" touches the database (a reset-token table), the backend (request-reset and confirm-reset endpoints, an email send), and the UI (a form, a success state, an error state). When that slice works, a real person can actually do the thing.

Why vertical wins for AI-assisted work

A vertical slice is a shippable, testable increment. That single property fixes most of the ways AI coding goes wrong:

  • It closes the feedback loop. You can click the button and see the email arrive. With horizontal layers you cannot test anything real until all three layers exist, so bugs accumulate in the dark and surface at integration time, which is the worst time.
  • It contains the blast radius. If the AI invents a wrong assumption (it will), the damage is one feature, not your entire schema. A bad slice is a clean revert. A bad horizontal pass is woven through everything built on top of it.
  • It keeps the model honest about scope. "Build all the endpoints" invites the AI to confidently scaffold thirty routes you do not need yet, half of them hallucinated against a schema that will change. One slice is small enough that you can read every line it wrote.
  • It survives the AI's amnesia. Tools like Cursor or Claude Code work best when each task is self-contained. A slice fits in one focused session; a half-finished horizontal layer relies on context the next session will not have.

A concrete slice

Say you are adding "user can export their transactions to CSV" to a finance app. The horizontal instinct is to add export columns across every table, then build a generic export service, then wire up buttons everywhere. Instead, slice it:

  1. Define done. "Logged-in user clicks Export on the Transactions page, gets a CSV of their own transactions, with the right columns, and nobody else's data."
  2. One prompt, one slice. Ask the AI for exactly that: the GET /transactions/export endpoint scoped to the current user, the CSV serialization, and one button with a loading and error state. No generic export framework.
  3. Test the slice as a user. Click it. Open the file. Then test the scary case: log in as a second account and confirm you cannot pull the first user's rows. That authorization check is exactly the kind of thing AI quietly skips, and it is exactly what a vertical slice forces you to verify.
  4. Commit, then move on. Now extend to "export filtered by date range" as the next slice, on solid ground.
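
A minimal sketch of the core of that slice, in TypeScript. The Transaction shape, the column set, and the function names are assumptions for illustration; a real endpoint would take the user id from the authenticated session and apply the same ownership filter in the database query.

```typescript
// Hypothetical row shape; the real columns come from your schema.
type Transaction = {
  id: string;
  userId: string;
  date: string; // ISO date, e.g. "2026-01-15"
  description: string;
  amountCents: number;
};

// Quote a field only when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Serialize only the rows owned by currentUserId. The ownership filter lives
// inside the export function so no caller can forget it.
function exportTransactionsCsv(all: Transaction[], currentUserId: string): string {
  const header = "date,description,amount";
  const rows = all
    .filter((t) => t.userId === currentUserId)
    .map((t) => [t.date, csvField(t.description), (t.amountCents / 100).toFixed(2)].join(","));
  return [header, ...rows].join("\n");
}
```

The "scary case" from step 3 then becomes a one-line check: export as the second user and confirm none of the first user's descriptions appear in the output.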

When horizontal is actually right

Layers are not the enemy; premature width is. Go horizontal when the layer is genuinely a shared foundation that several upcoming slices need and that you understand well: setting up auth once, picking and configuring the database, establishing a design-system component library, or a one-time data migration. The honest test: build a layer up front only when you are confident about its shape and at least two or three concrete slices are waiting on it. If you are guessing, slice instead and let the foundation emerge from real features.

Takeaway: default to vertical slices. Cut work so that every unit of AI output is something a real user can do, end to end, and something you can revert in one commit. Reach for a horizontal layer only when it is a known, shared foundation that several slices clearly depend on.

Repo structure: monorepo, polyrepo, and folders an AI navigates

An AI agent does not "know" your codebase the way you do. It rebuilds a mental model from scratch every session by reading file names, opening files, and grepping for symbols. That means your folder layout is not cosmetic. It is the interface the agent uses to find things, and a tidy, conventional structure measurably reduces guessing. When the agent has to infer where the auth logic lives, it sometimes guesses wrong, writes a duplicate, or edits a stale copy. When the path is obvious, it goes straight there. Repo structure is, quietly, one of the highest-leverage things you control.

Monorepo vs polyrepo, through an agent's eyes

The classic trade-offs still hold (one repo is easier to refactor across but harder to scale; many repos isolate teams but fragment context), but AI shifts the weighting. The thing an agent is worst at is the thing it cannot see. In a polyrepo setup, when you ask it to change an API endpoint, the frontend that consumes that endpoint lives in a different repo the agent never loaded, so it happily changes one side of the contract and breaks the other. In a monorepo, both sides sit in one tree, and a half-decent agent will grep across them and catch the dependency.

So for most small teams and solo builders vibecoding a product, a monorepo is the better default. You get unified context, atomic cross-cutting changes, and one place for the agent to look. The honest caveat: monorepos can grow large enough that the agent wastes its context window wandering. That is exactly what tooling like Nx and Turborepo exists for: a dependency graph and clear package boundaries so both you and the agent can reason about one slice at a time. Reach for polyrepo when repos have genuinely separate lifecycles, owners, or security boundaries, not just because the codebase "feels big".

A folder structure an agent navigates well

The principles are boring and that is the point. Predictable beats clever. An agent that has read ten thousand repos has strong priors about where things live, so meeting those priors costs it nothing.

  • Clear top-level modules. Group by domain or service, not by a deep tangle of technical layers. apps/ and packages/ is a convention the model recognizes instantly.
  • Predictable naming. Pick one casing and one vocabulary and hold it. If it is user-service here, do not make it UserSvc there.
  • Colocated tests. A billing.ts next to billing.test.ts tells the agent exactly what to update when it changes behavior, and it will.
  • A README per module. A few lines on what the module does and its key entry points saves the agent a dozen file reads.

Concrete example, a small SaaS monorepo:

  • apps/web/ (Next.js frontend, with its own README.md)
  • apps/api/ (backend service, README.md)
  • packages/db/ (schema, migrations, queries)
  • packages/auth/ (auth.ts, auth.test.ts side by side)
  • packages/ui/ (shared components)
  • AGENTS.md at the root, plus a focused AGENTS.md inside apps/api/

An agent told to "add rate limiting to the login route" can now reason its way there: login is auth, auth is packages/auth, the route lives in apps/api, and there is a test file to extend. No archaeology required.

The map at the top: AGENTS.md and CLAUDE.md

Structure shows the agent where things are; a root instruction file tells it the things it cannot infer. AGENTS.md has become the near-universal format here: a plain-Markdown file, stewarded by the Linux Foundation as of late 2025, read natively by Claude Code, Cursor, GitHub Copilot, Codex CLI, Aider and others. Claude Code also reads CLAUDE.md; the simplest move is to keep one canonical file and symlink the other. Put your build and test commands, the directory map, naming conventions, and hard rules ("never edit packages/db/migrations by hand") in it. You can nest these files: a root one for global rules, a per-module one for local detail the agent loads only when working in that folder.

Takeaway: default to a monorepo with apps/ and packages/, colocate tests, drop a short README in every module, and write an AGENTS.md as the map. You are not decorating, you are reducing the number of wrong guesses the agent can make.

The spec and docs layer the AI actually reads

Here is the highest-leverage, lowest-cost move in all of AI-assisted development, and almost nobody does it properly: write down what your project is, how it is built, and what the agent must never do. Put it in a file the agent reads automatically. That is it. A coding agent without a project doc is a senior contractor who walked in off the street, never saw your codebase, and started writing code based on vibes. A coding agent with a good project doc is one who read the onboarding wiki first. The second one makes a fraction of the wrong assumptions.

Why a spec beats prompting blind

When you prompt an agent cold ("add a webhook handler for Stripe"), it has to guess everything that is not in the prompt: your folder layout, your test runner, whether you use Prisma or raw SQL, whether secrets live in .env or a vault, whether you even allow new dependencies. Each guess is a coin flip, and wrong guesses compound. You then spend your review budget catching a hand-rolled HTTP client when you already had one, or a migration written in the wrong tool.

A written spec collapses those coin flips into facts. The agent stops inventing conventions because the conventions are on the page. This is true for the per-task spec (a short doc describing what you are about to build and the acceptance criteria) and for the standing project doc that every task inherits. Writing the spec first also forces you to think, which is often where the real bugs get caught, before a single line is generated.

CLAUDE.md, AGENTS.md, and friends

The conventions have mostly settled by mid-2026. AGENTS.md is the emerging cross-tool standard: a plain Markdown file at the repo root that multiple coding agents look for. CLAUDE.md is Claude Code's equivalent, loaded automatically into context at the start of a session (it also supports nested files, so a subdirectory can carry its own). Many teams keep a one-line CLAUDE.md that just points at AGENTS.md so they maintain a single source of truth. Alongside those, a short README per module and your mission/spec docs cover the "why" that a root-level file is too coarse to hold.

What actually goes in the file

Keep it tight and specific. A bloated doc is as useless as none, because the agent skims and you stop maintaining it. A strong AGENTS.md / CLAUDE.md covers, with real values, not placeholders:

  • Stack and versions: "Next.js 15 (app router), TypeScript strict, Postgres via Drizzle, pnpm." No guessing.
  • Commands: exactly how to run tests, lint, typecheck, and the dev server. pnpm test, pnpm lint, pnpm typecheck. The agent will run these to check its own work if you tell it they exist.
  • Conventions: "Server components by default; mark client components explicitly. Validate all input with Zod at the boundary. No any."
  • Architectural boundaries: "Domain logic lives in src/core and must not import from src/web. Database access only through the repository layer."
  • Do-nots: "Never edit files in generated/. Never add a dependency without asking. Never touch the auth or billing code without a callout. Never write to production config."
  • How to commit: "Conventional Commits, present tense. One logical change per commit. Do not commit secrets or .env files."
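
A boundary rule like "src/core must not import from src/web" is easier to keep when something checks it mechanically. The following is a rough sketch under stated assumptions: the SourceFile shape and findBoundaryViolations are hypothetical names, files are passed in as strings rather than read from disk, and the regex only catches simple `from "..."` imports.

```typescript
// Hypothetical in-memory view of source files; a real check would read them from disk.
type SourceFile = { path: string; source: string };

// Return "path -> specifier" for every file under src/core that imports from a web/ folder.
function findBoundaryViolations(files: SourceFile[]): string[] {
  const violations: string[] = [];
  for (const file of files) {
    if (!file.path.startsWith("src/core/")) continue;
    const importPattern = /from\s+["']([^"']+)["']/g;
    let match: RegExpExecArray | null;
    while ((match = importPattern.exec(file.source)) !== null) {
      const specifier = match[1];
      // Match "web" only as a whole path segment, so "websocket" is not flagged.
      if (/(^|\/)web(\/|$)/.test(specifier)) {
        violations.push(`${file.path} -> ${specifier}`);
      }
    }
  }
  return violations;
}
```

A check of this kind could run alongside the lint command the file already lists, so the agent sees a violation as a failing step rather than as a rule it has to remember.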

A concrete starter for a small SaaS backend:

  Stack: Node 22, Fastify, Postgres (Drizzle), pnpm.
  Run: pnpm dev, pnpm test (Vitest), pnpm lint, pnpm typecheck before every commit.
  Layout: routes in src/routes, business logic in src/services, DB only via src/db/repos.
  Rules: validate request bodies with Zod; no raw SQL outside repos; no new deps without asking.
  Never touch: src/db/migrations (generated), infra/.
  Commits: Conventional Commits, one change each, never commit .env.

That is maybe twelve lines, and it eliminates the ten most common categories of wrong guess. You can have an agent draft a first version by pointing it at the repo and asking it to document what it observes, then you correct the parts it got wrong. That correction pass is worth doing by hand because it is precisely the institutional knowledge the agent could not infer.

Takeaway

Before your next AI build session, spend twenty minutes writing an AGENTS.md (and a one-line CLAUDE.md pointing to it) with your stack, commands, boundaries, and do-nots. Treat it as living: when the agent makes a wrong assumption, do not just fix the code, add the rule to the file so it never makes that assumption again. It compounds. A good project doc is the single best return on effort in AI-assisted work, and it costs you a coffee break. For the current AGENTS.md convention see agents.md.

Tests are the safety net that lets AI move fast

Here is the core problem with letting an AI agent touch your code: it can change a hundred lines across nine files in thirty seconds, and you cannot possibly re-read all of it with the same care you would give your own work. So either you slow down to a crawl and review every diff line by line, or you trust the machine and pray. Automated tests are the third option, and they are the only one that scales. A good test suite is a promise the code makes to itself: "if I still pass, I still work." That promise is exactly what lets an agent move fast without you holding your breath.

Why tests matter more with AI, not less

Without AI, you wrote the change, so you held a mental model of what it should do. With an agent, that model lives somewhere you cannot inspect. The agent might "fix" a bug by deleting the code path that triggered it, or satisfy your request in a way that quietly breaks something three modules away. Tests turn an unverifiable claim ("I refactored the billing logic and it's fine") into a checkable fact ("228 tests pass, including the 14 covering proration"). In practice this changes the whole loop: tools like Claude Code, Cursor, and Codex can run the suite themselves after every change, read the failures, and iterate until green before they ever hand the diff back to you. You are reviewing a result that already cleared a bar, not a hopeful first draft.

What to test first

You do not need 100% coverage, and chasing it is a waste. Cover the things that hurt when they break, in this order:

  • The money. Anything that charges a card, computes a balance, applies a discount, or splits a payout. A wrong invoice is not a bug, it is a refund and a churned customer.
  • The critical paths. Signup, login, the one workflow that is the actual product. If these break in production you find out from angry users, which is the worst monitoring system there is.
  • The contracts. The shape of your API responses, the columns a function returns, the events other services consume. These are the seams where an agent's local change leaks into a global break.

Test behavior at these boundaries, not implementation trivia. A test that asserts "POST /orders with an out-of-stock item returns 409 and does not charge the card" is durable. A test that asserts a private helper was called twice will fight every legitimate refactor the agent tries.
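
A small sketch of what a boundary-level check can look like. Everything here is illustrative: placeOrder, its parameters, and the 201 success status are assumptions, and the card charge is injected as a plain function so the check can observe it without a real payment provider.

```typescript
type OrderResult = { status: number; charged: boolean };

// Hypothetical order handler. Stock and card charging are passed in so a test
// can observe behavior at the boundary without reaching real services.
function placeOrder(
  itemId: string,
  stock: Record<string, number>,
  chargeCard: (itemId: string) => void,
): OrderResult {
  if ((stock[itemId] ?? 0) <= 0) {
    return { status: 409, charged: false }; // conflict: nothing to sell, nothing to charge
  }
  chargeCard(itemId);
  return { status: 201, charged: true };
}

// Behavior check: assert what a caller can observe (status code, whether a
// charge happened), not which private helpers ran or how often.
function outOfStockIsRejectedWithoutCharge(): boolean {
  let charges = 0;
  const result = placeOrder("sku-1", { "sku-1": 0 }, () => {
    charges += 1;
  });
  return result.status === 409 && charges === 0;
}
```

The check survives any refactor of placeOrder's internals, because it only pins down the status code and the absence of a charge.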

The trap: tests that just rubber-stamp the code

This is the one that bites people who let the agent write tests unsupervised. Ask an AI to "add tests for this function" and it will happily read the implementation and assert exactly what the code currently does. If the function has a bug, the test now enshrines the bug as expected behavior. This is the oracle problem: the test's source of truth is the buggy code, not the intended spec, so it passes forever and protects nothing. Snapshot tests are especially prone to this. A giant auto-generated snapshot that just records "whatever came out last time" turns every real regression into a noisy diff that everyone reflexively re-blesses.

The fix is to feed the agent the intended behavior, not let it reverse-engineer it. Concrete example: you have a calculateRefund(order, daysUsed) function and you suspect it is off by a day. Do not say "write tests for calculateRefund." Say "write tests asserting that a 30-day plan canceled after 10 days refunds exactly 2/3 of the price, and that canceling on day 0 refunds the full amount." Now the test encodes the rule. Run it, and if the existing code fails, you have just found the bug instead of cementing it. Write the test from the requirement, then let the implementation prove itself against it.
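
As a sketch, the two rules from that prompt can be written down as data before looking at the implementation. The Order shape (integer cents, a planDays field) and the pro-rata formula are assumptions made for the example; only the function name and the two cases come from the text above.

```typescript
// Hypothetical order shape; amounts are integer cents to avoid float drift.
type Order = { priceCents: number; planDays: number };

// Pro-rata refund for the unused days of the plan (assumed business rule).
function calculateRefund(order: Order, daysUsed: number): number {
  const unusedDays = Math.max(order.planDays - daysUsed, 0);
  return Math.round((order.priceCents * unusedDays) / order.planDays);
}

// Each case is copied from the requirement, not from the implementation.
const refundSpec: Array<{ name: string; daysUsed: number; expectedCents: number }> = [
  { name: "30-day plan canceled after 10 days refunds 2/3", daysUsed: 10, expectedCents: 2000 },
  { name: "canceled on day 0 refunds the full price", daysUsed: 0, expectedCents: 3000 },
];

// Returns the names of the spec cases the given implementation violates.
function failingRefundCases(refund: (order: Order, daysUsed: number) => number): string[] {
  const order: Order = { priceCents: 3000, planDays: 30 };
  return refundSpec
    .filter((c) => refund(order, c.daysUsed) !== c.expectedCents)
    .map((c) => c.name);
}
```

An implementation that is off by a day fails the first case immediately, which is the point: the spec has an opinion of its own.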

Practical takeaway

Tests are how you trust changes you did not personally verify line by line. To keep that trust honest:

  • Make the agent run the suite after every change and paste the output. Green is the entry fee for review, not the conclusion.
  • Write the test from the requirement, especially for money and contracts, so it has an opinion the code has to satisfy.
  • Before trusting a new test, sanity-check it by breaking the code on purpose. A test that passes when the feature is broken is worse than no test, because it is a green light wired to nothing.
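
The "break it on purpose" check can be made concrete by writing the test as a function of the implementation and pointing it at a deliberately broken copy. This is a hand-rolled sketch with hypothetical names (applyDiscount, brokenApplyDiscount), not a replacement for a real test runner or mutation-testing tool.

```typescript
type DiscountFn = (totalCents: number, percent: number) => number;

// Implementation under test.
const applyDiscount: DiscountFn = (totalCents, percent) =>
  Math.round((totalCents * (100 - percent)) / 100);

// Deliberately broken variant: ignores the discount entirely.
const brokenApplyDiscount: DiscountFn = (totalCents) => totalCents;

// The test, written as a function of the implementation so it can be pointed
// at the broken variant too.
function discountTestPasses(impl: DiscountFn): boolean {
  return impl(10000, 20) === 8000 && impl(10000, 0) === 10000;
}

// A trustworthy test passes on the real code and fails on the broken code.
function testHasTeeth(): boolean {
  return discountTestPasses(applyDiscount) && !discountTestPasses(brokenApplyDiscount);
}
```

If testHasTeeth() ever returns false, the test is either failing on correct code or passing on broken code, and in both cases it should not be trusted yet.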

Git discipline when an AI is editing your code

An AI agent is the fastest pair of hands you have ever worked with, and the most confident. It will refactor across forty files, rename a function it does not fully understand, and delete a config it decided was dead, all in one turn, all narrated in a cheerful summary that reads as if everything went perfectly. Git is the only thing standing between that confidence and your weekend. It is your undo button, your audit log, and your second opinion. Once an agent has write access to your repo, version control stops being good hygiene and becomes load-bearing.

Commit small, commit often, branch per unit of work

The single highest-leverage habit is shrinking the blast radius of any one change. Two rules do most of the work:

  • One branch per feature or module. git switch -c feat/checkout-rework before you point the agent at the checkout flow. If the whole experiment turns out to be a mess, you delete the branch and your main never noticed.
  • Small, frequent commits with honest messages. Commit after each coherent step that works, not once at the end of a two-hour session. A history of fifteen tight commits is fifteen safe places to land. One giant commit is a cliff.

This is also why the IDE checkpoint features are not a substitute for git. As of early 2026, Cursor and Claude Code both offer in-editor checkpoints that let you rewind the agent's edits. They are genuinely useful for "undo the last thing the agent did," but they are scoped to the tool's own edits, often not to commands the agent runs in the terminal, and they live inside the editor rather than in your repo. Treat them as a fast local undo, and treat git as the real record.

Review the diff before you commit. Every time.

The non-negotiable rule: read the diff before anything enters a commit. git diff shows unstaged edits; git diff --staged shows exactly what the next commit will contain (your editor's staged view works too). The agent's prose summary is a description of what it intended. The diff is what it actually did. Those are not always the same sentence.

This is exactly where blind bulk staging gets people hurt. A real and common failure: you ask an agent to "clean up the project," it runs git add -A on your behalf or you do it reflexively, and the staged set quietly includes the deletion of files the agent never created and did not understand: your .env.example, a migration, a fixtures directory it judged "unused." git commit -am sweeps all of it in. Now the deletion is in your history, and if you pushed, in everyone's. The fix is boring and reliable: never stage with -A blindly. Stage deliberately (git add path/to/file), and scan git status for any deleted: line you did not ask for before you commit.
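
Scanning for unexpected deletions can be automated. The sketch below parses the text that `git status --porcelain` prints (two status characters, a space, then the path) and returns every deleted path; findDeletions is a hypothetical helper name, and paths that git quotes because of special characters are not handled.

```typescript
// Parse `git status --porcelain` (v1) output and list every deleted path.
// Column 1 is the index status, column 2 the working-tree status.
function findDeletions(porcelain: string): string[] {
  return porcelain
    .split("\n")
    .filter((line) => line.length > 3)
    .filter((line) => line[0] === "D" || line[1] === "D")
    .map((line) => line.slice(3));
}
```

Fed the status output before a commit, any path it returns that you did not ask the agent to delete is a reason to stop and look.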

Recovery: the commands to know cold

When an agent breaks something, your move depends on whether the damage is committed, pushed, or still in the working tree.

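  • Throw away uncommitted changes to one file: git restore path/to/file discards unstaged edits to that file; if you already staged them, git restore --staged --worktree path/to/file resets it all the way back to the last commit. To nuke all unstaged changes: git restore . (and git clean -nd to preview untracked junk before git clean -fd removes it).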
  • Undo a bad commit you have NOT pushed, keeping the changes to re-edit: git reset --soft HEAD~1 moves the commit off but leaves your work staged. Use --hard only when you truly want the changes gone.
  • Undo a commit you HAVE pushed (or shared): git revert <sha>. This creates a new commit that inverts the bad one, so you never rewrite shared history.
  • Recover a file from an earlier commit: git restore --source=<sha> path/to/file pulls one file back from a known-good point without touching the rest.
  • "I lost work and panicked": git reflog lists where HEAD has been, including states that look gone. Find the sha, then git checkout <sha> or git reset --hard <sha>. The reflog has saved more careers than any backup.

A concrete sequence: the agent "refactors" auth, tests pass, you commit, then ten minutes later realize it silently dropped the rate-limiter. You have not pushed. git revert is overkill; git reset --soft HEAD~1 brings the change back into staging, you fix the one missing piece, recommit clean. If you had already pushed, you would git revert instead and follow up with a fixed commit.
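
The decision in that sequence is mechanical enough to write down. A small sketch, with a hypothetical Damage type, that maps the situation to the command from the recovery list; `<sha>` and `path/to/file` are placeholders to fill in by hand.

```typescript
// Where the damage currently lives.
type Damage =
  | { where: "working-tree"; scope: "one-file" | "everything" }
  | { where: "committed"; pushed: boolean };

// Pick the recovery command for the situation.
function recoveryCommand(damage: Damage): string {
  if (damage.where === "working-tree") {
    return damage.scope === "one-file" ? "git restore path/to/file" : "git restore .";
  }
  // Shared history is never rewritten: pushed commits get a revert.
  return damage.pushed ? "git revert <sha>" : "git reset --soft HEAD~1";
}
```

The only branch that matters under pressure is the last one: once a commit is pushed, revert instead of reset.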

Takeaway: branch per feature, commit small and often, read the diff before every commit, and never git add -A on faith. Learn restore, reset, revert, and reflog before you need them, because the day you need them, the agent will have already moved on to the next file.

Hallucinations and failure modes

Why AI hallucinates code, and what that looks like

A coding model is not looking up the right answer. It is predicting the next most likely token given everything before it. That distinction is the whole chapter. When the most plausible continuation happens to also be correct, you get working code, which is most of the time. When plausible and correct diverge, plausibility wins, because the model has no internal mechanism that privileges truth over fluency. It will write a confident, well-formatted, perfectly named function that calls an API that has never existed. The danger is not that the output looks broken. It is that it looks exactly right.

The mechanism, briefly

Trained on a huge corpus of code, the model learns the shape of correct software: naming conventions, idiomatic patterns, how a library tends to be called. It does not learn a verified registry of what is real. So when you ask for something on the edge of or just outside its training, it interpolates. It produces what a real method would plausibly be named, what a real package would plausibly be called. Fluency and accuracy usually coincide in code, which is exactly why the failures are so dangerous when they don't: the wrong answer wears the same clothes as the right one.

The four failure modes you will actually hit

  • Inventing methods and APIs. The model calls response.parse_json() on an HTTP client whose real method is response.json(), or invents a config flag that sounds canonical but was never shipped. It guessed the shape of the library instead of recalling its actual surface.
  • Outdated or shifted versions. Training data is a snapshot. A model confidently writes code against an API that was renamed, deprecated, or restructured after its cutoff. The classic example is a major library that changed its import style or auth flow between versions: the AI produces the old, plausible, no-longer-correct pattern.
  • Hallucinated package names. The model recommends pip install or npm install for a package that does not exist. It sounds completely legitimate because the model is good at naming things the way humans name things.
  • Subtly wrong logic. The worst category. The code runs. It passes a casual glance. But it uses a < where it needed <=, handles the empty-list case wrong, or computes a date offset that lands on the wrong day once timezones are involved. No error, no red squiggle, just a quiet defect that ships.
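
To make the last category concrete, here is a minimal illustration of the < versus <= defect. The 30-day refund window is a made-up policy for the example; both functions run without error, and they disagree on exactly one input.

```typescript
const REFUND_WINDOW_DAYS = 30; // hypothetical policy: refunds allowed through day 30

// Buggy: `<` silently excludes the last valid day.
function canRefundBuggy(daysSincePurchase: number): boolean {
  return daysSincePurchase < REFUND_WINDOW_DAYS;
}

// Correct for the policy as stated: day 30 is still inside the window.
function canRefund(daysSincePurchase: number): boolean {
  return daysSincePurchase <= REFUND_WINDOW_DAYS;
}
```

A casual test with day 5 or day 45 passes for both versions. Only a test at the boundary, day 30 itself, tells them apart.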

When a hallucination becomes an attack: slopsquatting

The package-name failure has grown into a named supply-chain threat. Security researchers studying AI-generated code found that every model they tested hallucinated package names, with rates ranging widely across languages and models, and crucially that many hallucinations recur: the same fake name gets suggested again and again. Attackers noticed. They register the hallucinated name on PyPI or npm, fill it with malicious code, and wait for the next developer to paste an AI suggestion and run install. This is "slopsquatting," and unlike typosquatting it does not rely on you fat-fingering a URL. It relies on you trusting plausible output. (See Socket's writeup and Help Net Security.)

A concrete example

Ask a model to "fetch and cache rates from a currency API in Python" and you might get a tidy block that does import currencyapi, calls currencyapi.get_latest("EUR"), and caches with a decorator. It reads beautifully. Three things can be quietly false: currencyapi may not be a real package (and if an attacker has registered it, you just installed malware), get_latest may not be the real method name, and the cache decorator may never expire, so you serve stale rates forever. Nothing throws an error. You only find out when a customer is billed at last week's exchange rate.
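
The third problem, a cache that never expires, has a small fix worth seeing. The sketch below is in TypeScript to match the stack used elsewhere in this guide; cachedWithTtl and the Rates shape are hypothetical, the fetcher is synchronous to keep the example short, and the clock is injectable so expiry can be checked without waiting.

```typescript
type Rates = Record<string, number>;

// Wrap a fetcher so its result is reused only for ttlMs milliseconds.
function cachedWithTtl(
  fetchRates: () => Rates,
  ttlMs: number,
  now: () => number = Date.now,
): () => Rates {
  let cached: { value: Rates; fetchedAt: number } | undefined;
  return () => {
    const t = now();
    // Refetch on first use and whenever the cached entry is older than the TTL.
    if (cached === undefined || t - cached.fetchedAt >= ttlMs) {
      cached = { value: fetchRates(), fetchedAt: t };
    }
    return cached.value;
  };
}
```

The design choice to make the TTL a required argument is deliberate: a cache with no stated lifetime is the version that bills customers at last week's rate.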

The takeaway

Treat AI code the way you would treat code from a fast, fluent, slightly overconfident junior who never says "I'm not sure." The output being plausible tells you nothing about whether it is correct.

  • Verify every package name against the real registry before you install it. A name you have never heard of is a flag, not a convenience.
  • Cross-check unfamiliar methods and flags against the official docs, not the model's confidence.
  • Reserve your sharpest scrutiny for logic that runs clean, because "it works" and "it is correct" are not the same claim.

Techniques that actually cut hallucination

A model hallucinates when it fills a gap in its knowledge with something plausible instead of something true. In code, that means an invented function, a parameter that does not exist, an API shape from two versions ago, or a package name that was never published. You cannot make a model stop guessing entirely, but you can shrink the gaps it has to guess across and you can catch the guesses fast. The whole game is one sentence, repeated: constrain the surface, then verify against reality.

Ground it in real files and real docs

The single biggest win is refusing to let the model work from memory. If it needs an API, paste the actual reference or point it at the actual file. In Cursor, that means using @ to pull a specific file or the relevant docs into context rather than asking "how does our auth client work" and hoping. Tools like the Context7 MCP server exist precisely for this: they inject up-to-date, version-correct library docs into the prompt so the model quotes the real signature instead of a 2023 one it half-remembers. A model that has the real stripe.checkout.sessions.create options in front of it is far less likely to invent an option that does not exist, or to put a real one at the wrong level of the request.

Keep scope small

"Build me the dashboard" forces the model to invent a hundred unstated decisions, and each invention is a chance to drift. "Add a loading skeleton to the existing InvoiceList component, matching the pattern in OrderList" gives it one job and a reference to copy. Small tasks also produce small diffs, which are the only diffs a human can actually review. Commit after each one so a bad guess costs you a revert, not a debugging session.

Pin and state library versions

Many "this method does not exist" errors are really version mismatches. Tell the model what you are on: "We use Next.js 15 App Router and Tailwind v4." Better, let it read your package.json so it sees the pinned versions directly. Without that, it averages over every version in its training data and hands you a confident blend that compiles in none of them.
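
One way to keep the version statement honest is to generate it from package.json instead of typing it from memory. A rough sketch, assuming a hypothetical describeStack helper that receives the parsed file; it also lists dependencies whose version is a range rather than an exact pin, since those are the ones where "what we are on" is ambiguous.

```typescript
// Minimal slice of package.json; only the dependency map is needed here.
type PackageJson = { dependencies?: Record<string, string> };

// Exact versions look like 15.0.3 or 15.0.3-beta.1; anything else is treated as a range.
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;

// Build a one-line stack statement for a prompt and list unpinned dependencies.
function describeStack(pkg: PackageJson): { statement: string; unpinned: string[] } {
  const deps = pkg.dependencies ?? {};
  const names = Object.keys(deps).sort();
  return {
    statement: "We use " + names.map((n) => `${n}@${deps[n]}`).join(", ") + ".",
    unpinned: names.filter((n) => !EXACT_VERSION.test(deps[n])),
  };
}
```

The statement can be pasted at the top of a prompt or into the project doc; the unpinned list is a to-do for you, not for the agent.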

Make type-checkers and linters the automatic guardrail

This is the part people skip and the part that matters most. A type-checker is a hallucination detector that never gets tired. If the model invents user.fullName on a type that only has firstName and lastName, TypeScript fails the build before the code ever runs. Anthropic's 2026 guidance on agentic coding pushes exactly this: put fast deterministic gates (formatter, linter, tsc, unit tests) in the agent's inner loop so it advances only when checks pass. Run an agent against a failing test or a red type-checker and it will keep correcting until reality agrees with it. The lesson generalizes: prefer strictly-typed stacks, turn on strict mode, and never silence an error you do not understand just to make the agent happy.
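
The user.fullName case can be reproduced in a few lines. In this sketch the `@ts-expect-error` directive is used as a proof device: the TypeScript compiler reports an error if the following line is not a type error, so the file only type-checks while fullName really is missing from the type. The User type and displayName function are illustrative.

```typescript
type User = { firstName: string; lastName: string };

const user: User = { firstName: "Ada", lastName: "Lovelace" };

// The directive below makes compilation fail unless the next line is a type
// error, so it documents that tsc rejects the invented property.
// @ts-expect-error Property 'fullName' does not exist on type 'User'.
const invented = user.fullName;

// What the type actually supports.
function displayName(u: User): string {
  return `${u.firstName} ${u.lastName}`;
}
```

Without the directive, the invented line is a build failure, which is exactly the signal the agent needs in its inner loop. At runtime the property would simply be undefined, with no error at all.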

Ask it to cite its sources

A cheap, effective trick: "For each function you call, tell me which file or which doc it comes from." Forcing attribution surfaces the bluff. If the model writes "I'm using the parseISO helper from date-fns" you can check that in a second; if it cannot say where something comes from, that is your tell to verify before trusting.

Prefer well-known libraries

The model has seen React, Postgres, and requests a million times and obscure libraries almost never, so its guesses about the popular ones are far more likely to be right. This also defends against a real attack class: slopsquatting. Models invent package names, and the same fake names recur predictably, so attackers pre-register them on npm and PyPI. Research across 16 models found commercial ones hallucinate packages around 5% of the time and open-source ones around 20%, with a large share of fabricated names repeating across runs. In early 2026 a hallucinated npm package, react-codeshift, was seen propagating through AI tooling that tried to install it. Never let an agent npm install a name you have not eyeballed.
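
"Eyeballing" is easier when the new names are pulled out for you. The sketch below compares the dependency map before and after an agent's change and splits the additions into names a human has already vetted and names to look up on the registry by hand. The helper names and the vetted list are hypothetical, and nothing here contacts a registry.

```typescript
type DependencyMap = Record<string, string>;

// Names present after the agent's change that were absent before it.
function addedDependencies(before: DependencyMap, after: DependencyMap): string[] {
  return Object.keys(after)
    .filter((name) => !Object.prototype.hasOwnProperty.call(before, name))
    .sort();
}

// Split additions into already-vetted names and names to check by hand
// before anything is installed.
function triageAdditions(
  added: string[],
  vetted: string[],
): { ok: string[]; checkFirst: string[] } {
  return {
    ok: added.filter((n) => vetted.indexOf(n) !== -1),
    checkFirst: added.filter((n) => vetted.indexOf(n) === -1),
  };
}
```

Anything that lands in checkFirst gets a manual look at its registry page (publisher, age, download counts) before the install command runs.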

Practical takeaway

  • Feed it the real code and real, version-correct docs. Do not let it work from memory.
  • One task per prompt, small diff, commit, repeat.
  • State and pin your versions; let it read package.json.
  • Wire tsc, the linter, and tests into the loop as automatic fact-checkers.
  • Ask "where does this come from?" and verify every new dependency before installing.

The verification loop: do not trust, verify

Every AI building session is the same shape, repeated: you generate, then you verify. The generate step is loud and fun. The verify step is quiet and boring, and it is where the actual quality lives. Skip it and you are not building software, you are accumulating a pile of plausible-looking text that compiles and might even run, right up until it deletes the wrong rows.

The single most important thing to internalize: the AI saying it is done is not the same as it being done. Coding agents are trained to produce confident, well-structured summaries. "Done" is, for the model, a conversational move, not a contract. It will write the files, narrate its progress, print a tidy checklist with green checkmarks in the chat, and stop. You check, and the tests are red. This is not a rare edge case. Industry write-ups through early 2026 keep landing on the same number: roughly 40 to 45 percent of AI-generated code needs human fixing even after it has supposedly passed review. The model is not lying to you. Unless it actually ran the code and read the result, it has no way of knowing whether it is done.

Make the AI prove it, not claim it

The fix is to stop accepting claims and start demanding evidence. For every meaningful change, the loop is: run the code, run the tests, look at the actual output, then check it against what you asked for. Not "does the agent say it works," but "did I watch it work."

  • Run it. Modern agents can do this themselves. Cursor's agents and cloud agents will run your test command, read the failing output, fix, and retry. Tell them to: a line like "run pnpm test after every change and do not stop until it passes" turns "done" from a vibe into a gate.
  • Look at the real output. For anything with a UI, read the screenshot or click through it yourself. As of early 2026, tools like Cursor's browser-using cloud agents, plus Lovable, Bolt, and v0, can render the running app and hand you a live preview. A green test suite and a broken button are fully compatible states.
  • Check against the spec. Re-read your original request next to the diff. Agents love to solve a nearby, easier problem: you asked for "soft delete with a 30-day window," you got a hard DELETE. Tests pass because the agent wrote the tests too.

A concrete example

You ask an agent to "add a coupon code field to checkout that applies a percentage discount." It returns a clean diff, a passing test, and a confident summary. You almost merge it. Instead you actually run the flow: you type SAVE20, and the total drops 20 percent. Good. Then you type save20 in lowercase: full price. You type SAVE20 twice: 40 percent off, stacking infinitely. You type a coupon for a product not in the cart: still applies. The test the agent wrote checked exactly one path, the happy one it had in mind. Three minutes of poking found three real bugs. That three minutes is the job. The generate step gave you a draft; the verify step gave you software.
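
For reference, here is one way the three bugs could be closed. It is a sketch under assumptions the original request never stated: codes are stored uppercase and matched case-insensitively, a code applies at most once per cart, and a coupon lists the product ids it covers. The Coupon and Cart shapes are hypothetical.

```typescript
type Coupon = { code: string; percentOff: number; productIds: string[] };
type Cart = { productIds: string[]; totalCents: number; appliedCodes: string[] };

// Apply a percentage coupon once, case-insensitively, and only when the cart
// holds a product the coupon covers. Returns the cart unchanged otherwise.
function applyCoupon(cart: Cart, coupons: Coupon[], input: string): Cart {
  const code = input.trim().toUpperCase();
  const coupon = coupons.find((c) => c.code === code);
  if (coupon === undefined) return cart;
  if (cart.appliedCodes.indexOf(code) !== -1) return cart; // no stacking
  const applies = coupon.productIds.some((id) => cart.productIds.indexOf(id) !== -1);
  if (!applies) return cart;
  return {
    ...cart,
    totalCents: Math.round((cart.totalCents * (100 - coupon.percentOff)) / 100),
    appliedCodes: [...cart.appliedCodes, code],
  };
}
```

Each guard clause corresponds to one of the three pokes above, and each deserves its own test case next to the happy path.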

Human checkpoints at the risky seams

Most changes can be verified fast and cheaply. A few cannot be undone, and those deserve a hard stop where a human reads every line before it runs. Treat these as non-negotiable manual checkpoints regardless of how green the tests are:

  • Auth and permissions: who can see and do what. A subtle role check the agent "simplified" is a data breach.
  • Payments and billing: charges, refunds, currency, the stacking-coupon problem above.
  • Data deletion and migrations: anything that drops, overwrites, or mass-updates rows. Demand a dry run and a backup first.
  • Anything irreversible: sending real emails to real users, posting to production APIs, deleting files.

For these, never let an agent run the destructive command against production unsupervised. Have it print the plan, the affected count, and the rollback path, then you press the button.
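
The "print the plan, then you press the button" pattern can be sketched as two functions: one that only computes what would be affected, and one that refuses to act unless a human passes back the count they saw. The Row shape, the function names, and the purge rule are assumptions for illustration; rows are an in-memory array standing in for a table, and timestamps are assumed to be same-format UTC ISO strings so plain string comparison orders them correctly.

```typescript
type Row = { id: number; deletedAt: string | null };

type DeletionPlan = { description: string; affectedIds: number[]; rollback: string };

// Dry run: compute what would be touched without changing anything.
function planDeletion(rows: Row[], cutoffIso: string): DeletionPlan {
  const affectedIds = rows
    .filter((r) => r.deletedAt !== null && r.deletedAt < cutoffIso)
    .map((r) => r.id);
  return {
    description: `Purge ${affectedIds.length} soft-deleted rows older than ${cutoffIso}`,
    affectedIds,
    rollback: "Restore from the backup taken before this run",
  };
}

// Execution refuses to run unless a human confirmed the exact count they saw.
function executeDeletion(rows: Row[], plan: DeletionPlan, confirmedCount: number | null): Row[] {
  if (confirmedCount !== plan.affectedIds.length) {
    throw new Error("Refusing to delete: plan not confirmed by a human");
  }
  return rows.filter((r) => plan.affectedIds.indexOf(r.id) === -1);
}
```

The agent is allowed to call planDeletion as often as it likes. Only a person supplies confirmedCount.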

Takeaway: trust the agent to draft, never to grade its own work. Verification is not a phase at the end, it is the other half of every single loop. Run it, see it, check it against the spec, and put a human at every seam you cannot take back.
Subagents and orchestration · 3 chapters · ~11 min

Subagents: what they are and when they help

By now you have a mental model of one AI agent: it reads your prompt, looks at your code, edits files, runs commands, and reports back. A subagent is the same idea, scoped down. It is a separate AI agent that the main agent (or you) spins up to handle one focused piece of work, with its own fresh context window and its own task. Think of the main agent as a tech lead and subagents as people it delegates bounded jobs to: "you write the tests for the billing module, you audit the auth code for security holes, you go read the docs and come back with a summary."

The key word is context. Every agent has a limited working memory (the context window). The more junk you cram into it (long files, failed attempts, unrelated tangents), the worse it reasons. A subagent starts clean. It only sees the slice of the problem you hand it, does the job, and returns a tidy result. The main agent never has to wade through the subagent's messy intermediate steps; it just gets the answer.

What this looks like in real tools

This is not theoretical. Claude Code ships built-in subagents: the lead agent can split a task and fan out separate Claude instances, each with its own context, tools, and even a different model, then stitch the results together. As of mid-2026 Anthropic extended this with "dynamic workflows" where one session can spawn tens or hundreds of parallel subagents. Cursor added a similar pattern: its /multitask command (shipped in version 3.2, around April 2026) breaks a job into chunks and runs them as parallel subagents, with up to eight background agents working in isolated git worktrees so they do not trample each other's files. See the official docs at code.claude.com and cursor.com.

When subagents actually help

  • Genuinely independent parallel work. Three modules that each need tests, or a rename that touches forty files in separate folders. Each subagent owns a slice, they run at once, you save wall-clock time.
  • Keeping context clean. A noisy job (debugging a flaky test, reading a 2,000-line legacy file) can be quarantined in a subagent so the main agent's reasoning stays uncluttered. The subagent burns through the mess and hands back only the conclusion.
  • Broad search and research. "Find everywhere we call the old payments API" or "compare three libraries for this." Fan out, gather, summarize. This is the highest-value use, because search naturally produces a lot of throwaway intermediate text you do not want polluting the main thread.

When they are overkill

Most tasks are sequential and small, and subagents add coordination overhead, cost, and a layer of indirection that makes things harder to follow, not easier. If step two depends on the output of step one, there is nothing to parallelize. Spinning up five agents to fix one typo is theater. Subagents also cannot see each other's work mid-flight, so any task where the pieces need to negotiate (two parts of an API that must agree on a shared shape) is usually better done in one focused thread, or with the main agent defining the shared contract first.

A concrete example

Say you are adding a "delete account" feature to a SaaS app. A single agent would do it step by step. Split into subagents, it might look like this: subagent A writes the backend endpoint and database cleanup logic; subagent B writes the frontend confirmation modal and wiring; subagent C audits the change for security and privacy gaps (does it really delete the data, does it leave orphaned records, does it respect the data-retention rules). A and B are independent enough to run in parallel; C runs after, reviewing both with a fresh, unbiased context. The main agent defines the API contract up front so A and B agree, then integrates and runs the full test suite. You get speed on the build and a genuinely independent review, something a single agent (already attached to its own code) tends to do poorly.

Takeaway: reach for subagents when the work is genuinely parallel, when a subtask would otherwise pollute the main context, or when you want a fresh-eyed reviewer. For a simple linear change, one agent is faster and easier to trust. Split the work, do not split a one-liner.

Orchestration patterns for multi-agent work

Once you trust a single agent to write a function, the obvious next thought is: why not run ten of them? Sometimes that is a real speedup. Often it is a way to manufacture chaos faster than you can read it. The difference is whether you picked the right orchestration pattern for the shape of the work. There are four worth knowing, and they are not interchangeable.

Four patterns, four jobs

  • Parallel fan-out. Many agents work on genuinely independent pieces at the same time, then you merge. The classic case: you are changing how 50 components import a module. Each agent owns a few files, nobody touches the same file, and you stitch the results together. Cheap, fast, and only safe when the pieces truly do not overlap.
  • Pipeline. Each item flows through ordered stages, like a factory line. Draft, then critique, then rewrite, then format. One agent's output is the next agent's input. Good when the work has natural sequential phases.
  • Orchestrator-worker. One coordinator holds the plan, decides what to spawn, hands each worker a self-contained task, and synthesizes the results. Anthropic's own Research feature works this way: a lead agent plans, spins up roughly three to five subagents in parallel, and then runs a separate pass to assemble and cite. The workers do not talk to each other. Every decision about what happens next lives in the orchestrator.
  • Verify / critic panels. Agents whose only job is to check other agents. A "red team" agent that hunts for bugs, a fact-checker that flags unsourced claims, a reviewer that rates output against a rubric. They do not produce the work; they raise the floor on its quality.
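To make the shapes concrete, here is a small sketch that combines three of them: an orchestrator fans out to isolated workers, then runs a critic pass before merging. The run_worker and critic functions are stand-ins invented for this example; in a real setup each would wrap a call to a model, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_worker(brief):
    """Stand-in for one agent call; a real worker would call a model API here."""
    return f"[draft for {brief['title']}]"


def critic(draft):
    """Stand-in critic: it checks the work and produces none of its own."""
    return bool(draft.strip())


def orchestrate(briefs, max_workers=4):
    # Fan-out: each worker gets only its own brief, never another worker's output.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        drafts = list(pool.map(run_worker, briefs))  # map preserves input order
    # Critic pass, then the orchestrator merges the pieces that survived.
    rejected = [b["title"] for b, d in zip(briefs, drafts) if not critic(d)]
    if rejected:
        raise ValueError(f"critic rejected: {rejected}")
    return "\n".join(drafts)
```

All decisions about what runs next live in orchestrate, and the workers share nothing. A pipeline would replace the pool with a loop that feeds each stage's output into the next.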

This article as the worked example

The thing you are reading is an orchestrator-worker job. The article has 40-odd chapters. Each chapter was written by a separate subagent with a fresh context window and a tight brief: a chapter number, a title, a list of points to hit, the house style rules, and an instruction to web-search for current facts. None of those subagents could see the others. The one writing this chapter did not know what the chapter on Claude Code said, or whether the chapter on keeping the context focused would repeat it.

That isolation is the whole trick. A single agent trying to hold 40 chapters in one context would drift: forget the no-em-dash rule by the pricing section, contradict itself on pricing, lose the voice. Forty focused agents each holding one chapter stay sharp. Then an orchestrator merges them, checks the seams, deduplicates overlap (two chapters both reaching for the same Cursor example), and enforces consistency the workers could not coordinate on themselves.

The fan-out part is obvious: 40 chapters written at once is far faster than 40 written in sequence. The hidden cost is just as real. Because workers cannot see each other, three of them will independently decide the Anthropic Research system is the perfect illustration, and the orchestrator has to referee. That deduplication and seam-checking is coordination cost, and it does not go away. It moves up to the orchestrator and to you.

More agents is not better

Anthropic reported their multi-agent Research setup beat single-agent Opus by a wide margin on an internal eval, but at roughly 15x the token cost of a normal chat. That ratio is the honest headline of multi-agent work: you are trading money and complexity for parallelism and quality, and the trade only pays when the task genuinely decomposes.

The failure mode is spawning agents on work that does not split. If two agents must edit the same file, you get merge conflicts instead of speed. If a task is really sequential (you cannot write the tests until the API exists), fan-out just produces confident, wrong, parallel guesses. And every extra agent adds a context to manage, an output to read, and a chance for one bad worker to poison the merge.

Pick the pattern from the shape of the work, not the other way around. Independent pieces want fan-out. Ordered phases want a pipeline. Open-ended research wants an orchestrator. Quality-sensitive output wants a critic.

Practical takeaway:

  • Start with one agent. Only fan out when you can name the independent pieces and prove they do not touch the same files.
  • Give every worker a self-contained brief and a fresh context. Do not assume they share knowledge; they do not.
  • Budget for coordination: someone (an orchestrator agent, or you) must merge, deduplicate, and check the seams. That work is not free.
  • Add a critic pass before you trust the output, especially when the agents wrote in isolation.
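"Prove they do not touch the same files" can be a mechanical check you run before spawning anything. A minimal sketch, assuming you can list the files each worker is allowed to edit; the function name and the shape of the assignments mapping are illustrative choices, not part of any tool.

```python
from collections import defaultdict


def find_overlaps(assignments):
    """Map each file claimed by more than one worker to the workers claiming it."""
    owners = defaultdict(list)
    for worker, files in assignments.items():
        for path in set(files):  # ignore duplicates within one worker's list
            owners[path].append(worker)
    return {path: sorted(ws) for path, ws in owners.items() if len(ws) > 1}
```

An empty result means the slices are disjoint and fan-out is safe on that axis. Any entry in the result is a merge conflict waiting to happen, so reassign the file or run that piece sequentially.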

Context management: focus beats firehose

The instinct, when you want the AI to do a good job, is to give it everything: paste the whole file, the whole error log, the whole Slack thread, the entire schema. More context, better answer, right? It is one of the most expensive mistakes you can make. A model's context window is not a hard drive you fill up. It is closer to working memory, and working memory degrades under load. The skill that separates people who ship with AI from people who fight it all day is knowing what to leave out.

Context rot is real and it kicks in early

"Context rot" is the measured tendency of models to produce worse output as the input grows, well before they hit their advertised limit. Chroma's 2025 study tested 18 frontier models and found every one of them degraded as input length increased. The headline number on a model is the ceiling, not the comfortable working range: a 200K-token model often starts getting unreliable somewhere around 130K, and accuracy on retrieval-style tasks can slip far earlier than that. The failure mode is nasty because it is silent. The model does not say "I am confused now." It just quietly starts ignoring an instruction from 40 messages ago, or confidently answers using a stale version of a function you already rewrote.

That last part is the link to hallucination. A bloated context is full of contradictions: the old code and the new code, the rejected approach and the accepted one, three different error messages from three different bugs. When you ask for a fix, the model has to guess which reality is current, and a wrong guess looks exactly like a confident, well-formatted hallucination. Focused context does not just save money. It removes the raw material that bad answers are built from.

Techniques that actually work

  • Start a fresh session for a new task. The cheapest, highest-leverage move. When you finish fixing the auth bug and move on to the export feature, do not keep typing in the same thread. The auth debugging is now pure noise. In Cursor or Claude Code, open a new conversation. The quality jump is immediate and obvious.
  • Give each agent only what it needs. If you use sub-agents (one to write code, one to review, one to run tests), do not hand each one the full transcript. The reviewer needs the diff and the spec, not your forty earlier prompts. Scoped context per agent is the whole point of splitting work up.
  • Summarize and hand off. When a long session has produced real decisions, ask the model to write a short summary of what was decided and why, then start fresh and paste that summary in. You keep the conclusions and drop the deliberation. Anthropic's server-side compaction (in the Claude API as of 2026) automates a version of this: it summarizes the conversation as it nears the limit, and the related context editing feature clears stale tool results like old file reads and screenshots while keeping the decisions they led to.
  • Use docs and specs instead of pasting everything. Keep a short architecture note, a spec, or a CLAUDE.md / rules file in the repo. Point the model at it. A tight 300-line spec the agent reads on demand beats 8,000 lines of source pasted into chat, every time.
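Treating context as a budget can be done literally. The sketch below picks the most important pieces that fit under a token limit and leaves the rest out. It rests on two assumptions that are flagged in the code: the four-characters-per-token estimate is a rough rule of thumb rather than a real tokenizer, and the priorities are ones you assign by hand.

```python
def estimate_tokens(text):
    # Rough heuristic (an assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def build_context(items, budget_tokens):
    """Pick context pieces in priority order until the token budget is spent.

    items: list of (priority, label, text); a lower priority number is more important.
    Returns (chosen_labels, tokens_used).
    """
    chosen, used = [], 0
    for _, label, text in sorted(items, key=lambda item: item[0]):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # does not fit; a smaller, lower-priority piece may still fit
        chosen.append(label)
        used += cost
    return chosen, used
```

With a small budget, the exact error and the one relevant function make it in, while the whole source file is skipped.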

A concrete example

Say you are building a Stripe checkout flow. The firehose approach: paste your entire server.js, the full Stripe webhook docs, the previous failed attempt, and a screenshot of the dashboard, then ask "why is the payment failing?" The model now juggles three thousand lines and two conflicting versions of your handler, and it suggests a fix for code you deleted yesterday.

The focused approach: a new session, the single handleWebhook function (40 lines), the exact error string, and one sentence of context ("Stripe test mode, signature verification"). The model spots the raw-body problem in one shot (stripe.webhooks.constructEvent needs the unparsed request body to verify the signature), because there is nothing else competing for its attention. Same model, same problem, completely different outcome. The tooling reflects this: in early 2026 Cursor moved away from stuffing long tool outputs into the window and instead writes them to files the agent greps on demand, turning passive context stuffing into active, on-demand retrieval.
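Why the raw body matters is easier to see in code. The sketch below reimplements the HMAC check behind a "t=...,v1=..." signature header using only the standard library, purely for illustration: the secret and the event are made up, the timestamp tolerance check is left out, and production code should call the official Stripe library's verification helper instead of this.

```python
import hashlib
import hmac
import json


def sign(raw_body: bytes, timestamp: int, secret: str) -> str:
    """HMAC-SHA256 over '<timestamp>.<raw body>', returned as hex."""
    signed_payload = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()


def verify(raw_body: bytes, header: str, secret: str) -> bool:
    """Check a 't=...,v1=...' style signature header against the request body."""
    parts = dict(item.split("=", 1) for item in header.split(","))
    expected = sign(raw_body, int(parts["t"]), secret)
    return hmac.compare_digest(expected, parts["v1"])


# Demo with made-up values (this secret and event are not real).
secret = "whsec_test_example"
raw = b'{"id": "evt_1",  "type": "checkout.session.completed"}'  # note the double space
header = f"t=1700000000,v1={sign(raw, 1700000000, secret)}"

# What you get if a JSON body parser ran first and the body was serialized again.
reserialized = json.dumps(json.loads(raw)).encode()
```

The signature matches the exact bytes that were sent. Once a body parser has turned them into an object, serializing that object again can change the whitespace or key order, and the check then fails even though the data is the same.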

Takeaway

Treat context as a budget you spend deliberately, not a bucket you top up.

  • New task, new session. Stale context is not free, it actively hurts.
  • Paste the smallest thing that makes the question answerable: the one function, the one error, the relevant spec.
  • When a session gets long, summarize the decisions, then start clean.
  • If output quality is dropping mid-session, suspect context rot before you blame the model.
From localhost to live · 3 chapters · ~12 min

What "localhost" really means, and why it is not shipped

You typed a prompt, the AI wrote some code, you ran it, and a browser tab opened showing your app. The address bar said something like localhost:3000 or 127.0.0.1:5173. It works. The buttons click, the data saves, it looks real. So you send the link to a friend, and they get an error: "this site can't be reached."

That is the single most common moment of confusion for people new to building software. The thing you are looking at is genuinely working. It is just working only for you, and that is by design.

localhost is your machine talking to itself

localhost is a name that always points back to the computer you are sitting at. Its numeric form, 127.0.0.1, is a special address every machine reserves for "me, myself." When you open localhost:3000, your browser is asking your own laptop for a page, and your own laptop is answering. No router, no internet, no other person involved. The number after the colon (3000, 5173, 8080) is just the "port," a labeled door so multiple programs on one machine can run at once without colliding.
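A short sketch makes the two ideas tangible: a loopback address is one that only the machine itself can reach, and a port is just a number the operating system hands to one listening program. The helper names are made up for this example.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """True if the address can only be reached from the machine itself."""
    return ipaddress.ip_address(address).is_loopback


def bind_ephemeral(host: str):
    """Bind a listening TCP socket on any free port and return (socket, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to pick any free port
    sock.listen()
    return sock, sock.getsockname()[1]
```

A dev server bound to 127.0.0.1 accepts connections only from the same machine, which is exactly why your friend's browser cannot load it. Binding to 0.0.0.0 instead listens on all of the machine's network interfaces, but other people still need a public address that routes to it.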

Think of it like cooking dinner in your own kitchen. The food is real and it tastes great. But nobody else can eat it until you actually serve it somewhere they can reach. localhost is your kitchen. Nobody else has the address, because the address literally means "wherever I happen to be standing."

Live means a public address the whole internet can reach

For other people to load your app, the code has to run on a computer that is always on, always connected, and has a public address (a domain like yourapp.com or a provided URL like yourapp.vercel.app). That computer is a server, and putting your code there is "deploying" or "shipping."

This is exactly the step modern vibecoding tools try to collapse. With Lovable, Bolt, Replit, or v0 (which publishes to Vercel), there is usually a single "Publish" or "Deploy" button that takes your localhost app and hands it a real public URL. As of mid-2026 that step is often genuinely one click. The confusion is not that shipping is hard with these tools, it is that people do not realize the running preview and the shipped site are two different things living in two different places.

Why "it works on my machine" is not a shipped product

Even after you click deploy, the live version can break in ways the local one never did. The reason is that your laptop and the server are not the same environment, and your code quietly leaned on things only your laptop had. The usual culprits:

  • Secrets. Your API keys and passwords live in a file on your machine (often called .env) that, correctly, never gets uploaded. The server starts with those values blank unless you set them there too. This is the number one reason a freshly deployed app shows a white screen or a "500" error.
  • Dependencies and versions. Your machine might have a slightly different version of a tool or library than the server installs. A function that exists locally may be missing live.
  • Data. Locally you talk to a tiny test database full of your fake sample rows. Live, that database does not exist until you create and connect a real one.
  • The base URL. Code that hardcodes localhost:3000 somewhere will keep pointing at the visitor's own machine once it is live, where nothing is listening.
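Two of these culprits, missing secrets and a hardcoded base URL, can be caught at startup instead of in front of a user. A minimal sketch, assuming hypothetical variable names (DATABASE_URL, CURRENCY_API_KEY, PUBLIC_BASE_URL); substitute the settings your app actually needs.

```python
import os


class MissingConfig(RuntimeError):
    pass


def load_config(env=os.environ):
    """Read required settings from the environment and fail loudly at startup."""
    required = ["DATABASE_URL", "CURRENCY_API_KEY", "PUBLIC_BASE_URL"]  # hypothetical names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingConfig("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}


def absolute_url(config, path):
    # Build links from configuration, never from a hardcoded localhost:3000.
    return config["PUBLIC_BASE_URL"].rstrip("/") + "/" + path.lstrip("/")
```

Call load_config once when the process starts. A deploy with a blank key then fails immediately with a message that names the variable, instead of showing a white screen later.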

The mental model: dev versus production

Hold two pictures in your head. Development (dev) is your private workshop: your laptop, fake data, secrets in a local file, fast and forgiving, only you watching. Production (prod) is the public storefront: a server, real users, real data, real secrets, and real consequences when something breaks. They run the same code, but almost nothing else about them is the same.

A concrete example. You vibecode a tip-splitting app. On localhost it adds up the bill perfectly because it reads from a sample receipt baked into the code and uses a currency API key sitting in your .env file. You deploy. Your friend opens the live link and the page loads but every total shows $NaN. Nothing is broken in the code. The server just never got the currency API key, so the conversion call fails and the math falls over. You add the key to your host's environment settings, redeploy, and it works for everyone. Same code, different environment, completely different outcome.

Takeaway: "it runs on localhost" means the code is written. "It runs in production for someone who is not you" means it is shipped. The gap between those two is environment, not effort: secrets, data, and versions that lived on your machine and have to be recreated on the server before anyone else can use what you built.

Hosting: where your app actually lives when it goes live

Your AI assistant can write the code. It cannot put it on the internet for you. At some point a real machine somewhere has to answer requests from real users, and "where does it run" is a decision that quietly shapes your cost, your speed, and how much 2am stress you sign up for. The good news: in 2026 you mostly pick one of three lanes, and the right lane is usually obvious once you know what your app actually is.

The three lanes

Lane 1: static and frontend hosts. These serve a built bundle of HTML, CSS, and JavaScript from a global edge network, plus small serverless functions for the dynamic bits. This is where Vercel, Netlify, and Cloudflare Pages live. You connect a GitHub repo, they build on every push, and your site is live on a real URL in minutes with HTTPS handled for you. Vercel is the smoothest path for Next.js. Cloudflare Pages is the standout on cost: as of mid-2026 its free tier still offers unlimited bandwidth with no egress fees, so a traffic spike does not generate a surprise bill. Netlify and Vercel both have free hobby tiers, but they have tightened over time (Netlify now meters with monthly build credits, Vercel's hobby plan is restrictive on functions), so read the current limits before you commit.

Lane 2: full-app platforms. When you have a long-running backend (a Node or Python or Go server, a database, a background worker), you want a Platform-as-a-Service: Railway, Render, or Fly.io. They take your repo or a Dockerfile, run it as an always-on service, and give you a managed Postgres database next to it. As of mid-2026 the free-tier picture has shifted: Render still keeps a real permanent free tier (web service plus a Postgres, no credit card), while Railway runs on a small trial credit then roughly $5/month, and Fly.io has dropped its old free allowance for new accounts (figure on roughly $2 to $5/month for a small always-on machine). Fly is the one to reach for when latency by region matters, since it runs your app close to users in multiple cities.

Lane 3: a VPS you manage yourself. A virtual private server is a bare Linux box you rent and configure: install nginx, run your app behind it, set up a database, manage TLS certificates, do your own updates and backups. Providers like Hetzner (around €4/month for a small box in mid-2026) and DigitalOcean (Droplets from about $4 to $6/month) are the usual picks. You get full control and the lowest cost per unit of compute. You also get full responsibility. This is the most you will learn and the most that can break at the worst time.

Serverless versus always-on, in one breath

Serverless functions spin up only when a request arrives and you pay per use, which is cheap at low traffic but means the first request after idle can be slow (a "cold start"). Always-on services run continuously, cost the same whether or not anyone visits, and respond instantly. Lanes 1 and 2 lean serverless and always-on respectively; the VPS is always-on by definition.

A concrete mapping

Say you built a small SaaS with an AI assistant: a React frontend, a Python API, a Postgres database, and a nightly job that emails users a digest. A clean first-launch setup is Cloudflare Pages or Vercel for the frontend, and Render for the API plus Postgres plus the scheduled job. That is two dashboards, a free-to-modest monthly bill, and almost nothing to maintain. If a year later you are paying hundreds a month and want tighter control, you migrate the backend to a Hetzner VPS with nginx and a managed Postgres, and keep the frontend on the edge host. You move to the VPS for a reason, not as a starting point.

Practical takeaway

  • Static or frontend-heavy site: Cloudflare Pages (cheapest at scale) or Vercel (best Next.js experience).
  • Real backend plus a database, first launch: Render or Railway. Optimize for "it just deploys," not for saving $4.
  • Cost or control matters more than convenience, and you can run a Linux box: a Hetzner or DigitalOcean VPS.
  • Always check the current free tier the week you launch. These limits move, and an outdated assumption is exactly how you get a bill or an outage you did not plan for.

Domains, secrets, and the go-live checklist

You have an app that works on your machine. Going live is the last mile, and it is where confident builders quietly fumble: a leaked API key, an unreachable domain, a "not secure" warning that scares away your first ten users. None of it is hard. It is just a sequence of small, specific steps. This chapter walks the sequence.

What a domain actually is

Your server lives at an IP address, something like 76.76.21.21. Nobody wants to type that. A domain (equityflow.com) is a human-friendly name that points to that IP via DNS, the internet's phone book. When someone types your domain, their browser asks a DNS resolver "what is the address for this name?", gets the IP back, and connects. The mapping lives in DNS records: an A record points a name to an IPv4 address, a CNAME points one name at another name (handy when your host gives you myapp.vercel.app instead of a raw IP).
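You can watch the lookup happen from code. This sketch asks the operating system's resolver for the addresses behind a name, using only the standard library; the resolve helper is a name chosen for this example.

```python
import socket


def resolve(hostname: str) -> set:
    """Return the set of IP addresses the system resolver gives for a hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}
```

Calling resolve on your own domain after adding a record is a quick way to see whether the change has reached the resolver you are using. It returns the final addresses only, so a CNAME shows up as whatever IPs the target name resolves to.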

You buy a domain from a registrar. Solid, no-nonsense options as of mid-2026: Namecheap and Cloudflare Registrar (which sells domains at cost, no markup). Expect roughly $10 to $15 a year for a .com. Ignore the upsells; you do not need their email or "site protection" packages.

Pointing it is two clicks plus a wait. If you deploy on a managed host, it tells you exactly what record to add. On Vercel, for example, you add your domain in the project settings, and it shows you the A or CNAME record to paste into your registrar's DNS panel. DNS changes propagate in minutes to (occasionally) a couple of hours, so do not panic if your domain is not instant.

HTTPS and the padlock

The padlock in the address bar means traffic between your user and your server is encrypted with TLS, so passwords and tokens are not sent in plain text. Without it, browsers slap a "Not secure" label on your site, and modern browsers increasingly refuse features outright on plain HTTP.

The good news: you almost never configure this by hand anymore. Let's Encrypt is a nonprofit that issues TLS certificates for free, and every serious host automates it. Vercel, Netlify, Railway, and Cloudflare all provision and auto-renew certificates the moment your domain points at them. If you are running your own server (a VPS), a tool like Caddy gets you valid HTTPS with essentially zero config. The era of paying $80 for an SSL certificate is over.

Secrets: the one thing you must not get wrong

Your app needs secrets: a database password, a Stripe key, an Anthropic or OpenAI API key. The cardinal rule is simple. Secrets never go in your git repository. Not in the code, not in a config file, not "temporarily." Once a key is committed, treat it as compromised even after you delete it, because git history and bots scraping public repos are relentless. People have run up five-figure cloud bills overnight from a single leaked key.

Instead, secrets live in environment variables: key-value pairs the running process reads at startup (process.env.STRIPE_KEY, os.environ["STRIPE_KEY"]). Locally you keep them in a .env file that is listed in .gitignore so it is never committed. In production, you set them in your host's dashboard: Railway has a Variables tab, Vercel and Netlify have an Environment Variables section. For a team juggling many keys across services, a dedicated secrets manager like Doppler or AWS Secrets Manager syncs one source of truth to every platform, but a solo founder does not need that on day one.

Practical habit: commit a .env.example file with the key names and dummy values, so future-you (or an AI agent you hand the repo to) knows what to fill in, without ever exposing the real values.
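The .env.example file can also work as a pre-deploy check: every name it lists should be set in the environment you are about to ship to. A minimal sketch with hypothetical helper names; it reads only variable names and never needs the real values written down anywhere.

```python
def parse_env_keys(text: str) -> set:
    """Collect variable names from .env-style text, ignoring comments and blanks."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        keys.add(name)
    return keys


def missing_keys(example_text: str, environment: dict) -> list:
    """Names listed in .env.example that are unset or empty in the environment."""
    return sorted(k for k in parse_env_keys(example_text) if not environment.get(k))
```

Run it with the contents of .env.example and the process environment (for instance dict(os.environ)) as a startup or CI step. An empty list means every documented key has a value.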

The go-live checklist

  1. Build cleanly. Run the production build locally first (npm run build or equivalent). If it breaks here, it breaks in deploy.
  2. Set environment variables in the host, for production specifically. The most common launch bug is a key that exists locally but was never added to the live environment.
  3. Deploy to a staging environment first. Most hosts give you a preview URL per branch for free. Test against real keys there.
  4. Point the domain and add the DNS record the host gives you.
  5. Confirm HTTPS is active (look for the padlock; the host usually does this automatically once DNS resolves).
  6. Promote to production by merging to your main branch or clicking promote.
  7. Monitor. Wire up basic error tracking (Sentry's free tier is plenty) and an uptime check so you hear about outages before your users tweet about them.
  8. Know your rollback. Every managed host keeps previous deployments; redeploying the last good one is one click. Rehearse it once before you need it at 2am.

Keep staging and production genuinely separate, ideally with separate databases and separate keys, so a test never touches real customer data.

Founder to founder

Shipping feels heavier than it is. The whole stack above is free to start, automated by default, and reversible. The discipline that matters is not technical brilliance; it is the boring stuff: keep secrets out of git, test in staging, and make sure you can roll back. Do those three things and you can launch on a Tuesday afternoon, watch the padlock turn green, and send the link to your first real user without holding your breath. Then go fix the next bug. That is the job.

This path assumes you take security into your own hands. The security guide is essential reading before you put any of this in front of real users.
