“Vibe coding” is the new caffeine: intoxicating, productivity-boosting, and capable of ruining your week if you treat it like water.
In a hackathon, vibe coding is a flex. In production at scale—especially in regulated industries—it’s either a competitive advantage or a creative way to summon auditors, outages, and reputational damage in one elegant PR.
Here’s the thing founders tend to miss (because founders are built out of optimism and cortisol):
Regulated doesn’t mean slow. Regulated means you can’t blag it.
You can absolutely move fast. You just need to move fast with evidence, not vibes.
This article is a founder-facing playbook for how to do exactly that.
The real problem isn’t AI-written code. It’s unverified change.
LLM-generated code has a specific failure mode: plausibility.
It reads clean. It compiles. It passes the tests you thought to write. It “looks right”.
And then it detonates in production because:
- the edge case you didn’t anticipate is now happening 10,000 times per minute,
- a “harmless” query turns into a database lock festival,
- concurrency turns your lovely logic into a probabilistic crime scene,
- a rollout quietly changes behaviour in a way you don’t notice until revenue drops or complaints spike.
At scale, reality manufactures edge cases for free. At regulated scale, the bill for being wrong is just… more adult.
So your operating stance has to change:
Treat code as a hypothesis.
Treat production behaviour as the experiment.
Treat evidence as the only thing that counts.
That’s not philosophy. That’s how you keep your company alive.
Your job is to satisfy three audiences (and none of them care about your vibes)
When you ship production systems in regulated environments, you’re always answering to:
- Users — “Does it work? Is it fast? Does it feel good?”
- The pager — “Does it break? Can we recover quickly?”
- Auditors / regulators — “Can you show what changed, why, and how you controlled risk?”
Founders often optimise for #1, sometimes #2, and treat #3 like a box-ticking exercise.
That’s backwards.
If you solve #3 properly, you often get #2 “for free”, and #1 becomes easier because you can ship more aggressively without fearing your own shadow.
The trick is to turn “compliance” into an engineering artefact, not a meeting series.
Compliance isn’t paperwork. It’s traceability + control
People hear “regulated” and imagine some Dickensian clerk demanding a Word document with 17 headings.
In reality, what you need is boring, mechanical, and automatable:
- traceability: what changed, when, by whom, and why
- control: what gates existed, and whether they functioned
- replayability: can we reconstruct what the system did with the info we had at the time
- detectability: can we notice when reality diverges from intent
- recoverability: can we undo damage quickly
None of that requires heavy bureaucracy. It requires systems.
If you want to vibe code in production, you need to accept one brutal truth:
You don’t get to ship code.
You get to ship evidence that the change is safe.
The core pattern: the Evidence Bundle
If you do one thing after reading this, do this:
Every production-impacting change produces an Evidence Bundle automatically.
Not a wiki page. Not a Slack message. A generated artefact tied to the deploy.
Your Evidence Bundle should answer:
- What changed? (diff, config, prompts, model ID/routing, infra)
- Why? (intent + scope)
- What could go wrong? (risk class + failure modes)
- What evidence did we gather? (tests, evals, perf, security checks)
- What guardrails exist? (alerts, SLOs, auto-rollbacks, canary plan)
- How do we roll it back? (explicit rollback steps, not “git revert and pray”)
- What do we watch post-deploy? (dashboards + thresholds)
This transforms “we move fast” from a vibe into a machine.
And—this is the part founders should care about—it makes your org more fearless, which makes you faster.
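To make that concrete, here's a minimal sketch of what an Evidence Bundle could look like as a generated artefact. Every field name here is illustrative, not a standard schema — the point is that it's structured data your CI emits and attaches to the deploy, not prose someone writes.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical Evidence Bundle shape; field names are assumptions,
# chosen to mirror the questions the bundle must answer.
@dataclass
class EvidenceBundle:
    release_id: str
    diff_summary: str                                   # what changed: code, config, prompts, routing
    intent: str                                         # why
    risk_tier: str                                      # R0-R3 (see below)
    failure_modes: list = field(default_factory=list)   # what could go wrong
    evidence: dict = field(default_factory=dict)        # tests, evals, perf, security checks
    guardrails: dict = field(default_factory=dict)      # alerts, SLOs, canary plan
    rollback_steps: list = field(default_factory=list)  # explicit, not "git revert and pray"
    post_deploy_watch: dict = field(default_factory=dict)  # dashboards + thresholds

    def to_artifact(self) -> str:
        """Serialise to a JSON artefact that CI attaches to the deploy."""
        return json.dumps(asdict(self), indent=2)

bundle = EvidenceBundle(
    release_id="2024-06-01.3",
    diff_summary="prompt v12 -> v13; retry budget 2 -> 3",
    intent="reduce hallucinated citations in summaries",
    risk_tier="R1",
    failure_modes=["refusal rate spikes", "downstream parser breaks"],
    evidence={"golden_set_pass_rate": 0.98, "p95_latency_ms": 410},
    guardrails={"canary_pct": 5, "auto_rollback_on": "refusal_rate > 2%"},
    rollback_steps=["repin prompt v12", "flush router cache"],
    post_deploy_watch={"dashboard": "llm-quality", "window_hours": 24},
)
print(bundle.to_artifact())
```

The design choice that matters: the bundle is machine-generated and machine-readable, so gates can check it and auditors can query it.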
Not all changes deserve the same brakes
The single most common operational mistake I see in “fast” teams is treating every commit like it has the same blast radius.
Regulated environments don’t punish speed. They punish reckless uniformity.
So risk-tier your changes and gate accordingly. Keep it simple:
- R0: cosmetic — low stakes
- R1: behavioural — logic changes, prompts, routing, model selection
- R2: stateful — schema changes, backfills, idempotency-sensitive workflows
- R3: catastrophic — auth/permissions, payments, data deletion, safety filters, anything that can permanently harm users or your balance sheet
Then enforce different gates per tier.
The goal isn’t to slow down. The goal is to apply friction only where irreversible damage is possible.
That’s how you keep velocity without turning your production environment into a casino.
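In code, "different gates per tier" can be as dumb as a lookup table — a sketch, with gate names that are assumptions standing in for whatever checks your pipeline actually runs:

```python
# Illustrative gate policy per risk tier; the gate names are placeholders,
# not a framework. R0 stays frictionless; friction concentrates where
# irreversible damage is possible.
GATES = {
    "R0": ["unit_tests"],
    "R1": ["unit_tests", "behavioural_evals", "canary"],
    "R2": ["unit_tests", "behavioural_evals", "migration_dry_run",
           "canary", "human_review"],
    "R3": ["unit_tests", "behavioural_evals", "security_review",
           "human_review", "two_key_approval", "staged_rollout"],
}

def required_gates(tier: str) -> list[str]:
    """Return the gates a change must pass for its risk tier."""
    if tier not in GATES:
        raise ValueError(f"unknown risk tier: {tier}")
    return GATES[tier]

print(required_gates("R3"))
```

A cosmetic change clears one gate and ships; an auth change has to earn its way through six. Same pipeline, different brakes.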
The dirty secret of AI products: behaviour is part of your API
In traditional systems, “regression testing” is mostly deterministic: code in → output out.
In AI-heavy systems, you are shipping stochastic behaviour:
- prompts change outcomes
- routing policies change outcomes
- model updates change outcomes
- safety filters change outcomes
- tool calls fail weirdly
- distribution shifts over time
So if you’re vibe coding an AI product and you don’t have behavioural regression tests, you’re basically doing surgery with a blindfold and calling it “bold”.
What works:
- golden sets: curated prompts / scenarios with expected properties
- metamorphic tests: paraphrase inputs; the output should stay within bounds
- invariants: no secrets, no policy violations, bounded refusal rate, stable formatting for downstream parsers
- shadow traffic: compare new vs old on real requests without affecting users
- distribution monitoring: output drift, refusal drift, toxicity drift, latency drift
I’ve used multi-agent evaluation patterns where the system has to argue for correctness—e.g., native-speaker debate panels for translation quality and cultural bias detection—because “sounds plausible” is a trap at scale.
This is the point: you don’t need more “testing”. You need the right kind of testing.
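Here's a toy sketch of the metamorphic-plus-invariants idea. The model call is a stand-in and the invariant checks are illustrative — the pattern is what matters: paraphrased inputs must all satisfy the same properties, because you're testing behaviour, not exact strings.

```python
import re

def call_model(prompt: str) -> str:
    """Stand-in for your real model call; replace with your client."""
    return '{"summary": "Claim approved pending documents."}'

# Example invariants: properties any acceptable output must satisfy.
# The secret patterns here are illustrative key-like shapes.
SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16})")

def check_invariants(output: str) -> list[str]:
    """Return a list of violated invariants (empty means the output is acceptable)."""
    violations = []
    if SECRET_PATTERN.search(output):
        violations.append("secret leaked")
    if not output.strip().startswith("{"):
        violations.append("not JSON for the downstream parser")
    if len(output) > 4000:
        violations.append("output too long")
    return violations

# Metamorphic test: paraphrases of the same request should all stay
# within the same bounds.
paraphrases = [
    "Summarise the status of claim 4411.",
    "What's the current state of claim 4411, briefly?",
    "Give me a short status update on claim #4411.",
]
for p in paraphrases:
    assert check_invariants(call_model(p)) == [], f"invariant broke on: {p}"
```

Golden sets work the same way: curated scenarios run through `check_invariants`-style property checks on every behavioural change, so a prompt tweak that quietly breaks your output format fails in CI instead of in production.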
The “highly regulated” bit: audits want replay, not opinions
When something goes wrong in a regulated setting, you won’t get asked “did you do your best?”
You’ll get asked:
- What did you know at the time?
- What controls existed?
- Did you follow them?
- Can you reconstruct the decision path?
So build systems that make those questions boring:
- version everything that affects behaviour (code, config, prompts, model IDs, routing)
- correlate user-visible outcomes with deployed versions (trace IDs + release IDs)
- immutable deploy records (who/what/when/why)
- reproducible builds
- data lineage for training/fine-tuning/evaluation sets
This is not “governance theatre”. It’s an operational superpower.
In practice, this is how you keep agentic automation safe in environments where “oops” is not an acceptable incident classification. I’ve shipped multi-agent systems in insurance contexts where mistakes are expensive and scrutiny is real—what makes it sustainable is converting risk into controls that are executable, observable, and reviewable.
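A minimal sketch of what "correlate outcomes with deployed versions" looks like in practice — every decision record carries the exact versions of everything that shaped it. The field names and the idea of injecting versions at deploy time are assumptions about your setup:

```python
import json
import uuid
import datetime

# In a real system these would be injected at deploy time (env vars, CI),
# not hard-coded; values here are illustrative.
RELEASE_ID = "2024-06-01.3"
PROMPT_VERSION = "v13"
MODEL_ID = "example-model-2024-05"

def record_decision(request: dict, output: str) -> dict:
    """Emit a replayable record: what the system did, and with what versions."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "release_id": RELEASE_ID,
        "prompt_version": PROMPT_VERSION,
        "model_id": MODEL_ID,
        "input": request,
        "output": output,
    }

rec = record_decision({"claim": 4411}, "approved pending documents")
print(json.dumps(rec, indent=2))
```

When the auditor asks "what did you know at the time?", the answer is a query over these records, not an archaeology project.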
Migrations and backfills are where speed goes to die (unless you treat them like explosives)
Here’s an unsexy truth: your fanciest AI architecture is irrelevant if your data layer becomes a haunted house.
Stateful changes deserve their own discipline:
- expand/contract schema migrations
- dual-write / dual-read transitions where needed
- backfills with throttling, checkpointing, and progress visibility
- idempotency everywhere
- the ability to pause, resume, and roll back without improvisation
This is also where agentic coding is most dangerous: an LLM can “help” you write a migration in seconds… and destroy weeks of data integrity in one deploy.
Rule of thumb:
If it’s irreversible, it needs a two-key turn (human + automated gates).
You’re not distrusting AI. You’re respecting entropy.
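The backfill discipline above can be sketched in a few lines. The interfaces here are assumptions — a stable-ordered row iterator, an idempotent `apply_fn`, and a durable checkpoint store (in reality a DB table, not a dict) — but the shape is the point: throttled, checkpointed, resumable.

```python
import time

def backfill(rows, apply_fn, checkpoint, batch_size=500, sleep_s=0.1):
    """
    Resumable, throttled backfill sketch (interfaces are assumptions):
      - rows: iterable of (row_id, row) in a stable ascending order
      - apply_fn(row): idempotent -- safe to re-run on already-migrated rows
      - checkpoint: dict-like store that survives restarts (e.g. a DB table)
    Returns the number of rows processed this run.
    """
    last_done = checkpoint.get("last_id")
    processed = 0
    for row_id, row in rows:
        if last_done is not None and row_id <= last_done:
            continue                         # resume past completed work
        apply_fn(row)                        # must be idempotent
        processed += 1
        last_done = row_id
        if processed % batch_size == 0:
            checkpoint["last_id"] = last_done  # durable progress marker
            time.sleep(sleep_s)                # throttle: protect the database
    checkpoint["last_id"] = last_done
    return processed
```

Pausing is killing the process; resuming is running it again. Rolling back is a separate, equally idempotent job. No improvisation required.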
The founder’s cheat code: make rollback a muscle, not a myth
Most teams have rollback. Few teams can roll back under pressure without making it worse.
At scale, rollback must be:
- fast (minutes)
- safe (idempotent, no half-migrated state)
- rehearsed (you’ve done it when nothing was on fire)
This is the paradox founders love:
The teams that roll back quickly ship faster.
Because they’re not paralysed by fear. They know failure is recoverable.
If you want to vibe code, you must make failure cheap.
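"Making failure cheap" includes letting the machine pull the trigger. A sketch of an auto-rollback check evaluated post-deploy — metric names and thresholds here are illustrative, not recommendations:

```python
# Illustrative post-deploy guardrails; breach any of them and the
# rehearsed rollback path runs instead of a war-room debate.
THRESHOLDS = {
    "error_rate": 0.02,       # > 2% errors
    "p95_latency_ms": 800,
    "refusal_rate": 0.05,
}

def should_rollback(metrics: dict) -> list[str]:
    """Return the list of breached guardrails (empty means hold the line)."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

breaches = should_rollback({"error_rate": 0.031, "p95_latency_ms": 420})
print(breaches)
```

The thresholds come straight from the Evidence Bundle's guardrails section, so "what do we watch post-deploy?" and "when do we roll back?" are the same answer.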
A practical “vibes → production” pipeline that actually works
Here’s the non-negotiable shape of a high-velocity, regulated shipping machine:
- Prototype fast (vibes allowed)
- Auto-generate an Evidence Bundle (mechanical)
- Risk-classify the change (R0–R3)
- Gate accordingly (tests + behavioural evals + perf + security)
- Deploy with control (canary/shadow, explicit rollback)
- Monitor with teeth (SLOs + alerting + auto-rollback triggers)
- Feed incidents back into eval sets (tomorrow’s regressions)
This is how you get speed and safety without turning your engineering org into a compliance department.
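Stitched together, the pipeline is a short function. Every helper below is a stub standing in for the real mechanism described above — this is the shape, not an implementation:

```python
# Stub pipeline: generate evidence, classify risk, gate, then deploy.
def generate_evidence_bundle(change: dict) -> dict:
    return {"change": change, "evidence": {"tests": "pass"}}

def classify_risk(change: dict) -> str:
    return change.get("tier", "R1")

# Each gate inspects the change plus its bundle and votes pass/fail.
GATE_CHECKS = {
    "R0": [lambda c, b: b["evidence"]["tests"] == "pass"],
    "R1": [lambda c, b: b["evidence"]["tests"] == "pass",
           lambda c, b: c.get("behavioural_evals_pass", False)],
}

def ship(change: dict) -> tuple:
    bundle = generate_evidence_bundle(change)   # mechanical, not a wiki page
    tier = classify_risk(change)                # R0-R3
    for gate in GATE_CHECKS.get(tier, []):
        if not gate(change, bundle):
            return ("rejected", bundle)
    # deploy with control, monitor with teeth (canary, SLOs, auto-rollback),
    # then feed incidents back into the eval sets
    return ("deployed", bundle)

status, _ = ship({"tier": "R1", "behavioural_evals_pass": True})
print(status)
```

Notice what's absent: no meetings, no sign-off spreadsheets. The controls are code, so they run every time.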
The closing punch: vibe coding is fine. Shipping vibes is not.
Founders love speed because it wins markets. Regulators love evidence because it prevents harm. Production loves to fail because physics is petty.
You can satisfy all three.
Vibe code your prototypes. Vibe code your drafts. Vibe code your internal tools.
But when you ship production at massive scale—especially in regulated industries—make your org allergic to “trust me”.
Ship evidence. Make auditors bored. Make rollbacks boring. Make correctness measurable. Then move like a lunatic.
That’s the game.