Vibe Coding in Regulated Production

Ship Like a Maniac, But Bring Receipts

“Vibe coding” is the new caffeine: intoxicating, productivity-boosting, and capable of ruining your week if you treat it like water.

In a hackathon, vibe coding is a flex. In production at scale—especially in regulated industries—it’s either a competitive advantage or a creative way to summon auditors, outages, and reputational damage in one elegant PR.

Here’s the thing founders tend to miss (because founders are built out of optimism and cortisol):

Regulated doesn’t mean slow. Regulated means you can’t blag it.

You can absolutely move fast. You just need to move fast with evidence, not vibes.

This article is a founder-facing playbook for how to do exactly that.


The real problem isn’t AI-written code. It’s unverified change.

LLM-generated code has a specific failure mode: plausibility.

It reads clean. It compiles. It passes the tests you thought to write. It “looks right”.

And then it detonates in production because:

  • the edge case you didn’t anticipate is now happening 10,000 times per minute,
  • a “harmless” query turns into a database lock festival,
  • concurrency turns your lovely logic into a probabilistic crime scene,
  • a rollout quietly changes behaviour in a way you don’t notice until revenue drops or complaints spike.
🔬 Treat code as a hypothesis.
🧪 Treat production behaviour as the experiment.
📊 Treat evidence as the only thing that counts.

Your job is to satisfy three audiences (and none of them care about your vibes)

When you ship production systems in regulated environments, you’re always answering to:

  • 👤 Users: "Does it work? Is it fast? Does it feel good?"
  • 🚨 The Pager: "Does it break? Can we recover quickly?"
  • 📋 Auditors: "Can you show what changed, why, and how you controlled risk?"

Founders often optimise for #1, sometimes #2, and treat #3 like a box-ticking exercise.

That’s backwards.

If you solve #3 properly, you often get #2 “for free”, and #1 becomes easier because you can ship more aggressively without fearing your own shadow.

The trick is to turn “compliance” into an engineering artefact, not a meeting series.


Compliance isn’t paperwork. It’s traceability + control

People hear “regulated” and imagine some Dickensian clerk demanding a Word document with 17 headings.

In reality, what you need is boring, mechanical, and automatable:

The Five Pillars

  1. Traceability — what changed, when, by whom, and why
  2. Control — what gates existed, and whether they functioned
  3. Replayability — can we reconstruct what the system did with the information we had at the time
  4. Detectability — can we notice when reality diverges from intent
  5. Recoverability — can we undo damage quickly

None of that requires heavy bureaucracy. It requires systems.

If you want to vibe code in production, you need to accept one brutal truth:

You don't get to ship code.
You get to ship evidence that the change is safe.

The Evidence Bundle

core patterns

If you do one thing after reading this, do this:

Every production-impacting change produces an Evidence Bundle automatically.

Not a wiki page. Not a Slack message. A generated artefact tied to the deploy.

Your Evidence Bundle should answer:

📦 The Evidence Bundle

  • What changed? — diff, config, prompts, model ID/routing, infra
  • Why? — intent + scope
  • What could go wrong? — risk class + failure modes
  • What evidence did we gather? — tests, evals, perf, security checks
  • What guardrails exist? — alerts, SLOs, auto-rollbacks, canary plan
  • How do we roll it back? — explicit rollback steps, not "git revert and pray"
  • What do we watch post-deploy? — dashboards + thresholds
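As a sketch of how mechanical this can be: the bundle is just a structured artefact, generated at deploy time and stamped with a content hash so it can't quietly drift after the fact. The field names and example values below are illustrative, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class EvidenceBundle:
    diff_sha: str                                   # what changed (commit of the deployed diff)
    intent: str                                     # why
    risk_class: str                                 # R0-R3
    failure_modes: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)    # tests, evals, perf, security checks
    guardrails: list = field(default_factory=list)  # alerts, SLOs, canary plan
    rollback_steps: list = field(default_factory=list)
    watch: dict = field(default_factory=dict)       # dashboards + thresholds

    def artefact(self) -> str:
        """Serialise deterministically and stamp with a content hash,
        so the bundle can be attached immutably to the deploy record."""
        body = json.dumps(asdict(self), sort_keys=True)
        digest = hashlib.sha256(body.encode()).hexdigest()
        return json.dumps({"bundle": asdict(self), "sha256": digest}, sort_keys=True)

bundle = EvidenceBundle(
    diff_sha="a1b2c3d",
    intent="Reduce p95 latency on the quote endpoint",
    risk_class="R1",
    failure_modes=["cache staleness", "error-rate increase under load"],
    evidence={"unit_tests": "pass", "behavioural_evals": "pass", "perf": "p95 -18%"},
    guardrails=["error-rate alert > 1%", "canary at 5% for 30 min"],
    rollback_steps=["flip traffic back to previous release", "invalidate cache"],
    watch={"p95_latency_ms": 250, "error_rate_pct": 1.0},
)
print(bundle.artefact())
```

The point of the hash is replayability: whatever the auditor sees later is provably the artefact that shipped with the deploy.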

This transforms “we move fast” from a vibe into a machine.

And—this is the part founders should care about—it makes your org more fearless, which makes you faster.


Not all changes deserve the same brakes

The single most common operational mistake I see in “fast” teams is treating every commit like it has the same blast radius.

Regulated environments don’t punish speed. They punish reckless uniformity.

So risk-tier changes and gate accordingly. Keep it simple:

  • R0 — Cosmetic: low stakes. Ship it.
  • R1 — Behavioural: logic changes, prompts, routing, model selection.
  • R2 — Stateful: schema changes, backfills, idempotency-sensitive workflows.
  • R3 — Catastrophic: auth, payments, data deletion, safety filters. Anything that can permanently harm users or your balance sheet.

Then enforce different gates per tier.

The goal isn’t to slow down. The goal is to apply friction only where irreversible damage is possible.

That’s how you keep velocity without turning your production environment into a casino.
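Tiered gating is small enough to be a lookup table. The gate names below are hypothetical, not a real framework; the shape is what matters — each tier inherits the cheaper tier's gates and adds friction only where the blast radius grows.

```python
# Map risk tiers (R0-R3) to the gates a change must pass before deploy.
# Gate names are illustrative; plug in whatever your CI/CD actually runs.
GATES_BY_TIER = {
    "R0": ["ci_tests"],
    "R1": ["ci_tests", "behavioural_evals", "canary"],
    "R2": ["ci_tests", "behavioural_evals", "canary",
           "migration_review", "rollback_rehearsal"],
    "R3": ["ci_tests", "behavioural_evals", "canary",
           "migration_review", "rollback_rehearsal",
           "human_approval"],  # the two-key turn for irreversible changes
}

def required_gates(tier: str) -> list:
    if tier not in GATES_BY_TIER:
        raise ValueError(f"unknown risk tier: {tier}")
    return GATES_BY_TIER[tier]

def may_deploy(tier: str, passed: set) -> bool:
    """A change ships only when every gate for its tier has functioned."""
    return all(gate in passed for gate in required_gates(tier))
```

An R0 typo fix clears one gate and ships; an R3 auth change doesn't move without a human signature. Same pipeline, different brakes.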


The dirty secret of AI products: behaviour is part of your API

In traditional systems, “regression testing” is mostly deterministic: code in → output out.

In AI-heavy systems, you are shipping stochastic behaviour:

  • prompts change outcomes
  • routing policies change outcomes
  • model updates change outcomes
  • safety filters change outcomes
  • tool calls fail weirdly
  • distribution shifts over time

So if you’re vibe coding an AI product and you don’t have behavioural regression tests, you’re basically doing surgery with a blindfold and calling it “bold”.

What works:

  • Golden Sets — curated prompts and scenarios with expected properties.
  • Metamorphic Tests — paraphrase inputs; the output should stay within bounds.
  • Invariants — no secrets, no policy violations, bounded refusal rate, stable formatting.
  • Shadow Traffic — compare new vs old on real requests without affecting users.
  • Distribution Monitoring — output drift, refusal drift, toxicity drift, latency drift: the signals that tell you reality has shifted under your feet.

I’ve used multi-agent evaluation patterns where the system has to argue for correctness—e.g., native-speaker debate panels for translation quality and cultural bias detection—because “sounds plausible” is a trap at scale. 

This is the point: you don’t need more “testing”. You need the right kind of testing.


The “highly regulated” bit: audits want replay, not opinions

⚖️ When Something Goes Wrong, You'll Be Asked

  1. What did you know at the time?
  2. What controls existed?
  3. Did you follow them?
  4. Can you reconstruct the decision path?

So build systems that make those questions boring:

  • version everything that affects behaviour (code, config, prompts, model IDs, routing)
  • correlate user-visible outcomes with deployed versions (trace IDs + release IDs)
  • immutable deploy records (who/what/when/why)
  • reproducible builds
  • data lineage for training/fine-tuning/evaluation sets
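The mechanism behind "correlate outcomes with deployed versions" is just an append-only deploy record plus a release ID stamped on every trace. A minimal sketch, with illustrative names and an in-memory list standing in for immutable storage:

```python
import time
import uuid

# In this sketch a list plays the role of an append-only deploy log;
# in production this would be immutable, durable storage.
DEPLOY_LOG = []

def record_deploy(who: str, why: str, versions: dict) -> str:
    """`versions` should capture everything that affects behaviour:
    code SHA, config, prompt version, model ID, routing policy."""
    release_id = str(uuid.uuid4())
    DEPLOY_LOG.append({
        "release_id": release_id,
        "who": who,
        "why": why,
        "when": time.time(),
        "versions": versions,
    })
    return release_id

def explain_trace(trace: dict) -> dict:
    """Given a request trace stamped with its release_id, reconstruct
    exactly what was live when that request was served."""
    for record in DEPLOY_LOG:
        if record["release_id"] == trace["release_id"]:
            return record["versions"]
    raise LookupError("no deploy record for this trace")
```

With this in place, "what did you know at the time?" is a lookup, not an archaeology project.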

This is not “governance theatre”. It’s an operational superpower.

In practice, this is how you keep agentic automation safe in environments where “oops” is not an acceptable incident classification. I’ve shipped multi-agent systems in insurance contexts where mistakes are expensive and scrutiny is real—what makes it sustainable is converting risk into controls that are executable, observable, and reviewable. 


Migrations and backfills are where speed goes to die (unless you treat them like explosives)

Here’s an unsexy truth: your fanciest AI architecture is irrelevant if your data layer becomes a haunted house.

Stateful changes deserve their own discipline:

  • expand/contract schema migrations
  • dual-write / dual-read transitions where needed
  • backfills with throttling, checkpointing, and progress visibility
  • idempotency everywhere
  • the ability to pause, resume, and roll back without improvisation
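The checkpointing-plus-idempotency discipline above can be sketched in a few lines. This assumes an iterable of `rows` and an `apply(row)` that is safe to repeat — which it must be, because a crash mid-batch means the batch is replayed on resume:

```python
def run_backfill(rows, apply, checkpoint, batch_size=100):
    """Process rows in batches, persisting progress after each batch so
    the job can be paused and resumed without starting from zero.
    `checkpoint` is any dict-like durable store; `apply` must be
    idempotent, since an interrupted batch is replayed on resume."""
    start = checkpoint.get("last_done", -1)
    batch = []
    for i, row in enumerate(rows):
        if i <= start:
            continue  # already processed in a previous run
        batch.append((i, row))
        if len(batch) == batch_size:
            for _, r in batch:
                apply(r)
            checkpoint["last_done"] = batch[-1][0]  # progress survives a crash
            batch = []
    for _, r in batch:  # final partial batch
        apply(r)
    if batch:
        checkpoint["last_done"] = batch[-1][0]
```

Throttling and progress dashboards bolt onto the same loop; the important property is that re-running the job after any failure is boring, not heroic.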

This is also where agentic coding is most dangerous: an LLM can “help” you write a migration in seconds… and destroy weeks of data integrity in one deploy.

🔐 If it's irreversible, it needs a two-key turn. Human gate + automated gate. No exceptions.
You’re not distrusting AI. You’re respecting entropy.


The founder’s cheat code: make rollback a muscle, not a myth

Most teams have rollback. Few teams can roll back under pressure without making it worse.

At scale, rollback must be:

  • fast (minutes)
  • safe (idempotent, no half-migrated state)
  • rehearsed (you’ve done it when nothing was on fire)

This is the paradox founders love:

The Paradox
The teams that roll back quickly ship faster.
Because they're not paralysed by fear. They know failure is recoverable.

If you want to vibe code, you must make failure cheap.
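One way failure gets cheap is an auto-rollback trigger that watches canary metrics against the thresholds declared in the Evidence Bundle. The metric names and limits below are illustrative; the design point is that rollback is decided by threshold, not by a 2 a.m. debate:

```python
# Thresholds would come from the deploy's Evidence Bundle; hard-coded here.
THRESHOLDS = {"error_rate_pct": 1.0, "p95_latency_ms": 300}

def should_rollback(metrics: dict) -> list:
    """Return the breached SLOs; any breach is grounds for rollback."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

def monitor_canary(samples, rollback):
    """Walk a stream of metric samples; on the first breach, trigger
    the rollback callback and stop. Returns the breached SLOs (if any)."""
    for metrics in samples:
        breached = should_rollback(metrics)
        if breached:
            rollback(breached)
            return breached
    return []
```

Rehearse the `rollback` callback while nothing is on fire, and the paradox above stops being a paradox: recoverable failure is what licenses aggressive shipping.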


A practical “vibes → production” pipeline that actually works

  1. Prototype fast — vibes allowed.
  2. Auto-generate an Evidence Bundle — mechanical, tied to the deploy.
  3. Risk-classify the change — R0–R3. Route accordingly.
  4. Gate accordingly — tests + behavioural evals + perf + security.
  5. Deploy with control — canary/shadow, explicit rollback plan.
  6. Monitor with teeth — SLOs + alerting + auto-rollback triggers.
  7. Feed incidents back into eval sets — tomorrow's regressions become today's tests. The loop closes.

Vibe coding is fine.

Shipping vibes is not.

The closing punch

Founders love speed because it wins markets. Regulators love evidence because it prevents harm. Production loves to fail because physics is petty.

You can satisfy all three.

Vibe code your prototypes. Vibe code your drafts. Vibe code your internal tools.

But when you ship production at massive scale—especially in regulated industries—make your org allergic to “trust me”.

TL;DR
Ship evidence. Make auditors bored. Make rollbacks boring. Make correctness measurable. Then move like a lunatic.

That’s the game.
