Vibe Coding in Regulated Production

Ship Like a Maniac, But Bring Receipts

“Vibe coding” is the new caffeine: intoxicating, productivity-boosting, and capable of ruining your week if you treat it like water.

In a hackathon, vibe coding is a flex. In production at scale—especially in regulated industries—it’s either a competitive advantage or a creative way to summon auditors, outages, and reputational damage in one elegant PR.

Here’s the thing founders tend to miss (because founders are built out of optimism and cortisol):

Regulated doesn’t mean slow. Regulated means you can’t blag it.

You can absolutely move fast. You just need to move fast with evidence, not vibes.

This article is a founder-facing playbook for how to do exactly that.


The real problem isn’t AI-written code. It’s unverified change.

LLM-generated code has a specific failure mode: plausibility.

It reads clean. It compiles. It passes the tests you thought to write. It “looks right”.

And then it detonates in production because:

  • the edge case you didn’t anticipate is now happening 10,000 times per minute,
  • a “harmless” query turns into a database lock festival,
  • concurrency turns your lovely logic into a probabilistic crime scene,
  • a rollout quietly changes behaviour in a way you don’t notice until revenue drops or complaints spike.
🔬 Treat code as a hypothesis.
🧪 Treat production behaviour as the experiment.
📊 Treat evidence as the only thing that counts.

Your job is to satisfy three audiences (and none of them care about your vibes)

When you ship production systems in regulated environments, you’re always answering to:

  • 👤 Users: "Does it work? Is it fast? Does it feel good?"
  • 🚨 The Pager: "Does it break? Can we recover quickly?"
  • 📋 Auditors: "Can you show what changed, why, and how you controlled risk?"

Founders often optimise for #1, sometimes #2, and treat #3 like a box-ticking exercise.

That’s backwards.

If you solve #3 properly, you often get #2 “for free”, and #1 becomes easier because you can ship more aggressively without fearing your own shadow.

The trick is to turn “compliance” into an engineering artefact, not a meeting series.


Compliance isn’t paperwork. It’s traceability + control

People hear “regulated” and imagine some Dickensian clerk demanding a Word document with 17 headings.

In reality, what you need is boring, mechanical, and automatable:

The Five Pillars

  1. Traceability — what changed, when, by whom, and why
  2. Control — what gates existed, and whether they functioned
  3. Replayability — can we reconstruct what the system did with the information we had at the time
  4. Detectability — can we notice when reality diverges from intent
  5. Recoverability — can we undo damage quickly

None of that requires heavy bureaucracy. It requires systems.

If you want to vibe code in production, you need to accept one brutal truth:

You don't get to ship code.
You get to ship evidence that the change is safe.

The Evidence Bundle

core patterns

If you do one thing after reading this, do this:

Every production-impacting change produces an Evidence Bundle automatically.

Not a wiki page. Not a Slack message. A generated artefact tied to the deploy.

Your Evidence Bundle should answer:

📦 The Evidence Bundle

  • What changed? — diff, config, prompts, model ID/routing, infra
  • Why? — intent + scope
  • What could go wrong? — risk class + failure modes
  • What evidence did we gather? — tests, evals, perf, security checks
  • What guardrails exist? — alerts, SLOs, auto-rollbacks, canary plan
  • How do we roll it back? — explicit rollback steps, not "git revert and pray"
  • What do we watch post-deploy? — dashboards + thresholds
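As a sketch of how mechanical this can be: the bundle is just a structured artefact, generated at deploy time and stamped with a content hash so it can't quietly drift after the fact. The field names and example values below are illustrative, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class EvidenceBundle:
    diff_sha: str                                   # what changed (commit of the deployed diff)
    intent: str                                     # why
    risk_class: str                                 # R0-R3
    failure_modes: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)    # tests, evals, perf, security checks
    guardrails: list = field(default_factory=list)  # alerts, SLOs, canary plan
    rollback_steps: list = field(default_factory=list)
    watch: dict = field(default_factory=dict)       # dashboards + thresholds

    def artefact(self) -> str:
        """Serialise deterministically and stamp with a content hash,
        so the bundle can be attached immutably to the deploy record."""
        body = json.dumps(asdict(self), sort_keys=True)
        digest = hashlib.sha256(body.encode()).hexdigest()
        return json.dumps({"bundle": asdict(self), "sha256": digest}, sort_keys=True)

bundle = EvidenceBundle(
    diff_sha="a1b2c3d",
    intent="Reduce p95 latency on the quote endpoint",
    risk_class="R1",
    failure_modes=["cache staleness", "error-rate increase under load"],
    evidence={"unit_tests": "pass", "behavioural_evals": "pass", "perf": "p95 -18%"},
    guardrails=["error-rate alert > 1%", "canary at 5% for 30 min"],
    rollback_steps=["flip traffic back to previous release", "invalidate cache"],
    watch={"p95_latency_ms": 250, "error_rate_pct": 1.0},
)
print(bundle.artefact())
```

The point of the hash is replayability: whatever the auditor sees later is provably the artefact that shipped with the deploy.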

This transforms “we move fast” from a vibe into a machine.

And—this is the part founders should care about—it makes your org more fearless, which makes you faster.


Not all changes deserve the same brakes

The single most common operational mistake I see in “fast” teams is treating every commit like it has the same blast radius.

Regulated environments don’t punish speed. They punish reckless uniformity.

So risk-tier changes and gate accordingly. Keep it simple:

  • R0 — Cosmetic: low stakes. Ship it.
  • R1 — Behavioural: logic changes, prompts, routing, model selection.
  • R2 — Stateful: schema changes, backfills, idempotency-sensitive workflows.
  • R3 — Catastrophic: auth, payments, data deletion, safety filters. Anything that can permanently harm users or your balance sheet.

Then enforce different gates per tier.

The goal isn’t to slow down. The goal is to apply friction only where irreversible damage is possible.

That’s how you keep velocity without turning your production environment into a casino.
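Tiered gating is small enough to be a lookup table. The gate names below are hypothetical, not a real framework; the shape is what matters — each tier inherits the cheaper tier's gates and adds friction only where the blast radius grows.

```python
# Map risk tiers (R0-R3) to the gates a change must pass before deploy.
# Gate names are illustrative; plug in whatever your CI/CD actually runs.
GATES_BY_TIER = {
    "R0": ["ci_tests"],
    "R1": ["ci_tests", "behavioural_evals", "canary"],
    "R2": ["ci_tests", "behavioural_evals", "canary",
           "migration_review", "rollback_rehearsal"],
    "R3": ["ci_tests", "behavioural_evals", "canary",
           "migration_review", "rollback_rehearsal",
           "human_approval"],  # the two-key turn for irreversible changes
}

def required_gates(tier: str) -> list:
    if tier not in GATES_BY_TIER:
        raise ValueError(f"unknown risk tier: {tier}")
    return GATES_BY_TIER[tier]

def may_deploy(tier: str, passed: set) -> bool:
    """A change ships only when every gate for its tier has functioned."""
    return all(gate in passed for gate in required_gates(tier))
```

An R0 typo fix clears one gate and ships; an R3 auth change doesn't move without a human signature. Same pipeline, different brakes.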


The dirty secret of AI products: behaviour is part of your API

In traditional systems, “regression testing” is mostly deterministic: code in → output out.

In AI-heavy systems, you are shipping stochastic behaviour:

  • prompts change outcomes
  • routing policies change outcomes
  • model updates change outcomes
  • safety filters change outcomes
  • tool calls fail weirdly
  • distribution shifts over time

So if you’re vibe coding an AI product and you don’t have behavioural regression tests, you’re basically doing surgery with a blindfold and calling it “bold”.

What works:

  • Golden Sets — curated prompts and scenarios with expected properties.
  • Metamorphic Tests — paraphrase inputs; the output should stay within bounds.
  • Invariants — no secrets, no policy violations, bounded refusal rate, stable formatting.
  • Shadow Traffic — compare new vs old on real requests without affecting users.
  • Distribution Monitoring — output drift, refusal drift, toxicity drift, latency drift: the signals that tell you reality has shifted under your feet.

I’ve used multi-agent evaluation patterns where the system has to argue for correctness—e.g., native-speaker debate panels for translation quality and cultural bias detection—because “sounds plausible” is a trap at scale. 

This is the point: you don’t need more “testing”. You need the right kind of testing.


The “highly regulated” bit: audits want replay, not opinions

⚖️ When Something Goes Wrong, You'll Be Asked

  1. What did you know at the time?
  2. What controls existed?
  3. Did you follow them?
  4. Can you reconstruct the decision path?

So build systems that make those questions boring:

  • version everything that affects behaviour (code, config, prompts, model IDs, routing)
  • correlate user-visible outcomes with deployed versions (trace IDs + release IDs)
  • immutable deploy records (who/what/when/why)
  • reproducible builds
  • data lineage for training/fine-tuning/evaluation sets
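The mechanism behind "correlate outcomes with deployed versions" is just an append-only deploy record plus a release ID stamped on every trace. A minimal sketch, with illustrative names and an in-memory list standing in for immutable storage:

```python
import time
import uuid

# In this sketch a list plays the role of an append-only deploy log;
# in production this would be immutable, durable storage.
DEPLOY_LOG = []

def record_deploy(who: str, why: str, versions: dict) -> str:
    """`versions` should capture everything that affects behaviour:
    code SHA, config, prompt version, model ID, routing policy."""
    release_id = str(uuid.uuid4())
    DEPLOY_LOG.append({
        "release_id": release_id,
        "who": who,
        "why": why,
        "when": time.time(),
        "versions": versions,
    })
    return release_id

def explain_trace(trace: dict) -> dict:
    """Given a request trace stamped with its release_id, reconstruct
    exactly what was live when that request was served."""
    for record in DEPLOY_LOG:
        if record["release_id"] == trace["release_id"]:
            return record["versions"]
    raise LookupError("no deploy record for this trace")
```

With this in place, "what did you know at the time?" is a lookup, not an archaeology project.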

This is not “governance theatre”. It’s an operational superpower.

In practice, this is how you keep agentic automation safe in environments where “oops” is not an acceptable incident classification. I’ve shipped multi-agent systems in insurance contexts where mistakes are expensive and scrutiny is real—what makes it sustainable is converting risk into controls that are executable, observable, and reviewable. 


Migrations and backfills are where speed goes to die (unless you treat them like explosives)

Here’s an unsexy truth: your fanciest AI architecture is irrelevant if your data layer becomes a haunted house.

Stateful changes deserve their own discipline:

  • expand/contract schema migrations
  • dual-write / dual-read transitions where needed
  • backfills with throttling, checkpointing, and progress visibility
  • idempotency everywhere
  • the ability to pause, resume, and roll back without improvisation
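The checkpointing-plus-idempotency discipline above can be sketched in a few lines. This assumes an iterable of `rows` and an `apply(row)` that is safe to repeat — which it must be, because a crash mid-batch means the batch is replayed on resume:

```python
def run_backfill(rows, apply, checkpoint, batch_size=100):
    """Process rows in batches, persisting progress after each batch so
    the job can be paused and resumed without starting from zero.
    `checkpoint` is any dict-like durable store; `apply` must be
    idempotent, since an interrupted batch is replayed on resume."""
    start = checkpoint.get("last_done", -1)
    batch = []
    for i, row in enumerate(rows):
        if i <= start:
            continue  # already processed in a previous run
        batch.append((i, row))
        if len(batch) == batch_size:
            for _, r in batch:
                apply(r)
            checkpoint["last_done"] = batch[-1][0]  # progress survives a crash
            batch = []
    for _, r in batch:  # final partial batch
        apply(r)
    if batch:
        checkpoint["last_done"] = batch[-1][0]
```

Throttling and progress dashboards bolt onto the same loop; the important property is that re-running the job after any failure is boring, not heroic.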

This is also where agentic coding is most dangerous: an LLM can “help” you write a migration in seconds… and destroy weeks of data integrity in one deploy.

🔐 If it's irreversible, it needs a two-key turn. Human gate + automated gate. No exceptions.
You’re not distrusting AI. You’re respecting entropy.


The founder’s cheat code: make rollback a muscle, not a myth

Most teams have rollback. Few teams can roll back under pressure without making it worse.

At scale, rollback must be:

  • fast (minutes)
  • safe (idempotent, no half-migrated state)
  • rehearsed (you’ve done it when nothing was on fire)

This is the paradox founders love:

The Paradox
The teams that roll back quickly ship faster.
Because they're not paralysed by fear. They know failure is recoverable.

If you want to vibe code, you must make failure cheap.
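One way failure gets cheap is an auto-rollback trigger that watches canary metrics against the thresholds declared in the Evidence Bundle. The metric names and limits below are illustrative; the design point is that rollback is decided by threshold, not by a 2 a.m. debate:

```python
# Thresholds would come from the deploy's Evidence Bundle; hard-coded here.
THRESHOLDS = {"error_rate_pct": 1.0, "p95_latency_ms": 300}

def should_rollback(metrics: dict) -> list:
    """Return the breached SLOs; any breach is grounds for rollback."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

def monitor_canary(samples, rollback):
    """Walk a stream of metric samples; on the first breach, trigger
    the rollback callback and stop. Returns the breached SLOs (if any)."""
    for metrics in samples:
        breached = should_rollback(metrics)
        if breached:
            rollback(breached)
            return breached
    return []
```

Rehearse the `rollback` callback while nothing is on fire, and the paradox above stops being a paradox: recoverable failure is what licenses aggressive shipping.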


A practical “vibes → production” pipeline that actually works

  1. Prototype fast — vibes allowed.
  2. Auto-generate an Evidence Bundle — mechanical, tied to the deploy.
  3. Risk-classify the change — R0–R3. Route accordingly.
  4. Gate accordingly — tests + behavioural evals + perf + security.
  5. Deploy with control — canary/shadow, explicit rollback plan.
  6. Monitor with teeth — SLOs + alerting + auto-rollback triggers.
  7. Feed incidents back into eval sets — tomorrow's regressions become today's tests. The loop closes.

Vibe coding is fine.

Shipping vibes is not.

The closing punch

Founders love speed because it wins markets. Regulators love evidence because it prevents harm. Production loves to fail because physics is petty.

You can satisfy all three.

Vibe code your prototypes. Vibe code your drafts. Vibe code your internal tools.

But when you ship production at massive scale—especially in regulated industries—make your org allergic to “trust me”.

TL;DR
Ship evidence. Make auditors bored. Make rollbacks boring. Make correctness measurable. Then move like a lunatic.

That’s the game.
