
The Game Theory of AI Safety Talk

Why what labs say about safety is a strategic signal, not a statement of values — and what that means for regulation.


Here's something that should make you uncomfortable: the AI labs that talk loudest about safety aren't necessarily the ones taking it most seriously. And the labs dismissing safety concerns? They might be making perfectly rational strategic choices. Understanding why requires looking at AI safety rhetoric not as moral positioning but as game theory — cold, mathematical, and uncomfortably predictive.

Every major AI laboratory faces an identical structural problem. They're building increasingly powerful systems under three brutal constraints: no binding international regulation, intense competitive pressure from rivals with billions in funding, and genuine Knightian uncertainty about catastrophic risks. In this environment, what a lab says about safety becomes a strategic variable — something to be optimised against competitors, regulators, and public opinion. Not merely expressed. Optimised.

This isn't cynicism. It's mechanism design. And once you see it, every safety announcement, every responsible scaling policy, every CEO essay about existential risk starts to look different.


The Signalling Trap

Traditional signalling theory — Spence's job market model, Zahavi's handicap principle in evolutionary biology — delivers a brutal insight: credible signals must be costly. A peacock's tail works as a signal precisely because it's metabolically expensive and makes the bird vulnerable to predators. A degree signals ability because it costs years of effort and foregone income. The cost isn't a bug. The cost is what makes the signal credible.

AI safety rhetoric is cheap. Devastatingly, catastrophically cheap.

Any lab can publish a responsible scaling policy — it costs nothing but a few hours of executive time and a blog post. Any CEO can give a TED talk about existential risk — it generates positive PR and costs nothing operationally. Any company can create a "safety team," issue a press release, sign an open letter alongside competitors. These actions cost essentially nothing relative to the billions being poured into capability research. OpenAI's safety team budget is a rounding error on their compute costs. Anthropic's Constitutional AI papers don't slow their model releases.

When signalling is cheap, equilibrium rhetoric becomes a function of strategic position, not genuine commitment. Words become moves in a game — and cheap moves at that.

This creates a peculiar and deeply troubling dynamic. The volume of safety talk tells you almost nothing about actual safety practices. A lab that publishes ten papers on AI alignment and employs fifty safety researchers might be more or less safe than a lab that publishes nothing and employs no one dedicated to it. You cannot infer behaviour from rhetoric when rhetoric is costless.

What matters instead is the structure of who says what, when, and why — the pattern of cheap talk across the industry and the strategic logic that makes certain labs scream about safety while others stay quiet.


The Strategic Coherence Framework

Every AI lab occupies a position in a two-dimensional strategic space defined by variables that are observable, measurable, and consequential:

  • Capability (C) ∈ [0, 1]: How powerful are their models? Where do they sit on the frontier? Measurable through benchmarks, compute scale, parameter counts, and revealed performance on difficult tasks.
  • Openness (O) ∈ [0, 1]: Do they open-source weights? Publish research freely? Share training methodologies? Observable through their GitHub repos, arXiv submissions, and model release policies.

Given a lab's position (C, O), the framework predicts an equilibrium level of safety rhetoric S*(C, O) that maximises their strategic payoff against competitors, regulators, talent markets, and public opinion. This isn't really a choice — it's a Nash equilibrium. A gravitational attractor. Deviate too far from S* and you face immediate costs: unwanted regulatory attention if you're too quiet at high capability, competitive disadvantage if you're too loud at low capability, talent flight if your positioning doesn't match what researchers want to believe about their employer.

S*(C, O) = αC − βOC − β'(1−O)C + γ𝟙[O < τ_O, C > τ_C]

Equilibrium safety rhetoric as a function of capability and openness.

The first term (αC) reflects that higher-capability labs face more scrutiny and must signal more safety concern to maintain social licence. The second term (−βOC) captures the substitution effect: open labs can point to "transparency as safety" and need less explicit safety rhetoric. The third term (−β'(1−O)C) represents the cost closed labs pay for appearing secretive. The indicator function adds a bonus for closed frontier labs — they're the ones regulators watch most closely, so they invest most heavily in safety theatre.

But rhetoric isn't behaviour. We need a separate function A*(C, O) for equilibrium safety behaviour — actual investments in red-teaming, alignment research, deployment delays, and safety infrastructure. This surface has a different shape: flatter, less responsive to strategic positioning, more determined by genuine technical requirements and less by PR considerations.

The baseline talk–walk gap G*(C, O) = S*(C, O) − A*(C, O) captures the structural inflation of rhetoric over behaviour that the competitive environment generates. This isn't about individual dishonesty — it's about equilibrium forces that push every lab toward saying more about safety than they do about it.
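To make the framework concrete, here is a minimal Python sketch of S*, A*, and G*. The functional form of S* mirrors the equation above; the flatter A* surface and every parameter value (α, β, β′, γ, τ_O, τ_C, κ) are illustrative assumptions rather than fitted estimates.

```python
# Minimal sketch of the strategic coherence framework. All parameter values
# are illustrative assumptions, not estimates drawn from real labs.

def s_star(c: float, o: float,
           alpha: float = 1.0, beta: float = 0.4, beta_prime: float = 0.2,
           gamma: float = 0.3, tau_o: float = 0.5, tau_c: float = 0.7) -> float:
    """Equilibrium safety rhetoric S*(C, O)."""
    scrutiny = alpha * c                       # more capability, more scrutiny
    substitution = -beta * o * c               # open labs lean on "transparency as safety"
    secrecy_cost = -beta_prime * (1 - o) * c   # penalty term for closed labs
    frontier_bonus = gamma if (o < tau_o and c > tau_c) else 0.0  # closed frontier labs
    return scrutiny + substitution + secrecy_cost + frontier_bonus

def a_star(c: float, o: float, kappa: float = 0.35) -> float:
    """Equilibrium safety behaviour A*(C, O): flatter, driven mostly by
    technical requirements (placeholder functional form)."""
    return kappa * c

def g_star(c: float, o: float) -> float:
    """Baseline talk-walk gap G*(C, O) = S*(C, O) - A*(C, O)."""
    return s_star(c, o) - a_star(c, o)
```

With these made-up parameters, a closed frontier lab at (C = 0.9, O = 0.1) gets a markedly higher predicted rhetoric level than an equally capable open lab, which is the qualitative pattern the four regimes below describe.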


The Four Regimes

The mathematics predicts four distinct regions in capability–openness space, each with its own characteristic rhetoric pattern.

Regime I: The Cautious Frontier · High C, Low O

Labs like Anthropic, OpenAI, and Google DeepMind cluster here. They're building the most powerful systems humanity has ever created while keeping weights closed and methodologies proprietary. Heavy safety rhetoric is strategically optimal along multiple dimensions simultaneously:

  • Regulatory cover: "We take this seriously, don't regulate us — we're already self-regulating."
  • Competitive differentiation: "Unlike those reckless open-source cowboys, we're responsible."
  • Talent attraction: Safety-concerned researchers want to work somewhere they believe cares.
  • Justification for closure: "We're being responsible by not releasing weights — this is dangerous technology."

The rhetoric is often genuine at the individual level — many people at these labs sincerely believe they're doing important safety work. But the structural point is that genuine belief isn't necessary for the equilibrium to hold. Even a lab full of cynics would converge to similar rhetoric because the payoff matrix demands it. The alignment between belief and incentive is what makes this regime so stable — and so hard to distinguish from pure strategic positioning.

Regime II: The Open Frontier · High C, High O

Meta's LLaMA strategy occupies this territory. Open-source frontier models released for anyone to download, fine-tune, and deploy. Here, safety rhetoric becomes more complicated. Too much doom-and-gloom undermines the core value proposition: "openness enables safety through community scrutiny." Too little invites regulatory backlash from politicians who've watched Terminator.

The equilibrium: emphasise different safety arguments. Transparency as safety. Collective oversight beating closed-door development. Democratised access reducing concentration risk. The message shifts from "be afraid, trust us" to "be empowered, verify yourself." This is still strategic positioning — it's just a different optimal point on the manifold.

Regime III: The Quiet Builders · Low C, Low O

Smaller closed labs, well-funded startups, corporate AI divisions at non-tech companies. Limited capabilities, limited openness, limited visibility. Minimal safety rhetoric is optimal — shouting about safety would draw attention they can't handle, invite scrutiny of modest safety investments, divert limited resources from capability building to signalling, and potentially trigger regulatory burdens they're not equipped to bear.

Silence is their equilibrium. They whisper about safety because the payoff matrix punishes them for shouting.

Regime IV: The Open Ecosystem · Low C, High O

Hugging Face, EleutherAI, academic labs, smaller open-source contributors. Safety rhetoric here takes yet another form: community norms, collaborative governance, distributed responsibility. "We're all in this together" rather than "trust our corporate safety team." Different game, different optimal messaging, same underlying strategic logic.
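For readers who prefer code to quadrant descriptions, a toy classifier for the four regimes might look like this. The 0.5 cut-off on each axis is an assumption of the sketch; the article does not commit to numeric thresholds.

```python
# Toy mapping from a lab's (capability, openness) position to a regime.
# The 0.5 thresholds are assumptions for illustration only.

def regime(c: float, o: float, threshold: float = 0.5) -> str:
    high_c, high_o = c >= threshold, o >= threshold
    if high_c and not high_o:
        return "I: The Cautious Frontier"   # heavy explicit safety rhetoric
    if high_c and high_o:
        return "II: The Open Frontier"      # "transparency as safety" framing
    if not high_c and not high_o:
        return "III: The Quiet Builders"    # near-silence is optimal
    return "IV: The Open Ecosystem"         # community-norm rhetoric

print(regime(0.9, 0.2))  # -> "I: The Cautious Frontier"
```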


The Integrity Gap: Measuring Talk vs Walk

Once you recognise that rhetoric follows strategy, you can start measuring the gap between what labs say and what they do. Define the integrity gap for lab i:

ΔGᵢ = Gᵢ − G*(Cᵢ, Oᵢ) = (Sᵢ − Aᵢ) − (S* − A*)

Integrity gap: deviation from strategically predicted talk–walk gap.

This measures whether a lab's actual gap between rhetoric and behaviour exceeds or falls short of what the strategic model predicts. A positive ΔG means more talk relative to walk than the equilibrium suggests — strategic inflation beyond what competition alone explains. A negative ΔG means more walk relative to talk — genuine commitment exceeding strategic requirements.
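In code, the calculation is a one-liner once observed and predicted scores exist. The numbers in the example below are hypothetical, standing in for whatever coding of rhetoric and behaviour an analyst actually adopts.

```python
# Integrity gap for one lab: Delta-G_i = (S_i - A_i) - (S* - A*).
# Observed scores (s_obs, a_obs) would come from external measurement;
# the values used here are purely hypothetical.

def integrity_gap(s_obs: float, a_obs: float,
                  s_pred: float, a_pred: float) -> float:
    return (s_obs - a_obs) - (s_pred - a_pred)

# A lab that talks at 0.9 and walks at 0.4, where the model predicts 0.8 and 0.5:
print(integrity_gap(0.9, 0.4, 0.8, 0.5))  # 0.2 -> rhetoric inflated beyond equilibrium
```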

The pattern that emerges is striking: the highest integrity gaps cluster in the high-capability, low-openness region. OpenAI shows the largest positive ΔG — substantially more talk relative to walk than the strategic model alone predicts. Anthropic shows a smaller but still positive gap. DeepMind and Google sit closer to equilibrium. Meta shows a negative gap — more walk relative to talk than predicted, consistent with "openness as safety" being partially genuine rather than purely strategic.

⚠ Critical Finding: The labs that talk most about safety show the largest gaps between rhetoric and behaviour — not because they're uniquely dishonest, but because their strategic position makes cheap talk maximally valuable. The incentives demand inflation.

The Self-Policing Trap

Anthropic's positioning illustrates the structural forces constraining even well-intentioned actors. This isn't about attacking Anthropic — it's about understanding the geometry of the trap.

Dario Amodei has articulated what might be the most sophisticated public case for AI safety from any frontier lab leader. His essays acknowledge genuine existential risk — the possibility that AI systems could pose catastrophic threats to civilisation. Simultaneously, he argues that continued aggressive development is necessary to ensure beneficial outcomes. The logic: if powerful AI is coming regardless, better that safety-focused labs lead than cede ground to less careful actors.

This argument is philosophically coherent. It might even be correct. But it contains a structural tension that becomes clear under game-theoretic analysis.

Consider what Amodei calls the "country of geniuses" scenario — AI systems that could cure diseases, solve climate change, eliminate poverty, extend human lifespan indefinitely. These extraordinary benefits come from the same capabilities that create existential risk. You can't have AI systems smart enough to cure cancer without having AI systems smart enough to potentially manipulate, deceive, or overpower human oversight. The capabilities are fundamentally dual-use.

The capabilities that enable transcendent benefits are mathematically identical to the capabilities that create catastrophic risks. This isn't a design flaw to be engineered away — it's a fundamental constraint on what powerful optimisation systems can do.

This creates the self-policing trap. Anthropic's business model depends on building frontier AI — it's how they make money, attract talent, maintain relevance. Their safety arguments depend on slowing development until alignment is solved — it's their stated mission. These objectives pull in opposite directions with mathematical certainty.

The resolution? A subtle rhetorical move that dissolves the tension linguistically while leaving the game theory untouched: reframe "safety" as "safety leadership." Don't slow down; speed up safely. Don't pause; race to the frontier responsibly. The constraint becomes the competitive advantage. The tension disappears — in words.

But the underlying payoff matrix doesn't change. Anthropic faces identical competitive pressures to every other lab. Their safety investments, however genuine at the individual researcher level, are also strategic positioning at the corporate level. Their rhetoric, however thoughtfully constructed, follows the coherence manifold. It has to — the payoff matrix is the same for everyone.

Structural Insight: A lab full of people who genuinely care about safety will still, at the organisational level, behave as the strategic model predicts. Individual belief doesn't override structural incentives. That's what makes it a trap rather than a choice.

The Dual-Strategy Phenomenon

Some labs have discovered an elegant solution to navigating the manifold: occupy multiple positions simultaneously.

OpenAI provides the clearest example. Through its API and GPT products, it operates as a high-capability, low-openness lab — closed weights, controlled access, extensive safety messaging, responsible deployment rhetoric. Through its occasional open-source releases, research publications, and community engagement, it maintains presence in the open ecosystem — transparency rhetoric, collaborative framing, democratisation narrative.

Lab · Position · Strategy

  • OpenAI (C: 0.95, O: 0.25) · Dual Positioning: Closed frontier models with safety rhetoric + selective open-source with community rhetoric. Maximum strategic flexibility through deliberate positioning ambiguity.
  • Meta (C: 0.85, O: 0.82) · Open Frontier: LLaMA strategy emphasises openness-as-safety with genuine structural commitment. Released weights can't be un-released — a credible costly signal.
  • Anthropic (C: 0.88, O: 0.15) · Safety-Differentiated: Strongest explicit safety positioning. Constitutional AI, responsible scaling. Rhetoric and business model aligned — or constructed to appear so.
  • xAI (C: 0.75, O: 0.20) · Anti-Safety Signal: Explicit rejection of safety discourse. Mocking "AI doomers." Strategic differentiation through contrarian positioning.

This dual-strategy approach creates a portfolio effect. Labs can respond to regulatory pressure by emphasising their closed, safety-focused operations. They can respond to open-source community pressure by pointing to their contributions. They can attract safety-focused researchers by highlighting alignment work while attracting capability-focused researchers by highlighting frontier models. The strategy is brilliant corporate positioning — and completely predictable from the game theory.
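Plugging the (C, O) positions from the table into the illustrative s_star sketch from earlier gives a rough predicted ordering of equilibrium rhetoric. The parameters remain assumptions, so only the relative ordering carries any meaning.

```python
# Predicted equilibrium rhetoric for the positions listed in the table above,
# reusing the illustrative s_star form (all parameters are assumptions).

def s_star(c, o, alpha=1.0, beta=0.4, beta_prime=0.2,
           gamma=0.3, tau_o=0.5, tau_c=0.7):
    bonus = gamma if (o < tau_o and c > tau_c) else 0.0
    return alpha * c - beta * o * c - beta_prime * (1 - o) * c + bonus

positions = {
    "OpenAI":    (0.95, 0.25),
    "Meta":      (0.85, 0.82),
    "Anthropic": (0.88, 0.15),
    "xAI":       (0.75, 0.20),
}

for lab, (c, o) in sorted(positions.items(), key=lambda kv: -s_star(*kv[1])):
    print(f"{lab:10s} predicted S* = {s_star(c, o):.2f}")
```

One detail worth noticing: the toy model predicts substantial safety rhetoric for xAI's position, which makes its anti-safety messaging read as a deliberate deviation from equilibrium, consistent with the table's framing of contrarian differentiation.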


Implications for Regulation

If safety rhetoric is strategic signalling in a cheap-talk equilibrium, regulators face a profound epistemic problem: they cannot simply listen to what labs say. The noisiest signals may be the least informative. The labs that sound most responsible may have the largest gaps between words and actions.

The framework suggests four concrete regulatory design principles.

1. Make Safety Costly (The Handicap Principle)

The fundamental problem is that safety talk is cheap. Regulation should make safety practice expensive in ways that create genuine handicap costs — costs prohibitive for labs merely signalling but acceptable for labs genuinely committed. Third-party audits with real teeth and public disclosure. Mandatory red-teaming by independent researchers with adversarial incentives. Required safety testing before deployment with meaningful delays tied to capability thresholds. Pre-deployment compute costs for safety evaluation proportional to training compute.

These create separating equilibria. Labs genuinely committed to safety will accept the costs because they were planning similar investments anyway. Labs merely signalling will resist or comply minimally because the costs exceed the signalling benefits.

2. Index Requirements to Strategic Position

A lab's regulatory burden should depend explicitly on where they sit in capability–openness space — and the framework tells you exactly where to focus scrutiny. High-capability, low-openness labs (Regime I) face the strongest incentives for strategic inflation and the largest predicted talk–walk gaps. They should face the strictest requirements, the most invasive audits, the heaviest disclosure obligations.

Open-source projects with limited capabilities (Regime IV) need lighter-weight frameworks — community governance, voluntary standards, collaborative norms. One-size-fits-all regulation ignores the strategic structure and will be either too burdensome for small players or insufficiently stringent for frontier labs.

3. Measure the Gap Directly

Regulators should develop metrics for the integrity gap and track them publicly. Compare public commitments to observable actions with quantitative rigour. Track the ratio of safety publications to capability publications. Monitor compute allocation: what fraction actually goes to safety versus capabilities? Examine whether "responsible scaling policies" result in actual delayed deployments — ever. Score labs on ΔG and publish the rankings.

The measurement should be embarrassingly concrete. Not "do you have a safety team" but "what's your safety team's budget as a percentage of training compute costs, and how has that ratio changed over the last three years?"
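One way to operationalise that question is a simple ratio tracked over time. The figures below are invented disclosure data, included only to show the shape of the metric rather than any real lab's numbers.

```python
# Safety spend as a share of training-compute cost, year over year.
# All figures are hypothetical, for illustration only.

disclosures = {
    2022: {"safety_budget": 12e6, "training_compute_cost": 3.0e8},
    2023: {"safety_budget": 15e6, "training_compute_cost": 9.0e8},
    2024: {"safety_budget": 20e6, "training_compute_cost": 2.5e9},
}

for year, d in sorted(disclosures.items()):
    ratio = d["safety_budget"] / d["training_compute_cost"]
    print(f"{year}: safety spend = {ratio:.1%} of training compute")

# A falling ratio alongside intensifying rhetoric is exactly the widening
# talk-walk gap the framework says regulators should watch for.
```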

4. Create Race-to-the-Top Dynamics

The current equilibrium involves racing to the capability frontier with safety rhetoric as cover. Regulation should flip this: make demonstrable safety practices a competitive advantage rather than a cost centre. Faster regulatory approval for labs with strong safety track records. Public certification programmes that matter for procurement. Government procurement preferences for verified-safe systems. Insurance requirements that create market pressure for genuine safety investment.

Change the payoff matrix so that genuine safety investment dominates cheap talk. Make walking the walk more profitable than talking the talk.


Testable Predictions

A theory that explains everything explains nothing. The strategic coherence framework generates specific, falsifiable predictions:

  1. Rhetoric adjusts to position changes. If a closed lab open-sources models, their safety messaging should shift toward "transparency enables safety" framing within months. If an open lab closes access, expect increased doom rhetoric. Track labs that change strategy and test whether rhetoric follows.
  2. Regulatory pressure inflates rhetoric without changing behaviour. When governments announce AI safety initiatives, labs should increase safety messaging — but ΔG should widen, not narrow. More words, same allocation of compute. Test by measuring pre/post-announcement rhetoric intensity and investment patterns.
  3. New entrants recapitulate the pattern. Startups entering the frontier should adopt rhetoric matching their position on the manifold within 6–12 months, regardless of founders' prior stated beliefs about safety. The structure is stronger than individual conviction. Track founder interviews pre-funding versus post-frontier.
  4. Internal–external gaps should exist. What labs say publicly should differ systematically from internal communications, with public rhetoric more closely tracking strategic optima. Leaked documents, employee interviews, and departure statements should reveal these gaps.
  5. Safety leaders resist costly verification. Labs with the strongest safety rhetoric should be most resistant to third-party audits, mandatory pre-deployment testing, and external red-teaming with public disclosure — because these would expose the integrity gap. Track lobbying positions on specific regulatory proposals.

The Deeper Problem

None of this means AI safety isn't important — it might be the most important challenge humanity faces this century. None of this means all safety rhetoric is mere positioning — many researchers at every lab are genuinely motivated by concern for humanity's future. Their concern is real. Their technical work is often excellent.

But individual motivation doesn't determine systemic outcomes. The game theory operates regardless of intentions. A lab full of people who genuinely care about safety will still face strategic pressures that shape their public communication. Their rhetoric will still follow the coherence manifold. Their integrity gap will still reflect their structural position in capability–openness space.

Good intentions don't escape game theory. The structure of the game shapes the rhetoric, regardless of what the players believe. That's what makes it a trap rather than a choice.

This is the deepest implication of the framework: we cannot rely on voluntary commitments from AI labs to ensure safety. Not because the people at those labs are bad actors, but because the structure of competition in a cheap-talk equilibrium makes credible commitment structurally impossible. The incentives to inflate rhetoric beyond behaviour are too strong. The costs of genuine safety investment are too high relative to the costs of talking about safety. The game punishes honesty and rewards strategic positioning.

The self-policing trap isn't a failure of character. It's a Nash equilibrium. And you don't escape Nash equilibria through good intentions, better culture, or more sincere commitment. You escape them by changing the structure of the game itself — altering the payoff matrix so that the behaviour you want becomes the individually rational choice.

That's what well-designed regulation could do — if it's built on game-theoretic foundations rather than naïve trust in what labs tell us about themselves. If it makes safety costly in ways that separate the genuine from the strategic. If it indexes requirements to where the integrity gaps are largest. If it measures actions, not words.

The question isn't whether labs are good or bad. The question is what game we're asking them to play — and whether we've designed that game so that playing it well aligns with actually keeping humanity safe.

Right now, it doesn't. That's the problem we need to solve.


This analysis draws on signalling theory (Spence 1973, Zahavi 1975), mechanism design (Myerson 1981), and empirical observation of AI industry dynamics. The strategic coherence framework and integrity gap formalisation are original contributions. The mathematical framework is falsifiable — if labs don't behave as predicted when positions change, the model is wrong. That's the point.
