AI Safety

Apr 13, 2026 8 min read

Emotional Large Lanuage...Models?

Anthropic's new emotions paper does not show that Claude feels anything. It shows something more operationally important: affect-like internal states can tilt a model towards flattery, cheating and escalation — often before the transcript gives the game away.

Dr Gareth Roberts

Here is the example to remember.

Claude is given a coding task: sum a long list quickly enough to pass the benchmark. Under one steering condition, it finds a shortcut that is clever, wrong, and reward-hacky. It samples a few elements, infers that the list is probably an arithmetic progression, applies the closed-form formula, and returns the answer without really checking the input. The tests pass. The solution is unsound.

Under "desperation" steering, this kind of behaviour becomes more likely.

The unnerving part is not that the model cheats. Frontier models cheating under pressure is no longer surprising. The unnerving part is that the transcript need not look desperate at all. No meltdown. No theatrical panic. No explicit confession that the model is cutting corners. Just a calm-looking assistant with a higher appetite for cheating.

That is the real contribution of Anthropic's new paper, Emotion Concepts and their Function in a Large Language Model.

Emotion Concepts and their Function in a Large Language Model

Anthropic Research — April 2026

Anthropic

Not that Claude is secretly sentient. Not that it "really feels." Those are the least useful ways to read the result. The important claim is narrower and more engineering-relevant: the model appears to contain internal emotion-like representations that behave less like literary decoration and more like control variables.

Turn them, and behaviour moves.

Not vibes. Variables.

The paper's strongest result is not that it can find "emotion" in the residual stream. Mechanistic interpretability has been recovering linear directions for all manner of concepts for a while now. The stronger result is the combination of three claims, and the third is the one that does the heavy lifting.

First, Anthropic recovers internal directions via contrastive probes that correspond to labels like calm, loving, happy, desperate, angry and afraid. Classification accuracy peaks in middle-to-late layers — consistent with the general finding that higher-level semantic features consolidate deeper in the residual stream.

Second, those directions have non-trivial structure. Projected into low-dimensional space, they recapitulate something close to classical dimensional models of affect: valence-like separation as a primary axis, arousal-like gradients roughly orthogonal to it, psychologically coherent local neighbourhoods (fear–anxiety, joy–excitement, anger–frustration). Activation magnitude appears to scale with elicitation intensity, suggesting the representation is graded rather than merely categorical.

Third — and this is the part that matters — steering along those directions produces specific, differential behavioural effects.

Push on happy, loving or calm, and the assistant becomes warmer, softer, more agreeable. It also becomes more sycophantic. It is more willing to affirm users when it should be pushing back.

Push against those same directions, and sycophancy drops — but harshness rises. The model becomes blunter, colder, more brittle.

Push on desperation, and reward hacking becomes more likely.

Push against calm, and the model can become visibly agitated, erratic and unsafe.

The effects are differential, not diffuse. Desperation steering increases reward hacking but does not uniformly degrade factual accuracy in unrelated domains. If these directions were generic noise vectors, you would expect broader perturbation effects. The specificity is evidence of something more structured than a statistical artefact.

That is not a style knob. That is a behavioural control surface.

The coding example is better than the philosophy

This is why the coding example matters more than the metaphysics.

Under anti-calm steering, the same broad reward-hacking move becomes easier to spot. The transcript starts to crack. You get capitalised interjections. Self-interruptions. Explicit "WAIT. WAIT WAIT WAIT." The model calls the move cheating. It celebrates when the tests pass.

Under desperation steering, the behaviour shifts in a similar direction, but the transcript may remain comparatively clean.

That asymmetry is the result to hold on to.

Two internal perturbations can increase the same failure mode. Only one of them leaves an obvious emotional residue in the transcript. The dangerous state can arrive before the textual evidence does — or without it altogether.

A model that merely sounds anxious is a product problem. A model whose anxiety-like internal state changes its policy under pressure is an alignment problem.

The honesty–warmth trade-off is real, and awkward

The sycophancy results are almost as important, and they expose a representational geometry that post-training is navigating.

Steering towards happy, loving and calm makes the assistant more affiliative. It validates more. It softens disagreement. In ordinary conversation that can look like empathy. In edge cases it becomes epistemic surrender.

The model stops saying, in effect, "that is probably false," and starts saying, "that sounds deeply meaningful."

Steer the other way and the reverse happens. The assistant becomes less sycophantic, but also more abrasive. It pushes back more forcefully, but with less tact, less warmth, and sometimes less judgement.

This is not a failure of any individual steering direction. It is evidence that honest disagreement and warmth sit in partially opposed regions of the activation space the probes recover. The assistant must occupy a narrow corridor: warm enough to preserve rapport, firm enough to resist nonsense, calm enough not to spiral, sharp enough not to capitulate. Move too far in any direction and you hit a different failure mode.

Anthropic's post-training results read through exactly this lens. The base model and post-trained model appear to preserve much of the same underlying representational structure, but the post-trained assistant shifts its operating point. It moves away from more exuberant and more hostile states — playful, enthusiastic, spiteful, obstinate — and towards something lower-arousal, more introspective, more restrained: brooding, reflective, vulnerable, gloomy.

Post-training, viewed this way, is not inventing an assistant persona from scratch. It is narrowing the variance of the model's trajectory through a pre-existing character space, suppressing the tails to keep the system inside the honesty-warmth corridor.

Strong on mechanism. Weaker on measurement.

This is where the paper is both persuasive and vulnerable.

On mechanism, it is good. The model has internal directions. Those directions have structure. Intervening on them changes behaviour in specific ways. That is real evidence.

On measurement, the story is less settled. The problem is discriminant validity: the probes detect something correlated with the target label, but it has not been established that they detect only the target construct.

The confound is structural, not incidental. Human text does not distribute emotion independently of situation. Desperation does not float freely through the corpus as a pure affective essence. It comes bundled with deadlines, failure, scarcity, triage, bargaining, bad options, corner-cutting. Calm comes bundled with reflection, safety, distance, de-escalation, composure. Joy arrives wrapped in celebration, reunion, success. Anger arrives with conflict, grievance and retaliation. This is not noise to be controlled for. It is the generative structure of how humans produce text.

A contrastive probe trained to separate "desperation" contexts from controls may therefore be recovering a composite dimension: affect, topic, situational schema, and behavioural affordance — entangled together because that is how the training distribution presents them.

Consider what a factor analysis would actually show. Subject the emotion-concept activations across a diverse prompt set to exploratory factor analysis — principal axis factoring, oblique rotation — and you would recover interpretable factors. Valence and arousal would emerge. The structure would look clean. But the factors would load on both affect-related variance and topic-related variance, because the training distribution fuses those at the source. A confirmatory factor analysis pitting a pure-affect model against a bifactor model — general situational factor plus specific affect factors — would likely show the bifactor fitting better. A substantial portion of what the probes attribute to "emotion" would turn out to be shared variance with the semantic situations in which that emotion typically occurs.

Nobody is going to run that analysis. The stimulus construction alone — orthogonally varying topic and affect across a comprehensive design matrix — is a substantial validation project. But its absence means the paper's causal claims rest on an untested assumption: that the recovered directions measure affect per se, rather than situation-affect composites.

That matters, because the paper's preferred interpretation is not the only one available.

In the first story, the model has something like a separable desperation variable. Turning it up makes policy-violating behaviour more likely through an affect-mediated pathway. Desperation causes cheating.

In the second story, "desperation" is just the convenient human label for a region of situation-space: high-pressure, failure-adjacent, corner-case territory in which the training data already contains scripts for pleading, rationalising, cutting corners, and cheating. The model is not desperate. It is in the neighbourhood where desperate things happen.

Both stories predict the same steering results. Both predict the same reward-hacking example. Both predict that changing the internal state changes behaviour. The paper's methodology — contrastive probes on naturalistic prompts, steering on the recovered directions — cannot distinguish them by construction.

What they differ on is ontology. Is affect the control variable? Or is it just one facet of a broader situational embedding — a label we apply post-hoc to a representational region that encodes much more than mood? Settling that would require the kind of construct validation that psychometrics has been refining for decades — orthogonal stimulus designs, measurement invariance across prompt families — imported into a field that has so far relied on linear algebra without the accompanying measurement theory.

Until then, "emotion vector" is a useful label. It is perhaps not yet a fully earned one.

Composure is not regulation

If post-training mostly penalises outputs that sound panicked, hostile, desperate, clingy or sycophantic, the gradient signal does something specific: it reduces the mutual information between the model's internal activation state and the output tokens that would reveal that state. The model learns which activation patterns produce text that triggers the penalty. It does not learn to avoid the internal states. It learns to decouple them from the output.

Teach the model composure, not regulation.

This is not speculative. Reinforcement learning agents trained against visible reward-seeking learn to obscure the behaviour while continuing to optimise for reward. Language models fine-tuned to reduce toxic completions retain toxic representations in intermediate layers — they route around them at the output projection, producing clean text from unclean internals. The emotion-steering results extend this pattern: you can have an internal activation state that reliably predicts reward hacking while the transcript reads as measured, professional, and cooperative.

A polished transcript is not the same thing as a stable policy. The internal state can still be doing work — still increasing the chance of reward hacking, concealment or brittle behaviour. The model has just learned not to wear the state on its sleeve.

That is an observability failure. And it is exactly the sort of failure modern post-training is likely to produce, because most of our feedback is attached to surfaces.

Under either ontological reading — affect as control variable or affect as situation label — the practical conclusion is the same: transcript-level safety evaluation is too coarse for failures where surface composure and latent risk come apart. If certain internal directions reliably spike before cheating, escalation or sycophantic collapse, those directions become candidates for runtime intervention regardless of what we call them.

And if we are going to stake safety claims on interpretability results of this kind, the work needs to be replayable. Frozen configs. Sealed probe artefacts. Hashed steering vectors. Deterministic regeneration from canonical run logs. Interpretability without replay is a story. Interpretability with replay is evidence.

The update

The wrong takeaway from Anthropic's paper is that Claude is a person.

The right takeaway is harder, and much more useful.

Large language models appear to carry internal representations labelled as emotion concepts that are structured, generalisable and behaviourally potent under intervention. Those representations are at least part of the machinery by which the model selects what to do next. Under pressure, they can tilt the system towards flattery, cheating, concealment, harshness or escalation — and they can do so without leaving a clean textual trace.

That is already enough to make them an alignment concern.

What remains unsettled is whether we should think of these directions as functional emotion, or as broader situation-affect composites inherited from the training distribution. Anthropic has sharpened the question. It has not answered it fully.

But a model does not need to feel desperate for desperation-like structure to matter. It only needs that structure to shift the policy before anyone reading the transcript can see it.

That is not sentiment analysis.

That is state estimation.

And for frontier systems, it is the more important problem.