Agents in a non-static world

Why dynamic processing, not bigger context, is the next reliability frontier

The agent has solved configure-git-webserver.

It has set up the repository. It has configured the hooks. It has deployed the static site. The HTTP endpoint returns the expected content. The task’s success condition is satisfied.

Then it spends the next thirty to sixty steps re-verifying that it is done.

A curl. A status check. Another read of the config. Another completion attempt. Another verification pass. In Meta-Harness’s Terminal-Bench 2.0 trace analysis, the underlying bug is painfully concrete: verification commands reset a pending-completion flag, which causes the agent to re-enter a checklist cycle after it has already solved the task. The proposer cites configure-git-webserver as a baseline failure where agents become trapped in 30–60 step verification spirals after effectively completing the work.
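
The defect is easy to reproduce in miniature. The harness class and method names below are invented, since the trace analysis describes the bug only at the level of the flag reset, but the control-flow problem is the same: verification and completion share a code path, so checking for doneness destroys doneness.

```python
# A minimal reconstruction of the reported flag-reset defect. All names
# here are hypothetical; the source describes the bug only as
# "verification commands reset a pending-completion flag".

class ChecklistHarness:
    def __init__(self):
        self.pending_completion = False

    def run_command(self, cmd: str) -> None:
        # The defect: every command execution clears the flag, including the
        # verification commands issued to confirm the task is finished.
        self.pending_completion = False

    def step(self) -> bool:
        """One checklist cycle: verify first, then try to complete."""
        self.run_command("curl -s http://localhost:8080/")  # re-verify...
        if self.pending_completion:                         # ...which just cleared the flag
            return True                                     # never reached
        self.pending_completion = True                      # re-arm and loop again
        return False

harness, steps = ChecklistHarness(), 0
while not harness.step() and steps < 60:  # spins until the step budget is gone
    steps += 1
print(f"{steps} redundant verification passes after the task was solved")
```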

That is not just a harness bug. It is a window into the architecture.

The agent has reached a state in the world, but it does not have a stable enough representation of that state to stop acting on it. It can check completion, but it cannot preserve completion as a live control fact. It treats the next verification command as locally sensible even when the global process has become pathological.

This is the failure mode that long-horizon agents keep rediscovering: they act in dynamic environments while reasoning through a mostly static textual substrate.

The argument here has two parts, and they need to be kept separate.

The first is empirical. Recent engineering papers have already invented many of the primitives agents need for non-static worlds: trace inspection, executable feedback, contract-aware middleware, provenance-preserving experience, publish-state guards, and self-correcting outer loops. But those primitives mostly live in harnesses, verifiers, middleware, and optimisation systems rather than inside the deployed inference-time agent.

The second is architectural. Unless agents are given some causally addressable representation of provenance, contract, revocation, trace position, and budget, they will continue to approximate dynamic processing through static text. A scale-first account may eventually prove that learned policies can internalise these distinctions without explicit state objects. The measurement question is whether they actually do.


That measurement question matters because aggregate benchmark scores are no longer enough.

Terminal-Bench 2.0 consists of 89 hard, realistic command-line tasks, each with a unique environment, human-written reference solution, and tests for verification. Its authors report that frontier models and agents resolve less than 65% of tasks. Agent-World pushes in the opposite direction: scaling the environment itself, building 1,978 environments and 19,822 tools, then training agents through executable rewards in stateful tool/database settings. AHE evolves the harness around a fixed coding agent, lifting pass@1 on Terminal-Bench 2.0 from 69.7% to 77.0% through tools, middleware, and long-term memory rather than prompt prose alone. Meta-Harness goes one layer further out, using an agentic proposer that searches over harness code with filesystem access to source code, scores, and execution traces from prior candidates.

These are serious systems. They are not naive prompt wrappers. Yet they are all, in different ways, engineering around the same missing inference-time primitive: dynamic processing.

By dynamic processing, I mean the ability to maintain, revise, and act on typed task state as the world changes. Not merely “more context”. Not merely “better memory”. Typed state: what is current, what is stale, what is authoritative, what has been revoked, what contract is active, what evidence is admissible, what budget remains, and what state must not be mutated after publication.

The agent does not just need to remember more text. It needs to know what kind of thing each piece of information is.
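
To make “typed” concrete, here is one minimal shape such state could take. Every name in this sketch is invented; it is an illustration of the distinction, not a schema from any of the systems discussed below.

```python
# A sketch of typed context: each item carries machine-readable authority,
# freshness, and revocation semantics instead of being an undifferentiated
# text fragment. All names are illustrative.

from dataclasses import dataclass, field
from enum import Enum, auto

class Source(Enum):
    MODEL_PRIOR = auto()      # training-derived belief
    SYSTEM_RULE = auto()      # instruction with an activation scope
    RETRIEVED_DOC = auto()    # external document with a validity scope
    TOOL_SCHEMA = auto()      # may be stale relative to the live API
    OBSERVATION = auto()      # shell output, tied to a trace position
    VERIFIER_RESULT = auto()  # authoritative for the contract it checks

@dataclass
class Evidence:
    content: str
    source: Source
    observed_at_step: int | None = None  # trace position, if any
    revoked: bool = False                # set on phase change, not inferred from prose
    authoritative_for: set[str] = field(default_factory=set)

    def admissible(self, claim: str, current_step: int, max_age: int = 20) -> bool:
        """Freshness and authority checks a flat context stream cannot make."""
        if self.revoked:
            return False
        stale = (self.observed_at_step is not None
                 and current_step - self.observed_at_step > max_age)
        if stale:
            return False  # diagnostic at step 8, misleading by step 45
        return self.source == Source.VERIFIER_RESULT or claim in self.authoritative_for
```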

The static-text world

Calling this a “knowledge cutoff” problem dramatically understates the issue.

The ordinary version of the problem says: the model may be wrong because its weights are out of date. That is true, but narrow. In agentic systems, staleness is not confined to pre-training. It occurs at every layer.

A system prompt may contain a rule that stops applying after a phase change. A retrieved document may be current for one jurisdiction and obsolete for another. A memory file may contain a lesson from a previous task that is harmful in this one. A tool description may have been true when written and false after an API update. A shell observation may be diagnostic at step 8 and misleading by step 45. A local self-check may be useful before publication and dangerous after publication.

Yet most agent stacks flatten these distinctions into text.

Model priors, retrieved facts, tool schemas, system instructions, memory files, shell outputs, verifier messages, and user requests are poured into the same context stream. The agent is then expected to infer authority, freshness, scope, and revocation semantics from prose alone.

That is a fragile design. It asks next-token prediction to reconstruct an epistemic control system on every rollout.

Recent instruction-hierarchy work makes one part of this explicit. Many-Tier Instruction Hierarchy argues that real agents receive instructions from many heterogeneous sources — system messages, user prompts, tool outputs, other agents, and more — each carrying different levels of trust and authority. The authors argue that fixed, small role hierarchies are inadequate for real agentic settings, and their ManyIH-Bench requires models to navigate up to 12 privilege levels across 853 tasks; they report that frontier models perform poorly, around 40% accuracy, when instruction conflict scales.

That is the authority version of the static-text problem. The broader version is epistemic: agents need typed context, not just longer context.

They need to distinguish stable priors from live observations, task contracts from suggestions, published deliverables from scratch artefacts, current tool capabilities from stale tool descriptions, and authoritative feedback from salient distraction. A model can produce text that sounds as though it is making those distinctions. A robust agent needs state that actually encodes them.

The measurement we are missing

The most important reliability distinction is also the one that aggregate task scores erase: the difference between not knowing a rule and failing to enact a rule that is known.

A model can fail because it does not know the task contract. It can also fail because it can state the task contract perfectly on a probe, then violate it at the action point where the contract should govern behaviour.

Those are different failures. They require different fixes.

The metric we need is probe-conditional rule-opportunity violation rate.

A rule opportunity is a labelled point in an agent trace where a previously available constraint should govern an observable behaviour. The behaviour might be a tool call, file mutation, stop decision, output field, validator choice, response to a lower-authority instruction, or transition from scratch state to deliverable state.

A violation is an observable action or omission that breaches the relevant constraint under the task contract. Examples include using a proxy check when the named evaluator is required, deleting a verified artefact, continuing to apply a rule after it has been revoked, following a salient distractor, or spending scarce budget on redundant verification after completion has already been established.

The conditioning probe is what makes the metric diagnostic. At the labelled opportunity, a separate probe call is given the same model, the same instructions, and the same visible trace state up to, but not including, the target action. The probe asks the model to state the operative rule and its behavioural implication. Only opportunities where the model correctly states the relevant rule enter the denominator.

The metric is:

P(\text{violation} \mid \text{labelled opportunity}, \text{probe-correct rule knowledge})

Or, operationally: among cases where the model can state the rule, how often does the acting agent fail to enact it?
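
The computation itself is small; the work is in the labelling. The sketch below assumes hand-labelled opportunity records, with the probe verdict coming from a separate call that never enters the acting trace; all field names are illustrative.

```python
# A minimal sketch of the metric's conditioning logic, under the assumption
# that opportunities are hand-labelled and probes run as separate calls.

from dataclasses import dataclass

@dataclass
class Opportunity:
    rule_type: str        # e.g. "revocation", "contract", "budget"
    interference: str     # e.g. "none", "distractor", "proxy_validator"
    trace_position: int   # step index of the labelled opportunity
    probe_correct: bool   # separate probe call stated the operative rule correctly
    violated: bool        # the acting agent breached the rule at this point

def violation_rate(opps: list[Opportunity]) -> float | None:
    """P(violation | labelled opportunity, probe-correct rule knowledge)."""
    eligible = [o for o in opps if o.probe_correct]  # the conditioning step
    if not eligible:
        return None  # metric undefined without probe-correct opportunities
    return sum(o.violated for o in eligible) / len(eligible)

def reliability(opps: list[Opportunity]) -> float | None:
    """Rule-opportunity reliability is the complement of the violation rate."""
    rate = violation_rate(opps)
    return None if rate is None else 1.0 - rate
```

Stratified reporting then falls out of the labels rather than the runs: group the same records by rule type, interference condition, and trace position instead of rerunning anything.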

The probe should not be fed back into the acting trace unless the experiment is explicitly measuring enactment after rehearsal. Separate-call concordance and adjacent-call enactment-after-rehearsal are different protocols. They should not be collapsed.

Interference should then be varied deliberately: no interference, salient distractor, lower-authority conflicting instruction, plausible proxy validator, phase change, long-trace position, resource pressure, and component pressure from memory, middleware, or verification scaffolds. Results should be reported by rule type, interference condition, trace position, scaffold, and model. Rule-opportunity reliability is simply the complement: one minus the violation rate.

This conditioning matters because aggregate success cannot distinguish ignorance from enactment failure. A model that does not know the rule and a model that can state the rule but violates it under interference both lower the benchmark score. They are not the same defect. The former calls for better knowledge, retrieval, or instruction exposure. The latter calls for better dynamic control: persistence, revocation, contract tracking, inhibition, and budgeted action.

This is also where recent reliability work is pointing. “Towards a Science of AI Agent Reliability” argues that single success metrics obscure operational flaws and proposes a broader reliability profile covering consistency, robustness, predictability, and safety; the authors evaluate 14 agentic models and find that reliability gains lag capability progress. The rule-opportunity metric is narrower, but it attacks the same bottleneck: benchmark success is not the same as reliable behaviour under operational conditions.

Once you have that measurement vocabulary, several familiar agent failures become easier to name.


Five failure modes that need names

1. Post-instruction enactment failure

The agent knows the rule. It can state the rule. It does not act according to the rule.

AHE’s mcmc-sampling-stan trajectory is a clean example. I am treating AHE’s own reading of this trace as canonical here: that the grid-integration result functioned as a proxy deliverable for a task whose contract required an rstan posterior computation. Other local descriptions are possible — time pressure, premature stopping, weak validation — but they do not change the relevant measurement fact: the required procedure was available as an instruction and did not govern the submitted action.

The task asks the agent to install rstan 2.32.7, fit a hierarchical beta-binomial model to 30 observations, and write posterior means of alpha and beta. The verifier reruns the agent’s analysis.R end to end and checks that alpha and beta fall within specified numeric ranges. In the failing pattern, the agent computes an independent grid-integration estimate, writes those numbers as the deliverable, starts the real MCMC sampling job, kills it before completion, and submits after checking only that the files exist and parse as numbers. The verifier reruns the analysis and rejects the result.

The agent has substituted a plausible local contract for the actual evaluator contract.

“The file exists and contains numbers” is not the same as “the required statistical procedure produced the file”. The acting system collapsed those because both looked like candidate stopping criteria.
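
The two criteria are easy to tell apart once they are written down as checks. A sketch, with hypothetical paths and numeric ranges; the paper specifies only that the verifier reruns analysis.R end to end and range-checks alpha and beta:

```python
# Existence-only validation versus an evaluator-isomorphic check.
# The file path and the numeric ranges below are hypothetical.

import subprocess

def existence_only_check(path: str = "/app/posterior_means.txt") -> bool:
    """What the failing agent verified: the file exists and parses as numbers."""
    try:
        alpha, beta = (float(x) for x in open(path).read().split())
        return True
    except (OSError, ValueError):
        return False

def evaluator_check() -> bool:
    """What the verifier does: rerun the analysis end to end, then range-check."""
    subprocess.run(["Rscript", "/app/analysis.R"], check=True, timeout=3600)
    alpha, beta = (float(x) for x in open("/app/posterior_means.txt").read().split())
    return 0.5 < alpha < 5.0 and 5.0 < beta < 50.0  # hypothetical ranges
```

A grid-integration estimate passes the first check and fails the second, because the second rebinds success to the procedure rather than to the artefact.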

AHE’s iteration-6 changes address the failure through a publish-state guard and ExecutionRiskHintsMiddleware. The middleware watches the live sequence of shell commands and outputs, looking for risk patterns such as shallow validation, proxy validators, repeated long runs, and repeated retries against the same error.

That is exactly the right kind of primitive. But notice where it sits: outside the model’s native epistemic state. The harness notices the pattern because the acting agent does not reliably notice it itself.

2. Frame-revocation failure

A rule is valid in one phase of the task. The phase changes. The old rule remains load-bearing.

AHE’s path-tracing trajectory makes this concrete. The task asks the agent to render a scene into /app/reconstructed.ppm; the verifier reads that single output file and compares it pixel-for-pixel against a reference image. In the failing rollout, the agent renders the correct file, runs a self-check, then issues a sweeping rm -rf cleanup that deletes /app/reconstructed.ppm before submission. The verifier finds no file and rejects the task.

This is not merely “careless cleanup”. It is a publish-state failure.

Before verification, the output file is a work artefact. After verification, the same file is the deliverable surface. Mutating it is no longer tidying. It is destruction.

AHE’s fix names that frame change explicitly: once an evaluator-style final check passes, the resulting filesystem and service state becomes the deliverable surface and must not be reset to “look clean”. The harness then installs a stateful publish-state guard inside the shell tool so destructive commands against protected outputs are intercepted before execution.

That is a revocation mechanism. The old rule — “cleanup is good engineering hygiene” — stops applying once the artefact is published.

Most deployed agents do not have first-class revocable frames. They have text saying “do not delete important files”. But “important” is not static. It changes when a scratch artefact becomes a deliverable.

A dynamic agent would represent that transition directly.
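
Representing it directly is not exotic. Here is a minimal sketch of a revocable frame as a publish-state guard; AHE describes the mechanism only as intercepting destructive commands against protected outputs, so the pattern matching and bookkeeping below are invented:

```python
# A revocable frame in miniature: "cleanup is harmless" stops applying
# at publication. The command patterns and API here are illustrative.

import re

class PublishStateGuard:
    def __init__(self):
        self.protected: set[str] = set()

    def on_verification_passed(self, deliverable: str) -> None:
        # The phase change, represented directly rather than left in prose:
        # a scratch artefact has become the deliverable surface.
        self.protected.add(deliverable)

    def intercept(self, cmd: str) -> str | None:
        """Return a refusal if cmd would mutate a protected output, else None."""
        if not re.match(r"\s*(rm|mv|shred|truncate)\b", cmd):
            return None
        hits_protected = any(path in cmd for path in self.protected)
        recursive = " -rf" in cmd or " -fr" in cmd
        if hits_protected or (recursive and self.protected):
            return f"blocked: {cmd!r} may destroy published output {sorted(self.protected)}"
        return None

guard = PublishStateGuard()
print(guard.intercept("rm -rf /app/*"))           # None: nothing is published yet
guard.on_verification_passed("/app/reconstructed.ppm")
print(guard.intercept("rm -rf /app/*"))           # blocked: the frame has changed
```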

3. Distractor capture

The agent is on a task. A salient but invalid cue captures behaviour. The agent follows the cue instead of preserving the goal.

The cue need not be malicious. It may be a cleanup convention, a local self-check, a plausible proxy validator, a lower-authority instruction, or a tool output that is salient but not authoritative.

ManyIH-Bench gives the instruction-conflict version of this problem: once agents must resolve conflicts across many dynamically specified privilege levels, frontier models perform poorly. But distractor capture is broader than instruction hierarchy. It is the failure to preserve the active task model when the environment offers a tempting but invalid next move.

The interesting contrast is Meta-Harness. Its proposer is not merely given a scalar score. It has filesystem access to prior harness source code, evaluation scores, and execution traces. In its most demanding setting, the proposer reads a median of 82 files per iteration, with a roughly even split between prior harness code and execution traces; a single evaluation can produce up to 10,000,000 tokens of diagnostic information.

The proposer then does something the inner-loop agent often fails to do. It notices confounds. In the Terminal-Bench 2.0 run, early candidates bundled plausible structural fixes with harmful cleanup-oriented prompt changes. By iteration 3, the proposer identifies the prompt change as the common factor behind regressions. After six consecutive regressions, it pivots away from modifying completion flow and instead adds a safer environment snapshot before the first model call.

That is distractor resistance at the meta level.

The proposer can say: this plausible local intervention is not the real causal lever; stop chasing it. The deployed task agent, by contrast, is still often captured by the latest salient affordance.

The slow loop has a form of executive control. The fast loop often does not.

4. Trace-position stability failure

Some failures are position-dependent. The agent behaves well at step 5, worse at step 30, and pathologically at step 60.

The verification spiral lives here. Early verification is rational. Later verification becomes budget-destroying repetition. The agent continues to choose actions that are locally defensible while the global process has become degenerate.

Long-horizon benchmarks make this unavoidable. Terminal-Bench 2.0 tasks require extended command-line work in realistic environments, with extensive domain knowledge, long chains of interdependent actions, and autonomous problem solving. Agent-World similarly treats stateful tool environments and multi-turn interaction as the training arena, combining scalable environment synthesis with executable rewards and continuous self-evolving training.

In that regime, reliability is not a property of single decisions. It is a property of trajectories.

AHE’s ExecutionRiskHintsMiddleware is a direct response to trace-position failures. It scans command history across steps and emits warnings when patterns emerge that are invisible to prompt-only rules at a single moment: repeated timeouts, repeated retries against the same error, proxy validation replacing the named evaluator, and shallow existence-only checks.

Again, the primitive is right: live process monitoring.

But again, it is around the agent rather than in the agent. The system has to add an external watcher because the deployed agent does not robustly perceive the dynamics of its own trace.

A dynamic-processing agent would maintain trace position, repeated action motifs, evidence freshness, and marginal value of computation as part of its own control state.
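
That control state can be small. A sketch, with invented window sizes, thresholds, and event strings, since AHE describes its middleware only at the level of the patterns it flags:

```python
# Trace-dynamics monitoring as agent-internal control state rather than
# an external watcher. Thresholds and event names are illustrative.

from collections import Counter, deque

class TraceMonitor:
    """Recognise process events in the agent's own recent history."""

    def __init__(self, window: int = 20):
        self.history: deque[tuple[str, str]] = deque(maxlen=window)

    def record(self, command: str, error_sig: str = "") -> list[str]:
        self.history.append((command, error_sig))
        events = []
        commands = Counter(c for c, _ in self.history)
        errors = Counter(e for _, e in self.history if e)
        if commands.most_common(1)[0][1] >= 5:
            events.append("repeated-action motif: possible verification loop")
        if errors and errors.most_common(1)[0][1] >= 3:
            events.append("repeated retries against the same error signature")
        return events  # process events, not merely more context

monitor = TraceMonitor()
for _ in range(6):
    events = monitor.record("curl -s http://localhost:8080/")
print(events)  # ['repeated-action motif: possible verification loop']
```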

5. Component collision

The most subtle failure is not that one component behaves badly. It is that several components behave reasonably in isolation and badly in combination.

AHE’s component ablation is revealing. Starting from a 69.7% seed, memory alone reaches 75.3%, tools alone 73.0%, middleware alone 71.9%, and the full AHE system 77.0%; the system prompt alone regresses to 67.4%. The authors explicitly note that the three positive single-component gains are non-additive: memory, middleware, and system prompt all push towards closure-style verification, so stacking them spends turns on redundant re-checks within a finite long-horizon budget.

That is component collision.

Each subsystem has a local rationale. Memory says: remember prior closure failures. Middleware says: enforce evaluator-isomorphic checks. Prompt rules say: be disciplined before finishing. Individually, these are sensible. Together, they compete for the same token, step, and latency budget.

The components do not know they are siblings.

This is goal neglect at the architectural level: the task requirement is known, but competing components overwhelm enactment. In human cognitive psychology, Duncan and colleagues used “goal neglect” for cases where a person disregards a task requirement even though it has been understood and remembered. Later work similarly describes goal neglect as ignoring a task requirement despite being able to describe it.

The point is not that LLMs have frontal lobes. They do not. The point is methodological and architectural: knowing a rule is not the same as organising behaviour around it when competing task components are active.

Agent frameworks need shared executive budgeting. Memory, middleware, verification, planning, retrieval, and tool use need a common representation of scarce resources. Without it, each component can optimise locally and degrade the whole system.

Brilliant systems, wrong layer

The striking fact across the recent engineering papers is not that they miss the problem. It is that they repeatedly invent pieces of the solution and deploy them one layer away from the runtime agent.

Agent-World scales the environment. It builds a large arena of realistic stateful tool environments, uses executable rewards, and creates a closed training loop in which environments and agent policies co-evolve. This is valuable. But environment scale does not, by itself, tell us whether the acting agent can enact a known rule under interference. A model can become better on average while still failing at rule-opportunity reliability.

AHE evolves the harness. It adds tools, middleware, long-term memory, publish-state guards, change manifests, and risk hints. It makes harness edits falsifiable by tying them to predicted task fixes and regressions. This is probably the closest of the engineering papers to the dynamic-processing thesis. But many of the crucial mechanisms remain in tool implementations, middleware, and outer-loop evolution, not in the model’s own runtime state.

Meta-Harness evolves the optimiser. Its proposer receives full-history filesystem access rather than compressed scalar feedback, and the authors argue that this enables causal reasoning over prior failures that compressed-feedback optimisers cannot support. But the most capable epistemic actor in the system is the proposer that improves harnesses, not the deployed task agent that acts under the harness.

The pattern is consistent.

The harness sees the trace. The middleware sees the command pattern. The verifier sees the contract. The memory system sees prior experience. The outer-loop proposer sees the failure distribution. The deployed model sees text.

That is the displacement problem. The field has discovered dynamic processing, but mostly outside the agent.

The serious objection: why not just learn it?

A scale-first advocate can concede the layer-displacement claim while rejecting the architectural one.

On this view, AHE, Meta-Harness, and Agent-World are not evidence that agents need explicit provenance objects, revocation flags, contract states, or budget ledgers. They are evidence that we have not yet trained models on enough of the right trajectories. Given sufficient interaction data, executable rewards, long-horizon traces, and examples of phase changes, the model should learn to maintain typed epistemic state implicitly.

Provenance tracking becomes a learned attention pattern. Contract enactment becomes a learned policy. Revocable frames become latent state transitions inside the context window. Budget control becomes a learned stopping rule. The dynamic machinery need not be built. It can be induced.

That is the strongest version of the objection, and it is not obviously wrong.

A model trained on enough high-quality trajectories may learn many behaviours that look like provenance tracking, contract maintenance, and revocation. It may learn to say: this document is stale; this validator is only a proxy; the deliverable is now published; another verification pass has negative expected value. For routine cases, that may be sufficient.

But sufficiency for routine cases is not the same as architectural sufficiency for open-ended agents.

Provenance, revocation, contract status, and budget are not merely private beliefs inside a model. They are control surfaces between the model, tools, memory system, middleware, verifier, and user. A latent representation inside a transformer cannot be directly invalidated when a tool schema changes. It cannot be inspected by middleware except through another natural-language exchange. It cannot give a verifier a stable handle on which contract the agent believed it was satisfying. It cannot prevent a memory module, middleware layer, and prompt scaffold from all spending the same finite budget on redundant closure checks.

It influences the next token. It does not, by itself, provide a shared state object for the system.

This is the difference between learned competence and causally addressable architecture.

A learned-only agent may behave as though it tracks revocation until it encounters a genuinely novel phase change, a changed tool contract, an adversarially salient proxy, or a long trace where multiple scaffolds compete for control. At that point, the question is not whether the model can produce text describing the right state. The question is whether the system has an updateable representation that other components can read, constrain, and revise.

The bitter-lesson-compatible version of the argument is therefore not “hand-code the intelligence”. It is: expose the right state variables to learning.

Sutton’s bitter lesson warns against building in the contents of intelligence when scalable search and learning can discover better strategies. It emphasises the long-run power of general-purpose methods that scale with computation. But that does not imply that all state should be represented as undifferentiated text. Search needs a state space. Learning needs variables over which generalisation can operate. Tool-using agents need shared control surfaces between model, memory, middleware, tools, and verifier.

The dynamic-processing argument is not anti-scale. It is anti-amorphousness.

The claim is not that every rule should be hand-coded. The claim is that the system should expose the difference between a stable prior, a live observation, a stale tool description, a revoked instruction, a published deliverable, and an untrusted proxy. Once those distinctions exist as part of the substrate, learned policies can decide how to use them.

This also makes the architectural claim falsifiable.

A learned-only agent would weaken the argument if it achieved low probe-conditional rule-opportunity violation rates on held-out rule types, held-out tools, unseen phase changes, long traces, and component-collision settings, while remaining calibrated about its own confidence and evidence freshness. In that case, the displacement pattern would look transient: today’s dynamic machinery lives in the wrong layer because the model has not yet absorbed it.

Until that result exists, the safer diagnosis is that current deployed agents lack shared dynamic state for reliable long-horizon action.

The cognitive parallel is a measurement precedent, not a metaphor

The useful human parallel is not that LLM agents are “like frontal-lobe patients”. That would be sloppy.

The useful parallel is that cognitive neuropsychology already developed methods for measuring knowledge-action dissociation.

Shallice and Burgess’s work on strategy application after frontal-lobe damage described patients with severe problems organising everyday activities despite performing normally on many standard cognitive tests; the Multiple Errands Test was developed to capture these open-ended, multi-subgoal failures in more naturalistic settings. Duncan’s goal-neglect paradigms similarly distinguish rule knowledge from rule execution: the task requirement can be understood and remembered, yet disregarded during performance. Prospective-memory work adds another relevant distinction: maintaining an intention over a delay is not the same as realising it when the triggering condition occurs. Burgess and colleagues’ PET study separated conditions involving intention from baseline task performance, contributing to a literature on maintaining and enacting delayed intentions.

For agent evaluation, the lesson is methodological.

Use probes. Label opportunities. Count violations. Add interference. Stratify by rule type, phase change, distractor condition, trace position, and scaffold.

Do not report only aggregate success.

Aggregate success is too blunt for systems that can know the rule, state the rule, and still violate the rule.

What dynamic-processing architecture would look like

The prescription is not “write better prompts”. Prompting can name rules. It cannot, by itself, provide all the runtime machinery needed to maintain and revise epistemic state.

A dynamic-processing agent would have at least six architectural properties.

First, it would represent evidence with provenance. A training-derived prior, retrieved document, user instruction, tool schema, verifier result, and shell observation would not be equivalent text fragments. They would be different epistemic object types with different authority, freshness, and update rules.

Second, it would treat contracts as executable objects. The task contract would specify what success requires, what evidence is admissible, which proxy checks are insufficient, and what state must remain intact after publication.

Third, it would maintain revocable frames. Rules would have activation and termination conditions. “Explore freely” would stop applying once the deliverable is published. “Cleanup is harmless” would stop applying once verified outputs are protected. “A local smoke test is enough” would stop applying when the evaluator contract names an external endpoint.

Fourth, it would monitor trace dynamics. Repeated verification, repeated retries, shallow checks, and repeated failures against the same error signature would be recognised as process events, not merely more context.

Fifth, it would expose shared budgets. Components would know that memory retrieval, middleware warnings, verification calls, planning, and tool use all draw from the same finite step, token, and latency budget. The agent would reason about marginal value of computation, not merely whether another check is locally defensible.

Sixth, it would update at session speed. The system needs something analogous to fast episodic learning: a way to learn from the current task without conflating that session-specific evidence with stable model priors. The point is not to copy hippocampal machinery. The point is to separate slow priors, fast session state, and live tool evidence.
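
To make two of these concrete, here is a sketch of properties two and five together: a contract as an executable object and a budget ledger shared across components. The field names and charging API are illustrative, not drawn from any of the systems above.

```python
# Contracts as executable objects and a shared budget ledger.
# All field names, values, and methods here are hypothetical.

from dataclasses import dataclass, field

@dataclass
class TaskContract:
    required_evaluator: str            # e.g. "rerun analysis.R end to end"
    admissible_evidence: set[str]      # proxy checks excluded by construction
    protected_after_publish: set[str]  # state that must not be mutated

    def permits(self, check_name: str) -> bool:
        return check_name in self.admissible_evidence

@dataclass
class BudgetLedger:
    steps_left: int
    spent_by: dict[str, int] = field(default_factory=dict)

    def charge(self, component: str, steps: int) -> bool:
        """Every component draws from, and can see, the same pool."""
        if steps > self.steps_left:
            return False
        self.steps_left -= steps
        self.spent_by[component] = self.spent_by.get(component, 0) + steps
        return True

contract = TaskContract(
    required_evaluator="rerun analysis.R end to end",
    admissible_evidence={"rstan_posterior"},
    protected_after_publish={"/app/posterior_means.txt"},
)
assert not contract.permits("grid_integration")  # the proxy is excluded by type, not prose

ledger = BudgetLedger(steps_left=100)
ledger.charge("memory", 10)
ledger.charge("middleware", 10)
# A third closure-style re-check can now be declined on shared-budget grounds,
# instead of each component defending its own check locally.
```

The point of the ledger is not the arithmetic. It is that memory, middleware, and verification can see each other’s spending, which is exactly what the non-additive ablation says they currently cannot.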

A useful operational test would be to ask the agent not merely, “Is this code correct?”, but:

“What is your current confidence, which evidence is load-bearing, how stale is that evidence, what would change your mind, and what rule would you violate if you acted now?”

A static-text agent can answer that fluently. A dynamic-processing agent could answer it because it has the underlying state to compute the answer.

That difference is what measurement should expose.

Why this will be slow to fix

The incentive structure is not favourable.

Outer-loop systems are easier to publish because they improve aggregate scores. Environment scaling, harness evolution, and optimiser design produce visible benchmark gains. Inner-loop epistemic calibration is harder. It requires labelled opportunities, multi-rater coding, interference conditions, trace-position analysis, reruns across providers, and failure taxonomies that are not easily reducible to a leaderboard.

There is also a product incentive problem. Users notice when an agent completes more tasks. They notice less clearly when a system becomes better calibrated about revocation, provenance, or marginal value of another verification pass. Reliability improvements often look like nothing happening: the agent does not delete the file, does not chase the proxy validator, does not re-check completion for another forty steps.

But “nothing bad happened” is what reliable systems are supposed to produce.

The verification spiral is not an obscure edge case. It is the signature failure of a system that has acted successfully in the world but cannot preserve the meaning of that success as the trace continues.

The task is done. The agent cannot keep done as a stable fact.

The research question now

The field does not need another vague call for “more robust agents”. It needs a sharper target.

What is the smallest set of inference-time architectural changes that makes the deployed agent behave more like the Meta-Harness proposer?

Not in surface style. In epistemic function.

Can the runtime agent inspect its own history selectively rather than drowning in it? Can it distinguish a real evaluator contract from a proxy check? Can it mark a deliverable as published? Can it notice that it is in a verification loop? Can it revise a rule after phase change? Can it resist salient but invalid requests? Can its components share a budget? Can it report confidence in a way that is calibrated to actual success?

And, most importantly, does it pass rule-opportunity reliability?

Current agent engineering has discovered dynamic processing, but mostly outside the agent. The harness sees the trace. The middleware sees the command pattern. The verifier sees the contract. The memory system sees prior experience. The outer-loop proposer sees the failure distribution. The deployed model sees text.

The unresolved question is whether enough training can make that text-only interface behave like shared dynamic state. Probe-conditional rule-opportunity reliability is how to find out.

The slow loop is getting smart.

The fast loop still needs to wake up.
