The Context Wars

The quiet discipline that will determine whether AI tells the truth.

Sometime last year, an AI system deployed by a mid-sized insurance firm told a customer that their claim was covered under a policy that had been superseded eighteen months earlier. The system had done everything it was supposed to do. It retrieved the relevant documents. It followed its instructions. It cited its sources. And it was wrong — not because the model was stupid, but because the pipeline that assembled its reading material had handed it two versions of the same policy, one current and one defunct, and the model, diligent and literal, had drawn from the wrong one.


No one called this a context engineering failure, because almost no one was using that phrase yet. They called it a hallucination, filed a bug report, and moved on.

But it wasn't a hallucination. The model hadn't invented anything. It had been given contradictory evidence and made a choice. The failure wasn't in the model's reasoning. It was in what the model was given to reason about — in the curation of its reality.


This is the territory that a growing number of practitioners now call context engineering, and if the phrase sounds like another Silicon Valley rebrand, that reaction is understandable. The field has a branding problem. Every eighteen months, someone coins a term for the thing that was supposed to be the last term. Prompt engineering. Retrieval-augmented generation. Agentic architectures. Each phrase captures something real and then gets inflated until it means everything and therefore nothing. Context engineering risks the same fate.


But the underlying problem it names is not going away. It is, in fact, getting worse. And if we don't develop a serious discipline around it, the consequences will extend far beyond insurance claims.


The Road, Not the Steering Wheel

To understand what context engineering actually is, open a production AI system — not a chatbot, but a real one, the kind that answers questions about regulations or processes internal documentation — and look at what gets sent to the model on each request.

The user's question is a sliver. A few dozen words, maybe a hundred. Around it, assembled programmatically, is everything else: retrieved documents, conversation history, tool outputs, system-level constraints, structured data pulled from databases, metadata about sources and versions. The instruction — the part that tells the model how to behave — is typically a few hundred tokens. The context surrounding it can be thousands.
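
The proportions are easier to see in the raw material. Here is the rough shape of one assembled request; the field names and sizes are illustrative assumptions, not any particular vendor's schema:

```python
# Illustrative shape of a single assembled request. Field names and the
# size notes are rough assumptions, not a specific provider's schema.
request = {
    "system_instruction": "...",            # a few hundred tokens: how to behave
    "context": {
        "retrieved_documents": ["..."],     # thousands of tokens, selected programmatically
        "conversation_history": ["..."],
        "tool_outputs": ["..."],
        "structured_data": {"policy_version": "2024-06", "jurisdiction": "UK"},
        "source_metadata": [{"doc_id": "EXP-7", "retrieved_at": "..."}],
    },
    "user_question": "Is my claim covered?",  # a few dozen words
}
```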

Here's the image that clarifies the relationship: the instruction is the steering wheel. The context is the road. You can have the best steering in the world, but if the road is full of potholes, wrong turns, and signs pointing in contradictory directions, you're going to crash.

Prompt engineering — the craft of writing better instructions — is the discipline of building a better steering wheel. It matters. It will continue to matter. But in most production systems, the instruction is rarely the thing that determines whether the output is correct. The context is. What documents were retrieved. Whether they're current. Whether they contradict each other. Whether the critical exception clause survived the summarisation step. Whether the tool output was parsed or dumped raw.

Context engineering is the discipline of building a better road.


The Compression Problem

The deepest insight in the emerging context engineering literature — and it is an insight, not a rebranding — is that building context is fundamentally an act of compression.


The world is large. A company's policy manual runs to hundreds of pages. Its ticket history stretches back years. Its regulatory environment spans multiple jurisdictions, each with its own effective dates and supersession chains. The context window of even the most capable model is finite. You cannot fit the world in.


So you compress. You take the messy, sprawling, contradictory reality of an organisation's knowledge and you squeeze it down into a few thousand tokens — a tiny aperture through which the model must view the world and make a decision.

This isn't a metaphor. It's a constraint with mathematical teeth.

Every step in a modern retrieval pipeline is a compression operation, and each one discards information. PDFs become plain text; tables and layout cues vanish. Documents become chunks and embeddings; time, versioning, and authority disappear unless someone explicitly encodes them. A search query returns the top-k results; everything else is treated as irrelevant. Those results get reranked, deduplicated, truncated to fit the budget. By the time the model sees the context, it has passed through half a dozen lossy transforms, each one a place where critical information can quietly disappear.
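
Here is that chain in miniature: a deliberately simplified Python sketch with the loss at each stage noted in comments. The function names and the crude word-overlap scoring are illustrative stand-ins, not any particular framework's API.

```python
# A deliberately simplified retrieval pipeline. Each stage is a lossy
# transform; the comments note what gets discarded along the way.

def extract_text(pdf_pages: list[str]) -> str:
    # Layout, tables, and emphasis are flattened into plain text here.
    return "\n".join(pdf_pages)

def chunk(text: str, size: int = 300) -> list[str]:
    # Document-level metadata (version, effective date, authority) no longer
    # travels with the individual chunks unless someone encodes it explicitly.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve_top_k(chunks: list[str], query: str, k: int = 5) -> list[str]:
    # Everything outside the top k is treated as irrelevant,
    # including the clause that contradicts the top result.
    def overlap(c: str) -> int:
        return len(set(c.lower().split()) & set(query.lower().split()))
    return sorted(chunks, key=overlap, reverse=True)[:k]

def truncate_to_budget(chunks: list[str], budget_words: int = 800) -> str:
    # Whatever does not fit the budget is silently dropped from the end.
    kept, used = [], 0
    for c in chunks:
        n = len(c.split())
        if used + n > budget_words:
            break
        kept.append(c)
        used += n
    return "\n---\n".join(kept)

def build_context(pdf_pages: list[str], query: str) -> str:
    return truncate_to_budget(retrieve_top_k(chunk(extract_text(pdf_pages)), query))
```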


Consider a single sentence from an expense policy: *"Meals may be expensed up to $50 per day, except during international travel, where the limit is $120 per day."*


A summary that compresses this to *"Meals can be expensed up to $50 per day"* hasn't produced a less detailed version. It has produced a wrong version. The exception — the word *except*, followed by the condition and the different number — is the information that changes the decision. A system that drops it will give the wrong answer to every international traveller who asks.


This is what practitioners mean when they talk about *lossless anchors*: the pieces of information that must survive compression intact. Numbers. Units. Negations. Effective dates. The words *unless*, *except*, *must not*. These are the load-bearing elements of a document, and most compression pipelines have no mechanism for identifying or preserving them. They treat all tokens as equally compressible, because that's what the default tooling assumes.
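
The counter-move is mechanical rather than clever: extract the anchors from the source and check that they survive whatever compression you apply. A minimal sketch, in which the regex patterns and helper names are illustrative choices rather than an established standard:

```python
import re

# Patterns for "lossless anchors": amounts and limits, negation and
# exception words, and years. Illustrative, not exhaustive.
ANCHOR_PATTERNS = [
    r"\$?\d[\d,.]*\s*(?:%|per day|per year|days?|years?|USD|EUR)?",
    r"\b(?:unless|except|must not|shall not|no longer|not applicable)\b",
    r"\b(?:19|20)\d{2}\b",
]

def extract_anchors(text: str) -> set[str]:
    anchors = set()
    for pattern in ANCHOR_PATTERNS:
        anchors.update(m.strip() for m in re.findall(pattern, text, flags=re.IGNORECASE) if m.strip())
    return anchors

def missing_anchors(source: str, compressed: str) -> set[str]:
    # Anchors present in the source but absent from the compressed version.
    return extract_anchors(source) - extract_anchors(compressed)

source = ("Meals may be expensed up to $50 per day, except during "
          "international travel, where the limit is $120 per day.")
summary = "Meals can be expensed up to $50 per day."

print(missing_anchors(source, summary))
# Flags the dropped exception and the $120 limit.
```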


The Sceptic's Case


There is a reasonable objection to all of this, and it goes something like this: context windows are getting enormous. Gemini handles millions of tokens. Claude processes hundreds of thousands. If the window is big enough, why compress at all? Just paste everything in and let the model sort it out.

This is the most seductive wrong answer in the field.


It's wrong for two reasons, one obvious and one subtle.


The obvious reason is cost and latency. Filling a million-token window on every API call is expensive and slow. But this is an engineering trade-off, not a fundamental constraint. Costs fall. Latency improves. If this were the only objection, the sceptics would eventually be right.

The subtle reason is the one that matters: a model that actually uses more of its context becomes *more* sensitive to the quality of that context, not less.

Think about it this way. A weak model with a short context window is like a student who reads only the first paragraph of each source. If you hand that student a contradictory document, they'll probably ignore it — they weren't going to read that far anyway. A strong model with a long context window is like a meticulous researcher who reads everything you give them. Hand *that* researcher contradictory documents, and they won't ignore the contradiction. They'll engage with it. They'll try to reconcile it. And if the contradiction is between a current policy and a superseded one, they may choose the wrong side — not because they're careless, but because you gave them no signal about which document should win.

The failure mode migrates. With weaker models, the risk was that context got ignored. With stronger models, the risk is that bad context gets faithfully obeyed. Better models don't solve context quality problems. They amplify the consequences of getting context wrong.

This is the argument that should keep AI engineers up at night. Every improvement in model capability is also an improvement in the model's ability to follow bad instructions, draw from stale sources, and act on contradictory evidence — if that's what the context contains.


What the Model Remembers (And Why It Matters)

There is a further wrinkle, one that turns a straightforward engineering problem into something genuinely hard.


Large language models don't arrive empty. They arrive with knowledge — vast, detailed, and frozen at the moment their training data was cut. Ask a model about a regulation that was updated last year, and it may answer confidently from its training data, citing the old version as if it were still in force. This isn't a bug. It's the expected behaviour of a system whose knowledge is parametric — baked into its weights, not looked up in real time.


The purpose of context, in this regime, is not merely to inform. It is to *override*. You are presenting evidence that contradicts the model's built-in beliefs, and you are asking it to update.


Research on knowledge conflicts — the technical literature uses phrases like *parametric-contextual conflicts* — suggests that models resolve these inconsistently. Sometimes the context wins. Sometimes the training data wins. The outcome depends on factors that are difficult to predict or control: how strongly the model encoded the original information, how similar the new information's vocabulary is to the old, how the context is positioned in the window.

This is where a technique called *override context* becomes important — and where the line between prompt engineering and context engineering gets interestingly blurry.


Override context is not a description of the new policy. It's a description of the *change*. It says: *"As of July 2025, Standard X replaced Standard Y. Obligations under the prior framework are no longer in effect. Where this document conflicts with prior information, this document is authoritative."*

That paragraph carries almost no positive information about what the new standard requires. What it carries is a targeted contradiction of the model's prior knowledge. It names the old framework. It states that it no longer applies. It establishes the authority chain. It is context designed not to inform the model, but to compete with what the model already believes.
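
Generating such a block is, in code, almost trivial; the hard part is having the supersession metadata to generate it from. A sketch, assuming a hypothetical record of what replaced what:

```python
from dataclasses import dataclass

@dataclass
class Supersession:
    # Hypothetical record of one document replacing another.
    old_name: str
    new_name: str
    effective_date: str  # e.g. "July 2025"

def build_override_context(s: Supersession) -> str:
    # Carries almost no positive information about the new standard;
    # its job is to contradict what the model may already believe.
    return (
        f"As of {s.effective_date}, {s.new_name} replaced {s.old_name}. "
        f"Obligations under {s.old_name} are no longer in effect. "
        f"Where this document conflicts with prior information, "
        f"this document is authoritative."
    )

print(build_override_context(
    Supersession(old_name="Standard Y", new_name="Standard X", effective_date="July 2025")
))
```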


Is this prompt engineering? In a narrow sense, yes — you're crafting text that shapes the model's behaviour. But calling it prompt engineering misses the point in the same way that calling a GPS a "map" misses the point. The GPS doesn't just show you where things are. It shows you where things are *relative to where you are right now, in real time, with traffic and road closures and construction zones taken into account*. Override context is navigational. It's not about the destination. It's about the gap between where the model thinks it is and where reality actually is.


The striking thing is how few production systems do this. When a document is updated in a knowledge base, the old version is quietly replaced or, worse, left alongside the new one. Change summaries aren't generated. Supersession notices aren't indexed. The system asks *what's relevant to this query?* but never asks *what might the model get wrong about this query without intervention?*

That second question is where the real engineering begins.


Context as an Alignment Surface

There is a deeper implication here, one that reaches beyond information retrieval into questions about safety, trust, and institutional accountability.


When you construct a context window, you are shaping the model's effective behaviour for that single inference call. Not its weights. Not its training. Its *situational behaviour* — which documents it sees, what it treats as authoritative, what constraints it operates under. This is alignment through information design, not alignment through training. It is temporary, specific, and enormously powerful.


Most AI security discussions focus on the instruction layer: preventing prompt injection, filtering dangerous outputs, establishing instruction hierarchies. These matter. But if context composition is the dominant variable in output quality — and in knowledge-intensive systems, it is — then the highest-leverage security surface is the context pipeline itself.


Consider: a context window can contain zero malicious inputs and still produce harmful outputs. If individually legitimate but mutually contradictory documents create an incoherent knowledge state, the model will draw conclusions from that incoherence. Input filtering won't catch this, because no individual input is malicious. The failure is in the *composition* — in the relationship between the pieces, not in the pieces themselves.
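
A first-pass defence is a composition check that runs over the assembled context before the model ever sees it. A minimal sketch, assuming each retrieved chunk carries a policy identifier and version metadata, which in many pipelines is itself the missing piece:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Chunk:
    policy_id: str
    version: int
    effective_date: str
    text: str

def composition_conflicts(chunks: list[Chunk]) -> list[str]:
    # Flags cases where the assembled context contains more than one version
    # of the same policy. No individual chunk is malicious; the problem is
    # the combination.
    by_policy: dict[str, set[int]] = defaultdict(set)
    for c in chunks:
        by_policy[c.policy_id].add(c.version)
    return [
        f"Context contains versions {sorted(versions)} of policy {pid}; "
        f"keep only the latest or add an override notice."
        for pid, versions in by_policy.items() if len(versions) > 1
    ]

context = [
    Chunk("EXP-7", 2, "2023-01-01", "Meals up to $50 per day..."),
    Chunk("EXP-7", 3, "2024-06-01", "Meals up to $65 per day..."),
]
print(composition_conflicts(context))
```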


This is the kind of failure that scales. An individual wrong answer about an expense policy is a nuisance. An individual wrong answer about a regulatory obligation, a medical guideline, or a legal constraint can be consequential. A *systematic* pattern of wrong answers — caused not by a flawed model but by a flawed context pipeline — can undermine the institutional trust on which AI adoption depends.


And this is ultimately what makes context engineering a discipline rather than a technique. It's not just about getting better answers to individual queries. It's about building systems whose relationship to truth is reliable, auditable, and maintainable over time. Systems where you can point to the context that produced a decision and verify that it was current, authoritative, coherent, and complete. Systems where changing the retrieval strategy or the summarisation step triggers a regression test, because you understand that you've changed the codec through which reality reaches the model.
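
That regression test is written against the pipeline, not the model. A sketch in the style of pytest, reusing the hypothetical `build_context` helper from the earlier pipeline sketch and a golden query whose anchors must survive:

```python
# Regression tests over the context pipeline, not the model. Assumes the
# build_context() sketch from earlier; the golden case below is illustrative.

POLICY_PAGES = [
    "Meals may be expensed up to $50 per day, except during international "
    "travel, where the limit is $120 per day."
]

GOLDEN_CASES = [
    {
        "query": "What is the meal limit for international travel?",
        "must_survive": ["$120 per day", "except"],
    },
]

def test_anchors_survive_pipeline():
    for case in GOLDEN_CASES:
        context = build_context(POLICY_PAGES, case["query"])
        for anchor in case["must_survive"]:
            # If a change to chunking, retrieval, or truncation drops a
            # load-bearing anchor, this fails before the change ships.
            assert anchor.lower() in context.lower(), f"lost anchor: {anchor!r}"
```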


The Discipline We Don't Yet Have

Context engineering, as a discipline, is early. The term is established. The core observation — that context assembly has more leverage than instruction design for most knowledge-intensive tasks — is broadly accepted among practitioners who build production systems. But the tooling is immature. The theory is fragmented. There is no standard way to measure the distortion introduced by a retrieval pipeline. There is no standard way to introspect what a model's parametric knowledge contains for a given topic. There is no standard way to verify that a context window is internally coherent.


The sceptics are right that much of this overlaps with existing retrieval and data-engineering practice. They are right that the compression metaphor, taken too literally, can mislead. They are right that there is a risk of terminological inflation — of dressing up ordinary pipeline work in the language of information theory to make it sound more profound than it is.


But the sceptics are wrong about what follows from these observations. The fact that context engineering draws on existing disciplines doesn't mean it isn't its own discipline. Software engineering draws on mathematics, electrical engineering, and management science; that doesn't make it a rebrand of any of them. The question isn't whether context engineering introduces entirely novel techniques. The question is whether it names a *design surface* that deserves its own attention, its own tooling, its own evaluation frameworks, and its own expertise. The answer, increasingly, is yes.

The alternative is to continue treating context as a commodity — as something that gets assembled by a retrieval pipeline and dumped into a prompt, with quality assessed by vibes and post-hoc debugging. This is where most of the industry still is. It is not where the industry can afford to stay.


Coda: The Editor's Eye

There's an old principle in journalism: the quality of an article depends less on what the writer includes than on what the editor cuts.


A good editor doesn't just remove bad sentences. They remove good sentences that don't serve the piece — sentences that are true, well-crafted, and interesting, but that dilute the argument, distract the reader, or introduce ambiguity where clarity is needed. The editor's job is compression. Not compression in the mechanical sense of making things shorter, but compression in the editorial sense of making every remaining word do maximum work.


Context engineering is the editor's discipline applied to machines. It asks: given everything the model could see, what *should* it see? What serves the decision at hand? What introduces noise? What contradicts? What is stale? What is authoritative?


The models are getting better. They can read more, synthesise more, reason more. But reading more has never been a substitute for reading the right things. The most well-read person in the room is not necessarily the wisest — not if half their sources are outdated, a quarter contradict each other, and the most important one is buried at the bottom of the pile.


The context window is not a filing cabinet. It is an editorial page. And like all editorial decisions, what you choose to leave out will matter more than what you choose to put in.


We are building machines that make decisions by reading. The least we can do is take seriously what we give them to read.
