
The Architecture Thesis: What Neuroscience Knows That AI Ignores

Why the next leap in artificial intelligence will come from structure, not scale — and why a century of brain science has already drawn the blueprint.


I. The Wrong Race

There is a fight happening in artificial intelligence, and most people are watching the wrong part of it.

The headlines track parameter counts. Who has the biggest model. Who spent the most on compute. Which GPU cluster drew enough power to brown out a small city. And to be fair, scale has delivered. GPT-4 is genuinely more capable than GPT-3. Claude has improved meaningfully between generations. The scaling laws described by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) in the Chinchilla work are real, well-documented, and have guided billions of dollars of investment.

But something has shifted. On specific capability dimensions — novel reasoning, out-of-distribution generalisation, genuine compositional understanding — the returns from additional scale are flattening. Not because we've exhausted what's possible, but because we may be exhausting what's possible with this architecture alone.

That is a different and more interesting claim than "AI has plateaued." It hasn't. But the path to the next meaningful leap may not run through bigger models. It may run through better ones.

For a long time, we've assumed that bigger equals better when it comes to intelligence. It's not entirely wrong — brain volume correlates with IQ, though less strongly than you'd think. Several meta-analyses place it around r = 0.3–0.4. But if more neurons were all that mattered, whales and elephants would be writing philosophy books instead of us. Correct for body mass, and we'd all be living under the dominion of our shrew overlords.

Size isn't the answer. Architecture is. And the only system that has demonstrably solved general intelligence — the human brain — has been trying to tell us this for a century.


II. What 1.4 Kilograms Knows

Your brain runs on roughly 20 watts. A frontier model running inference in a modern data centre consumes megawatts. That's an energy-efficiency gap of several orders of magnitude for systems that, on many cognitive benchmarks, perform comparably.

A common objection is that the fair comparison is energy per cognitive operation, not total system draw, since data centres serve thousands of simultaneous requests. Fair enough. Adjust for that, and the gap narrows to perhaps two or three orders of magnitude. But two or three orders of magnitude is still a staggering deficit. If a competitor achieved comparable performance at 100–1000x lower energy cost, you wouldn't celebrate the narrowed gap. You'd conclude your engineering approach has a fundamental problem.
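To make that adjustment concrete, here is the back-of-envelope arithmetic as a tiny script. Every figure in it (cluster power, concurrent requests) is an assumption chosen for illustration, not a measurement of any real deployment.

```python
# Back-of-envelope only: all figures below are illustrative assumptions,
# not measurements of any real model or data centre.
BRAIN_POWER_W = 20            # rough estimate for a human brain
CLUSTER_POWER_W = 10e6        # assumed 10 MW inference cluster
CONCURRENT_REQUESTS = 5_000   # assumed simultaneous requests served

watts_per_request = CLUSTER_POWER_W / CONCURRENT_REQUESTS   # 2,000 W on these numbers
gap = watts_per_request / BRAIN_POWER_W                     # ~100x

print(f"Per-request draw: {watts_per_request:.0f} W")
print(f"Gap vs a 20 W brain: ~{gap:.0f}x, i.e. two orders of magnitude")
```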

And the "evolution had 600 million years" rebuttal cuts both ways. The brain had more optimisation time, certainly. But it also had far more constraints: electrochemical signalling speeds six orders of magnitude slower than electronic circuits, unreliable components, a wet and temperature-sensitive substrate, powered by glucose metabolism. Silicon has every physical advantage except architecture. If you're still less efficient by orders of magnitude with a faster, more reliable, more controllable substrate, your architecture is the bottleneck — not your compute budget.

This isn't a fun fact. It's an engineering indictment. And the neuroscience explains exactly why.

The Dimensionality Trick

The lateral prefrontal cortex contains neurons with a property that should make every AI researcher sit up: mixed selectivity. Rather than responding to single features — this neuron fires for "red," that one for "loud" — these neurons respond to complex, nonlinear combinations of features. A single neuron might fire for "red and loud and unexpected" but not for any of those properties in isolation.

This creates extraordinarily high-dimensional representations with relatively few neurons. It's a compression trick that allows the brain to encode a vast space of possible cognitive states efficiently. And it maps directly onto a question that matters in AI: how do you get rich, flexible representations without brute-forcing them through billions of parameters?
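A toy sketch makes the dimensionality point concrete. It is in the spirit of the mixed-selectivity literature (e.g. Rigotti et al., 2013) rather than a reproduction of any model: neurons that respond linearly to individual features produce a low-rank population code, while neurons that apply a nonlinearity to random mixtures of the same features span nearly the full space of task conditions.

```python
# Toy illustration: compare the dimensionality (matrix rank) of population
# responses for purely selective vs. nonlinearly mixed-selective neurons.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

n_features = 4
conditions = np.array(list(product([0.0, 1.0], repeat=n_features)))  # 16 task conditions
n_neurons = 64

# Pure selectivity: each neuron is a linear readout of individual features.
pure = conditions @ rng.normal(size=(n_features, n_neurons))

# Mixed selectivity: each neuron applies a nonlinearity to a random feature mixture.
mixed = np.tanh(conditions @ rng.normal(size=(n_features, n_neurons)) + rng.normal(size=n_neurons))

print("rank, purely selective:", np.linalg.matrix_rank(pure))   # capped at n_features
print("rank, mixed selective: ", np.linalg.matrix_rank(mixed))  # approaches the number of conditions
```

Same number of neurons, far richer space of separable cognitive states.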

Transformer attention heads approximate something similar. Multi-head attention creates different representational subspaces that capture different relational patterns. But there's a crucial difference: in biological mixed selectivity, the combinations are learned through embodied interaction with a structured world. In transformers, they're learned through statistical co-occurrence in text. The representations look similar in form but are grounded in fundamentally different ways.
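For readers who want the mechanism spelled out, here is a minimal single-layer multi-head self-attention in numpy. The shapes and the softmax-weighted mixing are the standard ones; the random weights and tiny dimensions are purely illustrative, and notice that nothing about grounding appears anywhere in it. That absence is the point.

```python
# Minimal multi-head self-attention in numpy: each head projects the input
# into its own low-dimensional subspace and computes its own relational pattern.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 8, 64, 4
d_head = d_model // n_heads     # each head works in a 16-dimensional subspace

x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = [0.1 * rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3)]

heads = []
for h in range(n_heads):
    q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]      # project into head h's subspace
    attn = softmax(q @ k.T / np.sqrt(d_head))      # head-specific relational pattern
    heads.append(attn @ v)

out = np.concatenate(heads, axis=-1)               # (seq_len, d_model): subspaces recombined
print(out.shape)
```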

This distinction — the grounding problem — is not a footnote. It may be the central issue in artificial intelligence. A text-only model has weak grounding: statistical regularities that capture relational structure between concepts without sensory referents. A multimodal model trained on vision, audio, and language has stronger grounding: representations anchored to perceptual features, not just linguistic co-occurrence. A model with tool use and environmental interaction has stronger grounding still: representations shaped by consequences, not just correlations.

We don't need to solve embodied cognition in the philosophical sense to make progress. We need representations anchored to richer, more structured information sources than text alone. This is exactly what multimodal training, world models, and architectures like Yann LeCun's JEPA are pursuing. The neuroscience tells us what "better" looks like: high-dimensional, multimodal, prediction-error-driven, and shaped by interaction rather than observation.

Express Highways

Von Economo neurons (VENs) — large, spindle-shaped cells found in the anterior cingulate cortex and frontal insula — serve as the brain's long-range integration pathways. They're fast. They connect distant brain regions with minimal synaptic delay. And their density correlates with fluid intelligence.

Here's the part that complicates a simple "humans are special" narrative: VENs are also found in whales, elephants, and great apes. Far from undermining the argument, this strengthens it. Every species with sophisticated social cognition and complex behavioural flexibility has independently evolved dedicated fast-integration pathways. Five lineages separated by tens of millions of years of independent evolution, converging on the same solution. That's not coincidence. That's constraint satisfaction. When the same architectural feature evolves independently multiple times under selection pressure for complex cognition, you're looking at a design principle, not an accident.

But suppose the sceptic persists. Maybe VENs are correlated with complex cognition but not causally involved — a byproduct of brain size or cortical folding. Fine. The functional requirement they fulfil — fast, long-range integration between distant processing regions — is independently supported by lesion studies, structural connectivity analyses, and the consistent finding that white matter tract integrity predicts fluid intelligence. The cells are interesting. The function is what matters for engineering.

Current transformer architectures lack an equivalent. Information flows through uniform attention layers with no dedicated fast-integration channels. Every token attends to every other token with the same mechanism, regardless of whether the task demands rapid global integration or slow local processing. It's as if the brain processed a snap decision about a predator through the same circuitry it uses to compose a sonnet.

Prediction as Architecture

Karl Friston's free energy principle — and the predictive coding framework it supports — offers one account of how the brain achieves its energy efficiency. The core idea: the brain continuously generates predictions about incoming sensory information and only fully processes the prediction errors. Most of the world is predictable, so most of the time, the brain is running on fumes.

Let me be precise about the epistemological status of this claim. Predictive coding as a theory of how the brain actually works is contested. The specific claim that cortical columns implement hierarchical Bayesian inference through reciprocal message-passing is supported by some evidence and contradicted by other evidence. The jury is out.

Predictive coding as an engineering strategy — allocate more computation to unexpected inputs, less to predictable ones — is independently validated. It's the principle behind delta encoding in video compression, anomaly detection in cybersecurity, exception-based processing in event-driven systems, and predictive prefetching in CPU design. You don't need Friston to be right about cortical microcircuits to know that spending equal resources on expected and unexpected inputs is wasteful.

The neuroscience inspired the engineering principle. The engineering principle stands on its own merits. And it's notably absent from current LLM architectures, which process every token with equal computational weight — the word "the" for the millionth time receiving the same resources as a genuinely novel concept.
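As a sketch of what that principle could look like in code: score each token's surprisal under a cheap predictor and reserve the expensive path for the surprises. The unigram frequency table, the routing labels, and the 8-bit threshold below are all invented for illustration; nothing here describes a production architecture.

```python
# Hypothetical surprise-gated routing: a cheap predictor scores each token,
# and only high-surprisal tokens are marked for the expensive path.
import math
from collections import Counter

def surprisal_bits(p: float) -> float:
    return -math.log2(max(p, 1e-9))

def route(tokens, freq, threshold_bits=8.0):
    """Label each token 'cheap' if it is predictable, 'heavy' otherwise."""
    total = sum(freq.values())
    routed = []
    for tok in tokens:
        p = freq.get(tok, 0) / total                  # toy predictor: unigram frequency
        path = "heavy" if surprisal_bits(p) > threshold_bits else "cheap"
        routed.append((tok, path))
    return routed

corpus = ("the cat sat on the mat and the dog sat on the rug " * 50).split()
freq = Counter(corpus)
print(route("the cat pondered axiomatic set theory".split(), freq))
# 'the' and 'cat' take the cheap path; the unseen words get the heavy one.
```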


III. The Pattern Matching Question

Let me address the elephant in the room directly.

The claim that LLMs are "just pattern matching" was reasonable shorthand in 2022. Today, it's reductive to the point of being misleading. Chain-of-thought prompting, o1-style extended reasoning, tool use, and agentic architectures have meaningfully expanded what these systems can do. Dismissing all of that as "just patterns" is like dismissing the brain as "just electrochemistry" — technically true, profoundly unhelpful.

But these advances extend the autoregressive paradigm rather than escape it. This is the critical nuance that gets lost in the discourse.

Chain-of-thought works by generating intermediate tokens that condition subsequent generation. It's still prediction over a learned distribution; it just gives the model more sequential steps. The o1 family spends extra latency on repeated sampling and verification, which dramatically improves reliability but doesn't change the underlying computation. Tool use is genuinely different — but when a model calls a Python interpreter to do arithmetic, it's the interpreter reasoning, not the model.
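To be precise about what "repeated sampling and verification" means mechanically, here is a generic best-of-n sketch. It is not a claim about how any particular product works internally; generate and verify are placeholder callables, and the toy arithmetic task exists only so the snippet runs.

```python
# Generic best-of-n: sample several candidates, keep the one the verifier
# scores highest. More sequential sampling, same underlying computation.
import random

def best_of_n(prompt, generate, verify, n=8, temperature=0.8):
    candidates = [generate(prompt, temperature=temperature) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))

# Toy stand-ins so the sketch runs: "generation" guesses a sum with noise,
# the "verifier" rewards answers closer to the correct value.
def toy_generate(prompt, temperature):
    return 17 + 25 + random.choice([-2, -1, 0, 0, 1, 2])

def toy_verify(prompt, answer):
    return -abs(answer - 42)

random.seed(0)
print(best_of_n("17 + 25 = ?", toy_generate, toy_verify))   # usually prints 42
```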

None of this means these techniques aren't valuable. They clearly are. The question is whether the autoregressive transformer, however augmented, can reach general intelligence through extensions alone — or whether something structurally different is needed.

Here's one reason to think it can't.


IV. Speed: The Forgotten Variable

The relationship between processing speed and intelligence is one of the most robust and underappreciated findings in cognitive science.

Early studies revealed that the correlation between reaction time and IQ increases as tasks grow more complex. Simple reaction time? Modest correlation. Choice reaction time with two options? Stronger. Four options? Stronger still. Eight? The relationship becomes remarkably tight. Processing speed matters more as cognitive demands rise.

Inspection time tasks make this even clearer. These strip away the motor component entirely — participants simply discriminate between two briefly presented stimuli. No button-pressing speed involved. The correlation between inspection time and fluid intelligence is robust and well-replicated. It's not about reflexes. It's about how rapidly the brain processes information at its most fundamental level.

Now consider how current AI handles increasing complexity. A frontier LLM facing a harder problem doesn't process faster or recruit additional parallel pathways. It generates more tokens sequentially. The o1 approach leans into this — more serial computation for harder problems. It works.

But it's the opposite of what biological intelligence does.

The brain handles complexity through parallel breadth: recruiting more neural populations simultaneously, integrating information across more pathways at once. AI handles it through serial depth: more sequential steps through the same architecture.

A common objection: transformers are parallel. A forward pass computes attention simultaneously across all heads and positions. This is true, and it's irrelevant. That's computational parallelism — parallelism of hardware execution. What I'm describing is cognitive parallelism: the ability to simultaneously maintain and integrate multiple independent processing streams operating at different timescales and levels of abstraction.

The brain doesn't just process inputs in parallel. It runs multiple cognitive operations in parallel: perceptual processing, emotional evaluation, memory retrieval, motor planning, predictive modelling — all simultaneously, all feeding into each other through fast lateral and top-down connections. A transformer generates a single stream of tokens representing a single chain of reasoning. No matter how parallel the internal computation, the cognitive architecture is sequential at the generation level.

Speculative decoding and parallel decoding schemes don't solve this. They're throughput optimisations — speed hacks that predict what the next tokens will be and verify in batch. They don't enable parallel cognition. The distinction matters because it determines what kinds of problems a system can solve efficiently. Serial cognition handles well-defined, decomposable problems. Parallel cognition handles ill-defined, holistic problems that require simultaneously balancing multiple constraints that can't be neatly factored. The latter is where human intelligence excels. It's where LLMs consistently struggle, regardless of scale.
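Here is speculative decoding stripped to its skeleton, which is why it speeds up generation without adding any new cognitive stream. Real implementations use a probabilistic acceptance rule over the two models' distributions; this greedy version and the toy next-token functions are simplifications for illustration only.

```python
# Simplified speculative decoding: a cheap draft model proposes a block of
# tokens, the target model checks them, and the matching prefix is accepted.

def speculative_step(context, draft_next, target_next, k=4):
    # Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Target model verifies the proposed block (one batched pass in practice).
    accepted, ctx = [], list(context)
    for tok in proposed:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    accepted.append(target_next(ctx))   # the target always contributes one token
    return accepted

# Toy models over a fixed phrase: the draft gets most tokens right.
PHRASE = "the quick brown fox jumps over the lazy dog".split()
def target_next(ctx):
    return PHRASE[len(ctx)] if len(ctx) < len(PHRASE) else "<eos>"
def draft_next(ctx):
    return "cat" if len(ctx) == 3 else target_next(ctx)   # the draft errs at position 3

print(speculative_step(["the", "quick"], draft_next, target_next))
# 'brown' is accepted from the draft; 'fox' is the target's own correction.
```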


V. The Hybrid Path

If this argument holds, the implication isn't that we should abandon transformers. They're brilliant. The implication is that we need architectures combining different computational strategies, orchestrated intelligently.

The industry is already groping toward this, even if it isn't framing it in neuroscience terms.

What Already Exists

Mixture-of-experts architectures are the most direct existing response to the efficiency problem. Mixtral, Switch Transformer, and reportedly the internals of several frontier models route inputs to specialised sub-networks rather than activating every parameter for every token. It's a primitive form of the brain's strategy of recruiting task-relevant circuits rather than running everything through a general-purpose pipeline. The Chinchilla insight — that most models were over-parameterised and under-trained — pointed in this direction. Sparse activation goes further.
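A minimal sketch of the routing idea, assuming nothing about any particular model's internals: a learned gate scores the experts for each token, and only the top-k actually run, so most parameters stay idle on any given input.

```python
# Schematic top-k mixture-of-experts routing in numpy.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 32, 8, 2

# Each "expert" is just a small weight matrix here.
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_experts)]
gate_W = rng.normal(scale=0.1, size=(d_model, n_experts))

def moe_layer(x):                                   # x: (tokens, d_model)
    logits = x @ gate_W                             # gating scores per token and expert
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        chosen = np.argsort(logits[i])[-top_k:]     # the top-k experts for this token
        weights = np.exp(logits[i, chosen])
        weights /= weights.sum()
        for w, e in zip(weights, chosen):
            out[i] += w * (tok @ experts[e])        # only 2 of 8 experts ever run
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)                      # (4, 32)
```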

Tool-augmented generation is hybrid architecture in action. When a model calls a code interpreter for arithmetic, queries a search engine for current information, or uses a calculator, it's delegating what it's bad at to purpose-built systems while handling what it's good at — language understanding, task decomposition, output formatting. In earlier thinking, I made the mistake of framing this as a failure. It's not. It's evidence that the field is already building specialised, coordinated systems. It needs to go much further.
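The delegation pattern is simple enough to show in a few lines. The routing rule and the tiny arithmetic evaluator below are invented for illustration; real tool-use APIs are more elaborate, but the division of labour is the same.

```python
# Toy tool delegation: the "model" handles decomposition and formatting,
# and hands the arithmetic to a purpose-built evaluator.
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str) -> float:
    """Tiny arithmetic evaluator: the purpose-built tool."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def answer(question: str) -> str:
    # Stand-in for the language model's role: spot what belongs to the tool,
    # delegate it, then format the result.
    if any(c.isdigit() for c in question):
        expr = "".join(c for c in question if c in "0123456789.+-*/ ").strip()
        return f"The result is {safe_eval(expr)}."
    return "No arithmetic found; the model would answer this directly."

print(answer("What is 1234 * 5678 ?"))   # The result is 7006652.
```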

Agentic frameworks — LangGraph, CrewAI, Semantic Kernel — are building the coordination layer. Multiple models with different capabilities, combined with tools, knowledge graphs, and formal logic systems, orchestrated by planning and routing layers. It's messy and early, but the direction aligns with the thesis.

What's Missing

The fast-integration layer — the VEN equivalent. Current multi-agent and tool-augmented systems coordinate through text: one model generates output, another reads it, decisions flow through sequential message-passing. It's slow, lossy, and scales poorly.

Biological intelligence doesn't work this way. The brain's fast-integration pathways carry compressed, high-bandwidth signals between specialised regions with minimal delay. We need something analogous: a coordination mechanism that operates below the level of full language generation, enabling rapid routing and integration without the overhead of serialising everything through tokens.
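What might that look like? A deliberately speculative sketch: specialised modules that exchange small fixed-size latent vectors over a shared channel instead of serialising everything through generated text. Nothing below corresponds to a published system; the Module class, the 16-dimensional channel, and the update rule are all invented to show the shape of the idea.

```python
# Speculative sketch: modules coordinate through a compressed latent channel
# rather than through generated text.
import numpy as np

rng = np.random.default_rng(0)
D_LATENT = 16                                 # assumed width of the shared channel

class Module:
    """A specialised component with its own state and a latent read/write port."""
    def __init__(self, d_state):
        self.state = rng.normal(size=d_state)
        self.read_W = rng.normal(scale=0.1, size=(D_LATENT, d_state))
        self.write_W = rng.normal(scale=0.1, size=(d_state, D_LATENT))

    def write(self):                          # compress state onto the shared channel
        return np.tanh(self.state @ self.write_W)

    def read(self, latent):                   # integrate what the other modules broadcast
        self.state += 0.1 * (latent @ self.read_W)

perception, planning, memory = Module(64), Module(48), Module(96)
modules = [perception, planning, memory]

for _ in range(5):                            # a few fast integration cycles
    channel = np.mean([m.write() for m in modules], axis=0)   # shared latent bus
    for m in modules:
        m.read(channel)

print(channel.shape)                          # (16,): a few floats per cycle, not paragraphs of text
```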

Neuromorphic hardware is the most direct attempt. Intel's Loihi 2 supports sparse, event-driven, spiking computation; IBM's NorthPole co-locates memory and compute on-chip. Both are closer in spirit to biological neural circuits than GPUs are.

The common objection: neuromorphic chips have been around for over a decade and haven't displaced GPUs. This is like asking why electric cars hadn't displaced combustion engines in 2010. The answer is ecosystem, tooling, and investment momentum — not fundamental capability. Neuromorphic chips have demonstrated 100–1000x energy efficiency advantages on specific workloads. They haven't displaced GPUs for LLM training because the entire software stack — CUDA, PyTorch, the transformer itself — was co-designed with GPU parallelism. You don't evaluate a fish by its ability to climb a tree. The relevant question is whether neuromorphic principles can be incorporated into next-generation AI architectures — and the growing body of work on hybrid digital-analog systems suggests they can.

Adaptive computation is the software-level equivalent. Where current models process every token with equal weight, future architectures might allocate computation dynamically — more for surprising inputs, less for predictable ones. Early work on adaptive computation time (Graves, 2016) and recent speculative decoding approaches gesture in this direction. We're a long way from anything resembling the brain's predictive efficiency, but the principle is sound and the engineering path is visible.
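A stripped-down version of the halting idea, in the spirit of Graves (2016) rather than a faithful reimplementation: a recurrent step is applied repeatedly, and a learned halting score decides how many steps each input receives. The weights here are random, so the particular step counts it prints are only illustrative.

```python
# Adaptive computation time, minimally: iterate a step function until the
# accumulated halting probability crosses a threshold.
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_step = rng.normal(scale=0.5, size=(d, d))
w_halt = rng.normal(scale=0.5, size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_steps(x, max_steps=10, eps=0.01):
    """Run the step function until cumulative halting probability reaches 1 - eps."""
    state, cum_halt, steps = x.copy(), 0.0, 0
    while cum_halt < 1.0 - eps and steps < max_steps:
        state = np.tanh(state @ W_step)
        cum_halt += sigmoid(state @ w_halt)    # halting unit: "am I done yet?"
        steps += 1
    return state, steps

input_a, input_b = rng.normal(size=d), rng.normal(scale=3.0, size=d)
print("steps for input A:", adaptive_steps(input_a)[1])
print("steps for input B:", adaptive_steps(input_b)[1])
# The halting score, not a fixed depth, decides how much computation each input gets.
```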


VI. Addressing the Bitter Lesson

No serious essay arguing that neuroscience should inform AI architecture can avoid engaging with Rich Sutton's "Bitter Lesson" — the argument that methods leveraging raw computation consistently outperform methods leveraging human domain knowledge.

It's an important observation with genuine historical support. And it's wrong as a general principle.

Here's why: every example Sutton cites involves scaling a fixed architecture. Chess engines got better by searching more positions. Speech recognition improved with more data and compute. Computer vision improved with larger networks. In every case, the architecture itself — minimax search, HMMs, CNNs — was designed using human knowledge about the domain or about information processing. The scaling happened within an architecture that human insight created.

The bitter lesson isn't "human knowledge is useless." It's "once you have the right architecture, scale beats hand-crafted features." That's an argument for architecture, not against it.

Sutton's own examples prove this. Convolutional neural networks were directly inspired by Hubel and Wiesel's work on the visual cortex — hierarchical feature detection, receptive fields, spatial invariance. That wasn't hand-waving. It was the most productive cross-pollination in the history of machine learning. Attention mechanisms bear more than a passing resemblance to selective attention in cognitive neuroscience. Replay buffers in reinforcement learning were inspired by hippocampal replay during sleep. Dropout mirrors stochastic neural firing. Network pruning mirrors developmental synaptic pruning. Temperature sampling mirrors noise-driven exploration in biological decision-making. Adaptive learning rates serve the functional role of neurochemical modulation — adjusting the gain and routing of information processing based on context.

The architectures the bitter lesson crowd wants to scale came from neuroscience in the first place. The cupboard hasn't merely been opened — it's been supplying the field's foundational innovations for decades. But it's far from empty. Mixed selectivity, predictive coding, fast long-range integration, parallel cognitive streams, energy-proportional computation — these are principles that haven't yet been seriously imported.

The bitter lesson says: find the right architecture, then pour compute into it. I'm arguing we haven't found it yet. These positions are entirely compatible.


VII. The Counterargument I Can't Dismiss

There is one objection to this entire line of reasoning that I want to engage with honestly.

Emergent capabilities.

The observation that new abilities appear unpredictably at scale — that a model trained on 10x more data doesn't just get 10% better but suddenly acquires qualitatively new capabilities — is genuinely challenging. If you can't predict what the next order of magnitude will unlock, arguing for any kind of ceiling is epistemically risky. The history of AI is littered with confident predictions about what neural networks could never do, and those predictions have aged poorly.

I don't think this invalidates the architecture thesis. But it constrains it, and I want to state the constrained version precisely:

Even if emergent capabilities continue to appear at scale, architecture-level improvements informed by neuroscience could dramatically accelerate their arrival, reduce the compute required to reach them, and unlock capabilities that pure scaling might never produce regardless of resources.

Is this weaker than "scaling is dead"? Yes. Is it true? I believe so, and here's the structural reason.

Emergence in neural networks isn't magic. It results from phase transitions in the loss landscape — points where quantitative increases in capability produce qualitative changes in behaviour. These phase transitions are architecture-dependent. Different architectures have different loss landscapes, different critical thresholds, and different emergent behaviours. An architecture better suited to a task will hit its phase transitions earlier — at lower scale.

If the transformer is a suboptimal architecture for general intelligence — and the neuroscience arguments above suggest it may be — then we're climbing the scaling curve of the wrong loss landscape. We might eventually reach impressive emergent capabilities. But a better architecture would reach the same capabilities sooner, cheaper, and more reliably — and might reach capabilities that the transformer's loss landscape simply doesn't contain at any scale.

That's not a hedge. That's a falsifiable prediction: that architectural innovation will deliver capability jumps that equivalent scaling investments in dense autoregressive transformers will not.


VIII. The Thesis

Let me state it plainly, without hedging.

The transformer architecture, for all its brilliance, is a brute-force solution to intelligence. It works by throwing computation at every part of a problem equally, scaling through parameter count and data volume, and hoping that statistical regularities in training data capture enough structure to approximate cognition. It has taken us remarkably far. It will not take us the rest of the way.

The brain is proof that a radically different solution exists. Sparse. Predictive. Parallel. Energy-efficient. With dedicated fast-integration pathways and high-dimensional mixed representations that emerge from structured interaction with the world.

Every engineering lesson we've successfully imported from neuroscience — convolutions, attention, replay, pruning, noise injection, adaptive modulation — has delivered outsize returns. The remaining lessons — mixed selectivity as a representational principle, predictive coding as a compute-allocation strategy, Von Economo-style fast integration as a coordination mechanism, parallel cognitive streams as an architectural pattern — are at least as promising, and none has been seriously attempted at scale.

The next architecture that matters will not be a bigger transformer. It will be a system that processes the predictable cheaply and the surprising deeply. That maintains multiple parallel cognitive streams integrated by fast, sub-linguistic coordination. That builds representations grounded in interaction, not just observation. That allocates its resources the way a brain does — stingily, strategically, and in constant dialogue with a predictive model of what comes next.

The scaling curve is not going to bend itself. The blueprint is sitting in the neuroscience literature, tested by 600 million years of R&D on the hardest optimisation problem in the known universe.

Use it.
