The Continuous Thought Machine (CTM), proposed by Darlow et al. (2025), arrives with an unusual premise: that the missing ingredient in deep learning isn't more data, more parameters, or more layers — it's time itself. Not time as an input dimension, but time as the medium in which computation unfolds, measured and synchronised at the level of individual neurons.
It's a bold claim. Whether it holds up under scrutiny depends on which questions you think matter most.
The Core Idea
Most modern architectures treat neurons as stateless relays. A Transformer recalculates attention from scratch at every layer. An LSTM applies the same transition function uniformly across all units. Information flows forward, gets transformed, and exits. Whatever "thinking" happens is an emergent side effect of depth.
CTM inverts this. Each neuron maintains its own internal clock and accumulates a rolling history of its activations. The network processes inputs not in a single forward pass but over a series of self-generated internal "ticks" — iterations of computation that are decoupled from the input sequence. A single image can be mulled over for dozens of cycles, with different neurons falling in and out of coordination as the model explores its internal hypothesis space.
Three design choices define the architecture:
Neuron-level temporal dynamics. Every neuron carries its own activation history, processed through a learned synapse function. This makes neurons heterogeneous by design — each develops its own temporal fingerprint — a sharp contrast to the uniform units in standard architectures.
Self-generated internal timelines. The number of "thought ticks" is not fixed. The model learns to halt when it reaches sufficient certainty, a behaviour that emerges from training with uncertainty-aware loss functions rather than being imposed by a separate halting mechanism.
Synchronisation as representation. This is the most distinctive and contentious feature. Rather than using activation vectors as its latent space, CTM computes pairwise timing correlations between neurons — a synchronisation matrix — and uses this as the principal substrate for downstream computation.
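To make these mechanics concrete, here is a minimal PyTorch sketch of a tick loop with per-neuron history models and a synchronisation readout. It illustrates the three ideas above; it is not the authors' implementation, and the module shapes, sizes, and the centred inner-product statistic are assumptions chosen for brevity.

```python
# Minimal sketch (not the paper's code): per-neuron history models plus a
# synchronisation-matrix latent, run over self-generated internal ticks.
import torch
import torch.nn as nn


class NeuronLevelModel(nn.Module):
    """One small two-layer MLP per neuron, applied to that neuron's own
    pre-activation history (weights are not shared across neurons)."""

    def __init__(self, n_neurons: int, history_len: int, hidden: int = 16):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_neurons, history_len, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(n_neurons, hidden))
        self.w2 = nn.Parameter(torch.randn(n_neurons, hidden) * 0.1)
        self.b2 = nn.Parameter(torch.zeros(n_neurons))

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, n_neurons, history_len) -> post-activations (batch, n_neurons)
        h = torch.relu(torch.einsum("bnm,nmh->bnh", history, self.w1) + self.b1)
        return torch.einsum("bnh,nh->bn", h, self.w2) + self.b2


def synchronisation_matrix(post_trace: torch.Tensor) -> torch.Tensor:
    # post_trace: (batch, n_neurons, ticks). Pairwise inner products of the
    # centred activation traces give a (batch, N, N) "synchronisation" latent.
    z = post_trace - post_trace.mean(dim=-1, keepdim=True)
    return torch.einsum("bnt,bmt->bnm", z, z) / post_trace.shape[-1]


if __name__ == "__main__":
    batch, n_neurons, history_len, ticks = 2, 64, 8, 12
    neuron_models = NeuronLevelModel(n_neurons, history_len)
    pre_history = torch.zeros(batch, n_neurons, history_len)  # rolling window
    post_trace = []

    for _ in range(ticks):
        # In the full model the new pre-activations would come from a learned
        # synapse model mixing neurons and attended input; random here.
        new_pre = torch.randn(batch, n_neurons)
        pre_history = torch.cat([pre_history[:, :, 1:], new_pre.unsqueeze(-1)], dim=-1)
        post_trace.append(neuron_models(pre_history))  # each neuron reads only its own history

    sync = synchronisation_matrix(torch.stack(post_trace, dim=-1))
    print(sync.shape)  # torch.Size([2, 64, 64]): the latent handed to the readout
```

The point of the sketch is the shape of the computation: each neuron reads only its own history through its own weights, and the object handed to the readout is an N by N matrix of trace correlations rather than a single activation vector.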
What It Gets Right
The strongest case for CTM is not any single benchmark result but what it reveals about an under-explored axis of neural network design.
Deep learning has spent a decade betting almost exclusively on rate coding — the magnitude of activations as the carrier of information. CTM tests the complementary hypothesis: that temporal correlation structure between units carries representational content that activations alone miss. This isn't an appeal to biological metaphor. It's an empirical question, and the ablation studies suggest the answer is yes — stripping out the synchronisation component degrades performance meaningfully, not marginally.
The adaptive computation results are genuinely interesting. While the idea of "thinking longer on harder problems" isn't new — Graves' Adaptive Computation Time (2016) and PonderNet (Banino et al., 2021) explored learned halting — CTM differs in what happens during the extra computation. ACT and PonderNet add a stopping mechanism to otherwise standard architectures. CTM builds temporal dynamics into the architecture's foundations and gets halting as an emergent property. The distinction matters: it's the difference between a model that waits longer and a model that develops qualitatively different internal processes as a function of time.
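As a rough illustration of what "thinking longer on harder problems" looks like at inference time, the sketch below loops internal ticks until the predictive distribution is confident enough. The entropy threshold and the model_step interface are assumptions made for illustration; in the paper the halting behaviour emerges from uncertainty-aware training rather than a hard cutoff.

```python
# Hedged sketch of adaptive "think until certain" inference. The threshold
# and step interface are illustrative assumptions, not the paper's rule.
import torch


def run_until_certain(model_step, state, x, max_ticks: int = 50,
                      entropy_threshold: float = 0.1):
    """model_step(state, x) -> (new_state, logits). Loop internal ticks until
    the predictive distribution is confident enough, or the budget runs out."""
    logits, ticks_used = None, 0
    for ticks_used in range(1, max_ticks + 1):
        state, logits = model_step(state, x)
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1).mean()
        if entropy.item() < entropy_threshold:
            break  # certain enough: stop "thinking"
    return logits, ticks_used


if __name__ == "__main__":
    # Dummy stand-in for a CTM-style step: confidence grows with each tick.
    def dummy_step(state, x):
        state = state + 1
        return state, torch.tensor([[float(state), 0.0, 0.0]])

    logits, used = run_until_certain(dummy_step, state=0, x=None)
    print(used, torch.softmax(logits, dim=-1))  # stops after a handful of ticks
```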
The cross-domain breadth is also notable for a debut paper. CTM is validated on image classification (ImageNet-1K), sequential reasoning (sorting, parity), spatial planning (2D mazes), and continual reinforcement learning. Most new architectures launch on a single task type. Showing competence across perception, reasoning, and planning in one paper — with one architecture — is ambitious and, in this case, largely successful.
On ImageNet-1K, CTM matches ResNet-50's top-1 accuracy (76.1%) with 40% fewer parameters and better calibration under distribution shift. On maze tasks, it solves mazes up to four times larger than those seen during training. On sorting and parity, it generalises to longer sequences than the training set, with internal dynamics that reveal step-by-step logic absent from baselines.
Where the Sceptics Have a Point
The benchmarks, while broad, are carefully chosen to favour CTM's inductive biases — and the strongest comparison points are dated. Matching ResNet-50 in 2025 is a sanity check, not a statement. The absence of comparisons against modern vision backbones like ConvNeXt or EfficientNet leaves the vision results feeling like a box-ticking exercise. The reasoning and planning tasks are where CTM shines, but these are also relatively small-scale, discrete problems where iterative refinement is the obvious strategy.
The deeper concern is scalability. Maintaining per-neuron activation histories and computing pairwise synchronisation matrices is O(n²) in neuron count per tick, multiplied across all ticks. The authors acknowledge the computational cost but treat it as an engineering detail. Critics will argue it's an architectural constraint — and the finding that performance degrades steeply when synchronisation history is truncated suggests you can't easily approximate your way around it.
This is the most substantive objection, but it's worth noting that history counsels caution before declaring an architecture dead on cost grounds. Transformers were O(n²) in sequence length and widely considered impractical for long contexts. FlashAttention, sparse attention, linear approximations, and KV caching changed that calculus over several years of engineering effort. Nothing about CTM's design precludes sparse synchronisation, hierarchical neuron grouping, or learned attention over history subsets. But these solutions are hypothetical today, and the burden of proof sits with the architecture's proponents.
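For a sense of what one of those hypothetical mitigations could look like, the sketch below tracks a fixed random subset of neuron pairs instead of the full matrix, cutting the per-tick cost from O(N²) to O(P) in the number of sampled pairs. This is a speculative illustration of the "sparse synchronisation" idea named above, not a validated fix or the authors' method.

```python
# Hypothetical mitigation, sketched for illustration only: track a fixed
# random subset of neuron pairs rather than the full N x N matrix.
import torch


def sample_pairs(n_neurons: int, n_pairs: int):
    # Sampled once at model construction and kept fixed, so the downstream
    # readout always sees the same pair ordering across ticks and batches.
    i = torch.randint(0, n_neurons, (n_pairs,))
    j = torch.randint(0, n_neurons, (n_pairs,))
    return i, j


def sparse_synchronisation(post_trace: torch.Tensor, i: torch.Tensor,
                           j: torch.Tensor) -> torch.Tensor:
    # post_trace: (batch, n_neurons, ticks) -> (batch, n_pairs) features,
    # one correlation-style statistic per sampled neuron pair.
    z = post_trace - post_trace.mean(dim=-1, keepdim=True)
    return (z[:, i, :] * z[:, j, :]).sum(dim=-1) / post_trace.shape[-1]


if __name__ == "__main__":
    i, j = sample_pairs(n_neurons=512, n_pairs=1024)
    features = sparse_synchronisation(torch.randn(2, 512, 12), i, j)
    print(features.shape)  # torch.Size([2, 1024]) instead of [2, 512, 512]
```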
The interpretability story also deserves scrutiny. CTM's synchronisation matrices can be visualised over time, revealing what look like internal plans — in maze-solving, you can watch the model explore dead ends and backtrack, step by step. This is genuinely striking. But it's important to distinguish between emergent interpretability on toy problems and scalable interpretability on real-world tasks. A deep recurrent network spontaneously developing visible planning behaviour is a qualitatively different finding from post-hoc probing — the network wasn't optimised for transparency, yet its temporal structure makes internal processes visible. That's noteworthy. Whether it survives contact with problems you can't visualise on a 2D grid is an open question the paper doesn't address.
The Neuroscience Question
CTM draws inspiration from neural synchronisation — the finding that biological neurons coordinate through precise timing relationships, not just firing rates. Critics will note that CTM's "synchronisation matrices" are pairwise correlation statistics, not oscillatory dynamics. There's no gamma binding, no theta-gamma coupling, no phase precession. Calling this "synchronisation" borrows neuroscience prestige without mechanistic fidelity.
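The distinction the critics are drawing can be made precise. A correlation statistic over two activation traces and a phase-locking measure over their oscillatory components answer different questions; the sketch below, an illustration on synthetic signals rather than code from the paper, computes both.

```python
# Illustration of the distinction above: correlation of activation traces
# (CTM-style "synchronisation") versus phase locking (oscillatory synchrony).
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
a = np.sin(2 * np.pi * 40 * t) + 0.3 * rng.standard_normal(t.size)        # noisy 40 Hz signal
b = np.sin(2 * np.pi * 40 * t + 0.5) + 0.3 * rng.standard_normal(t.size)  # phase-shifted copy

# What a CTM-style synchronisation entry amounts to: a correlation of traces.
corr = np.corrcoef(a, b)[0, 1]

# What oscillation studies typically measure: consistency of the phase difference.
phase_a, phase_b = np.angle(hilbert(a)), np.angle(hilbert(b))
plv = np.abs(np.mean(np.exp(1j * (phase_a - phase_b))))

print(f"trace correlation = {corr:.2f}, phase-locking value = {plv:.2f}")
```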
The counter-argument is that this criticism confuses faithful simulation with useful abstraction. Backpropagation bears no resemblance to synaptic plasticity, yet it captured something essential about credit assignment that has powered decades of neural network training. CTM doesn't need to reproduce cortical oscillations. It needs to test whether temporal correlation structure between artificial neurons adds computational value beyond what activations provide. The ablations suggest it does.
There's a more subtle critique, though: biological brains use both rate coding and temporal coding. The entire deep learning canon has bet on rate coding; CTM bets on temporal coding. Neither alone accounts for how cortex actually computes. A future architecture might benefit from combining both — persistent activation states enriched with synchronisation structure. But isolating and testing the temporal hypothesis in its pure form is a necessary first step, and that's what CTM provides.

Implications — With Appropriate Restraint
Every architecture paper since 2017 has included a paragraph gesturing at AGI. CTM's is more restrained than most, explicitly stating that it is not AGI, while suggesting that flexible internal simulation, adaptive computation, and interpretable reasoning over time might be important design principles for more general systems.
This is speculative, and honestly assessed, unfalsifiable at present. But the demonstrated properties — a single architecture handling perception, reasoning, and planning; the ability to generalise beyond training distribution on structural tasks; emergent interpretability without explicit optimisation — are concrete capabilities, not just narrative framing. The question is whether these properties are artefacts of small scale or genuine signatures of a useful computational paradigm.
The Verdict
CTM is best understood not as a competitor to Transformers but as an exploration of a largely abandoned design axis. It asks: what happens if you give neurons individual temporal identities, let the network pace its own computation, and use timing relationships — not just activation magnitudes — as the substrate for representation?
The answer, so far, is promising but provisional. The architecture produces adaptive computation, cross-domain generalisation, and emergent internal plans that are visible and interpretable. It does so at significant computational cost, on tasks that are small by modern standards, with benchmark comparisons that don't always reach the current frontier.
Dismissing CTM because its first implementation is expensive and its benchmarks are modest would repeat a familiar mistake — the same logic would have killed Transformers in 2017. But accepting CTM's framing uncritically, particularly the neuroscience motivation and AGI implications, would be equally premature.
The most honest assessment is this: CTM has demonstrated that temporal dynamics and neural synchronisation are not just biological curiosities but computationally productive principles in artificial systems. Whether those principles can be made practical at scale is an engineering question that remains entirely open. The interesting bet is not on CTM as a product, but on time itself as a first-class citizen in neural network design. That bet, at least, deserves to play out.