7M Parameters Crushing 100B Models: Two Recursive Reasoning Papers Shake the Scaling Faith
Source: Y Combinator | Published: 2026-05-01T14:00:32Z
A 7-million-parameter recursive model trained from scratch hits ~45% accuracy on ARC-AGI-1, outperforming hundred-billion-parameter reasoning models like DeepSeek-R1 and o3-mini.
A 7-million-parameter model scored roughly 45% on ARC-AGI-1, outperforming hundred-billion-parameter reasoning models like DeepSeek-R1, o3-mini, and Gemini 2.5 Pro. Not long ago, large models like GPT-3 and GPT-4 scored near zero on the same benchmark with direct inference. This isn't clickbait — two 2025 papers on recursive reasoning are challenging the foundational assumption that bigger models are smarter.
The Hard Limit Exposed by Sorting Algorithms
Comparison sorting has a theoretical lower bound: on the order of n log n comparisons (Ω(n log n)). If a Transformer has only 30 layers — and hence, loosely, 30 sequential computation steps per forward pass — it cannot complete the comparisons needed to sort even a 31-element list in a single pass: there simply aren't enough steps. Sudoku, mazes, rolling sums — these are all "incompressible" problems: you can't skip the intermediate steps and guess the answer directly.
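The information-theoretic floor is easy to compute directly. A minimal sketch (plain Python, illustrative only):

```python
import math

def min_comparisons(n):
    # A comparison sort must distinguish all n! possible orderings, and
    # each comparison yields at most one bit of information, so at least
    # ceil(log2(n!)) ~ n*log2(n) comparisons are required in the worst case.
    return math.ceil(math.log2(math.factorial(n)))

print(min_comparisons(31))  # -> 113, far more than 30 sequential steps
```

For 31 elements the floor is 113 worst-case comparisons — well beyond what 30 sequential steps can carry out.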
LLMs currently have two workarounds for this limitation: chain of thought and tool calling. The former lets the model write intermediate steps out as tokens, then read them back to continue reasoning. The latter is simpler — just call Python's built-in `sorted()`. But both approaches hit the same ceiling: they're bounded by existing human knowledge. If humans only knew bubble sort, a chain-of-thought-trained model would only know bubble sort — and poorly at that. Tool calling is even more constrained: you have to know the tool exists in the first place.
This is the key distinction that François Chollet, Ndea co-founder and creator of the ARC-AGI benchmark, and the host repeatedly emphasized: LLM "recursion" happens in discrete token space, not in the model's internal continuous latent space. Token space is low-dimensional with limited expressiveness — like having a very long strip of tape but only being able to divide it into 10 buckets.
RNNs: The Old Dream and Its Old Wounds
Recurrent neural networks inherently possess this internal recursive capability — the same set of weights applied repeatedly to the input, with intermediate state compressed into a hidden vector and no need to retain the full context. In the mid-2010s, Alex Graves's work on Neural Turing Machines (2014) and adaptive computation time (2016) represented the peak of this approach.
But RNNs had a fatal flaw: backpropagation through time (BPTT). Training requires unrolling the recurrence and backpropagating through every step, which means retaining the activations of every step. When sequences reach millions or billions of steps, that becomes untenable — as if your brain needed millions of copies of itself to perform a single gradient update. Gradients vanish or explode, noise accumulates, and the model simply can't learn.
Transformers use a triangular causal mask to process all timesteps in parallel within a single forward pass, sidestepping BPTT's long gradient chains along the time dimension. The tradeoff: giving up compression over time — every time a new token is decoded, the entire context history must be retained.
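The mask itself is trivial — a lower-triangular matrix. A minimal sketch (plain Python, no framework):

```python
def causal_mask(t):
    # Position i may attend only to positions j <= i, so all t timesteps
    # can be trained in one parallel forward pass; the price is that the
    # whole context must be kept around at decode time.
    return [[1 if j <= i else 0 for j in range(t)] for i in range(t)]

# causal_mask(3) -> [[1, 0, 0],
#                    [1, 1, 0],
#                    [1, 1, 1]]
```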
HRM: Three Layers of Recursion and a Counterintuitive Training Trick
The Hierarchical Reasoning Model (HRM) directly inherits the RNN lineage but uses a clever method to bypass BPTT's limitations.
Architecturally, it features three nested layers of recursion: a low-level module (L-net) that recurses TL times, a high-level module (H-net) that recurses TH times, and an outer layer with N refinement steps. Think of it as three levels of nested function calls — the inner function has its own local variables ZL (hidden state), and when it finishes, it passes results to the outer function, which updates its own state ZH and feeds the updated context back to the inner function, and so on.
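The three nested loops can be sketched as control flow (hypothetical `l_step`/`h_step` stand in for the trained L-net and H-net; the loop structure follows the paper's description, the function signatures are assumptions):

```python
def hrm_forward(x, z_l, z_h, l_step, h_step, T_L=2, T_H=2, N=16):
    # Three nested recursion levels: N outer refinement steps, T_H
    # high-level updates per refinement, T_L low-level updates per
    # high-level update. The same weights are reused at every step.
    for _ in range(N):               # outer refinement loop
        for _ in range(T_H):         # high-level module (H-net)
            for _ in range(T_L):     # low-level module (L-net)
                z_l = l_step(x, z_l, z_h)
            z_h = h_step(z_l, z_h)   # fold local work into the high-level state
    return z_l, z_h
```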
The key innovation is in the training method. Alex Graves's earlier approach propagated gradients through every recursive step — hence the BPTT bottleneck. HRM instead borrows from deep equilibrium models (DEQs): gradients are backpropagated through only one step, then truncated. But the hidden states ZL and ZH are not reset — the model reruns the same batch of inputs using the updated hidden states.
In computer vision, this sounds absurd: running the same batch of data 16 times? But it's mathematically sound, because while input X stays the same each time, the hidden states occupy different positions in latent space — essentially constructing mini-batches over hidden-state space rather than data space.
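A conceptual sketch of that training scheme (plain Python; `detach` is a stand-in for a framework's gradient-stopping op, and `step_fn`/`opt_step` are hypothetical placeholders):

```python
def detach(z):
    # Stand-in for a tensor detach: keep the value, stop the gradient.
    return z

def train_on_batch(X, z_l, z_h, step_fn, opt_step, reruns=16):
    # Rerun the SAME batch X `reruns` times. Each rerun starts from the
    # hidden states the previous rerun produced, so the one-step gradient
    # sees a different point in latent space every time -- a mini-batch
    # over hidden-state space rather than data space.
    for _ in range(reruns):
        z_l, z_h, loss = step_fn(X, z_l, z_h)  # one recursion segment
        opt_step(loss)                         # backprop through this segment only
        z_l, z_h = detach(z_l), detach(z_h)    # truncate BPTT here
    return z_l, z_h
```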
The result: a 27-million-parameter model, trained from scratch (no pretraining), using only about a thousand ARC Prize puzzles, achieved roughly 41% accuracy on the ARC-AGI-1 public evaluation set — surpassing mainstream reasoning models like o3-mini-high (34.5%) and Claude 3.7 (21.2%).
TRM: Remove 75%, Get Better Performance
Tiny Recursive Model (TRM) author Alexia Jolicoeur-Martineau performed a radical simplification of HRM.
She merged the two separate networks (L-net and H-net) into a single shared-weight network and reduced the four-layer Transformer to just one layer. Parameter count dropped from 27 million to 7 million — roughly a 4x reduction. But the two hidden states ZL and ZH remain independent, because they serve different roles: ZL is local working memory, repeatedly overwritten and updated; ZH is the latent representation of the candidate answer, just one MLP mapping away from the final output.
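A loose sketch of the resulting loop (single shared `net`, two states; the signatures and exact update order are assumptions for illustration, not the paper's equations):

```python
def trm_cycle(x, z_l, z_h, net, out_head, inner=6):
    # One shared-weight network updates both states: z_l is scratch
    # memory, repeatedly overwritten; z_h is the latent candidate answer.
    for _ in range(inner):
        z_l = net(x, z_l, z_h)       # local reasoning on the scratchpad
    z_h = net(None, z_l, z_h)        # fold the scratchpad into the answer state
    return z_l, z_h, out_head(z_h)   # one MLP away from the final output
```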
The training difference is more subtle: HRM truncates gradients after a single module call, while TRM lets gradients flow through one complete recursive cycle before truncating, so the model can learn from signals deeper in the recursion. Jolicoeur-Martineau also found that the fixed-point convergence assumption the HRM paper uses to theoretically justify truncated training — that the changes in ZL and ZH approach zero — doesn't actually hold. The model works, but not for the reason that mathematical argument claims.
The result: 7 million parameters, ~45% on ARC-AGI-1, and 87.4% on expert-level Sudoku. Four times smaller, yet stronger.
Sudoku as a Window into the EM Algorithm
Sudoku makes this recursive mechanism intuitive to understand.
Sudoku is a classic incompressible problem — you can't guess all cells at once; you can only deduce one or two cells at a time based on available information. ZL plays the role of scratch paper: try this, try that, do some local reasoning. Once it accumulates enough information, it passes results to ZH — which acts as your best guess of the board's current state, filling in confirmed numbers. Now the board has more known information, and ZL continues reasoning from the updated state.
This process closely mirrors the Expectation-Maximization (EM) algorithm: alternately updating hidden memory states (E-step) and candidate outputs (M-step), each conditioned on the other, iterating toward convergence. And the entire process happens in continuous latent space — no chain of thought needed, no encoding of intermediate reasoning steps as discrete tokens.
The model discovered its own solving strategy — no human-annotated reasoning traces, no chain-of-thought prompting.
The Limited Value of Biological Inspiration
One inspiration for the HRM paper came from neural oscillations at different frequencies in the brain — certain regions operate at high frequency to process low-level information, while others operate at low frequency for high-level abstraction. This hierarchical frequency structure mapped onto the L-net and H-net design.
But history shows that biological plausibility is valuable as inspiration yet unreliable as a design constraint. AlexNet's "local response normalization," which mimicked lateral inhibition between real neurons, turned out to be dispensable: VGG dropped it entirely, went deeper with stacked 3×3 convolutions instead, and dramatically outperformed AlexNet.
The more compelling argument comes from computation theory: a Turing machine needs a readable and writable tape for universal computation. Radix sort can break through the n log n comparison sorting lower bound precisely because of extra memory buckets. The hidden states in these recursive models are essentially that tape, those memory buckets — giving the model trainable external memory within a single forward pass.
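Radix sort makes the point concrete — it never compares two elements, trading comparisons for writable bucket memory (sketch assumes non-negative integers):

```python
def radix_sort(nums, base=10):
    # Non-comparison sort: the buckets are writable memory, playing the
    # role of the Turing machine's tape. Runtime is O(d * (n + base)) for
    # d digits -- below the Omega(n log n) comparison bound.
    if not nums:
        return nums
    exp, max_val = 1, max(nums)
    while max_val // exp > 0:
        buckets = [[] for _ in range(base)]
        for n in nums:
            buckets[(n // exp) % base].append(n)  # stable distribution pass
        nums = [n for bucket in buckets for n in bucket]
        exp *= base
    return nums
```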
The Outer Refinement Loop Is the Real Hero
Konstantin Schürholt, a researcher at François Chollet's company, ran detailed scaling ablation experiments on HRM. The conclusion was clear: among the three layers of recursion, the outer refinement loop contributes the most. Setting the low-level and high-level recursion steps to 2 is entirely sufficient; increasing them further yields virtually no gains.
Even more counterintuitively: the number of recursive steps during training matters, but the number during testing doesn't matter nearly as much. A model trained with 16 refinement steps captures most of its performance running just 1 step at test time. This suggests the core value of recursive training isn't giving the model more "thinking time" — it's shaping the model's internal representations. The training process teaches the model how to efficiently organize information in latent space.
Small Model + Large Model = ?
The 7-million-parameter model outperforming hundred-billion-parameter reasoning models on ARC-AGI-1 comes with an important caveat: both TRM and HRM are task-specific models. A model trained on Sudoku can't do ARC Prize puzzles — it must be trained separately. LLMs' advantage lies precisely in generality — next-token prediction gives them remarkable cross-task generalization.
But LLMs' strengths and recursive models' strengths are complementary. LLMs excel at learning high-quality embedding spaces — mapping tokens and pixels into latent spaces with clean semantic separation. Recursive models excel at reasoning within latent spaces. A natural combination: use large models to build the representation space, then deploy small recursive reasoning modules inside it.
François believes this combination may already be partially present in models like Gemini. But even TRM remains constrained by BPTT — its gradients flow through only one recursive cycle. If that training bottleneck can be broken at large-model scale while preserving the reasoning depth recursion provides, an observation Melanie Mitchell made in her book becomes even more compelling: scaling up is sufficient but not necessary for improving performance, and adding recursion is likewise sufficient but not necessary — the possibilities of combining the two remain largely unexplored.