Training-Free Looped Transformers: Inference-Time Recurrence via ODE-Motivated Layer Reapplication
The paper introduces a method to retrofit recurrence onto frozen pretrained transformer checkpoints at inference time by looping a contiguous mid-stack block of layers without any fine-tuning or architectural changes. Naive block reapplication degrades performance, so the authors motivate their approach by treating pre-norm transformer blocks as forward Euler ODE steps and replacing one large update with smaller damped sub-steps. Evaluated across seven model families including dense, sparse MoE, and MLA+MoE architectures, the method yields consistent benchmark improvements (e.g., +2.64 pp on MMLU-Pro for Qwen3-4B-Instruct) at no training cost.
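The forward-Euler reading above can be illustrated with a toy sketch. The summary does not give the paper's actual step sizes or damping schedule, so everything here is an assumption: `toy_block` is a hypothetical stand-in for the residual branch of a pre-norm block, and the uniform step size `1 / n_substeps` is just the simplest way to split one full update into smaller ones.

```python
import math

def toy_block(x):
    """Hypothetical stand-in for the residual branch f(x) of a pre-norm block."""
    # RMS-normalize, then apply a fixed nonlinearity; not a real transformer layer.
    norm = math.sqrt(sum(v * v for v in x) / len(x)) or 1.0
    return [math.tanh(v / norm) - 0.5 * v for v in x]

def euler_step(x, f, h=1.0):
    """One forward Euler step x + h * f(x); h = 1 is the usual residual update."""
    return [xi + h * fi for xi, fi in zip(x, f(x))]

def looped_block(x, f, n_substeps):
    """Reapply the same block n_substeps times with step size 1 / n_substeps (assumed)."""
    h = 1.0 / n_substeps
    for _ in range(n_substeps):
        x = euler_step(x, f, h)
    return x

x0 = [1.0, -2.0, 0.5]
single = euler_step(x0, toy_block)                        # standard single pass
naive = euler_step(euler_step(x0, toy_block), toy_block)  # undamped reapplication
damped = looped_block(x0, toy_block, 2)                   # two half-size sub-steps
print(single, naive, damped)
```

With `n_substeps=1` the loop reduces to the ordinary residual update, so the frozen model's behavior is the baseline case. Undamped reapplication takes two full-size steps and can overshoot, which is one way to read the summary's remark that naive looping degrades performance.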
Related events (8)
Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs
Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.
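The two sizes quoted in the summary (an 8,000-token attention window and 1,000-token update chunks) can be made concrete with a small sketch. The function names are hypothetical, and whether the window counts the current token is an assumption; the weight update itself is not shown.

```python
def can_attend(query_pos, key_pos, window=8000):
    """Causal sliding-window rule: a query sees itself and the window - 1 tokens before it."""
    return 0 <= query_pos - key_pos < window

def chunk_spans(n_tokens, chunk_size=1000):
    """Split a context into consecutive (start, end) spans, end exclusive."""
    return [(start, min(start + chunk_size, n_tokens))
            for start in range(0, n_tokens, chunk_size)]

spans = chunk_spans(128_000)
print(len(spans))            # number of chunks, each triggering one weight update
print(can_attend(7_999, 0))  # key is still inside the window
print(can_attend(8_000, 0))  # key has fallen outside the window
```

Under this rule a token more than 8,000 positions back is invisible to attention, so anything recalled from it has to come through the updated weights. That is consistent with the reported retrieval failures beyond 8,000 tokens.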
Looped Diffusion Language Models (LoopMDM): Depth Scaling via Layer Looping
LoopMDM introduces selective looping of early-middle transformer layers in masked diffusion language models, achieving a depth-scaling effect without adding parameters. The approach matches same-size MDM performance with up to 3.3× fewer training FLOPs and outperforms deeper non-looped MDMs on reasoning benchmarks, including up to 8.5 points improvement on GSM8K. Inference-time compute scaling is enabled by varying loop counts, with adaptive loop scheduling providing additional efficiency gains. Attention analysis suggests looping works by promoting interactions among masked token positions.
Hyperfitting Explained: Terminal Geometric Expansion in Final Transformer Layers Drives Diversity Gains
This paper investigates the 'hyperfitting' phenomenon—where fine-tuning LLMs to near-zero loss on small datasets improves open-ended generation and reduces repetition—and demonstrates it is mechanistically distinct from temperature scaling. Entropy-matched control experiments falsify both the temperature-equivalence and static vocabulary reweighting hypotheses, instead localizing the effect to a 'Terminal Expansion' in the final transformer block where feature-space dimensionality expands by ~80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. The authors introduce Late-Stage LoRA, a targeted fine-tuning strategy updating only the final 5 layers, achieving robust generation with minimal parameter updates.
Supervised Memory Training enables parallel RNN pretraining without backpropagation through time
A new arXiv preprint proposes Supervised Memory Training (SMT), a method that trains recurrent neural networks by reducing the problem to supervised learning on one-step memory transitions, bypassing backpropagation through time entirely. A Transformer-based encoder generates memory labels via a predictive state objective, enabling time-parallel training with O(1) gradient path length between any two tokens. SMT outperforms BPTT on language modeling and pixel sequence modeling tasks across multiple RNN architectures. The approach could enable RNNs to scale more effectively by decoupling memory content from update mechanics.
Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers
A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
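To show what "input-dependent filters" means in contrast to a static short convolution, here is a minimal scalar sketch. The parameterization (`a[j] * x[t] + b[j]` for tap `j`) is a hypothetical choice made for illustration; the summary does not say how the preprint computes its filters, and real implementations operate on vectors with fused kernels.

```python
def static_short_conv(x, w):
    """Causal short convolution with fixed taps w[0..k-1]."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(len(w)):
            if t - j >= 0:
                acc += w[j] * x[t - j]
        out.append(acc)
    return out

def dynamic_short_conv(x, a, b):
    """Causal short convolution whose taps depend on the current input x[t].

    Tap j at position t is a[j] * x[t] + b[j] (assumed parameterization).
    """
    out = []
    for t in range(len(x)):
        taps = [a[j] * x[t] + b[j] for j in range(len(a))]
        out.append(sum(taps[j] * x[t - j] for j in range(len(a)) if t - j >= 0))
    return out

x = [1.0, 2.0, 3.0]
print(static_short_conv(x, [0.5, 0.25]))
print(dynamic_short_conv(x, [0.0, 0.0], [0.5, 0.25]))  # a = 0 recovers the static case
print(dynamic_short_conv(x, [0.1, 0.0], [0.5, 0.25]))  # first tap now varies with x[t]
```

Setting `a` to zero recovers the static convolution, so in this sketch the dynamic variant is a strict generalization: it keeps the same locality bias (only the last `k` positions are mixed) while letting the mixing weights change per token.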
Looped World Models introduce iterative latent depth as a new scaling axis for world simulation
A new arXiv preprint introduces Looped World Models (LoopWM), a parameter-shared transformer architecture that iteratively refines latent environment states to achieve up to 100x parameter efficiency over conventional world models. The approach uses adaptive computation to scale depth dynamically per prediction step, addressing the tension between long-horizon simulation fidelity and deployment cost. The authors position iterative latent depth as a new scaling axis orthogonal to model size and training data.
Local linear structures in LLM weights and activations are dynamic, not fixed global directions
A new arXiv paper investigates the nature of linear structures in transformer weights and activations, finding strong local low-rank task-gradient structure but rejecting the hypothesis that fixed task planes exist. The authors show that useful bases drift substantially within 100 optimization steps, yet early recovery updates form a trajectory-prefix basis capturing 77% of LoRA recovery displacement. They also establish a formal connection between parameter perturbations and activation steering, finding a 0.58 cosine similarity between gradient-step-induced activation shifts and CAA steering vectors, suggesting linear structures are evolving local geometries rather than stable global task directions.
Variable-Width Transformers: X-shaped architecture outperforms uniform-width baselines with 22% fewer FLOPs
Researchers propose the ><former (X-shaped transformer), a decoder-only architecture that uses wider early and late layers with narrower middle layers, implemented via a parameter-free residual resizing mechanism. Evaluated on models from 200M to 2B dense parameters and 3B MoE, the architecture consistently outperforms parameter-matched uniform-width baselines on language modeling loss. The design yields a 22% reduction in FLOPs and 15% reduction in KV cache memory under fitted scaling curves, suggesting nonuniform width allocation is a viable path to more compute-efficient language models.


