Allen AI introduces DiScoFormer: unified transformer for density estimation and score functions
Allen AI presents DiScoFormer, a single transformer architecture capable of jointly estimating probability densities and score functions across different distributions. The work is published on the Hugging Face blog, suggesting an accompanying model or code release. Unifying density and score estimation in one model has implications for generative modeling, diffusion models, and probabilistic inference.
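For readers less familiar with the terminology: the score function of a density p is the gradient of its log-density, d/dx log p(x). The sketch below is not DiScoFormer and says nothing about how it is built. It only checks the density/score relationship numerically for a 1-D Gaussian, where both have closed forms. All function names are illustrative assumptions.

```python
import math

def gaussian_log_density(x, mu=0.0, sigma=1.0):
    """Log of the 1-D normal density N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def gaussian_score(x, mu=0.0, sigma=1.0):
    """Closed-form score of the Gaussian: d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def numerical_score(log_density, x, eps=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_density(x + eps) - log_density(x - eps)) / (2 * eps)

# Compare the analytic score with the finite-difference estimate at a few points.
for x in (-2.0, 0.5, 3.0):
    print(x, gaussian_score(x), numerical_score(gaussian_log_density, x))
```

The two columns should agree closely. A model that estimates both quantities would be expected to stay consistent in the same way.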
Related events (8)
AllenAI analysis: which tokens do hybrid models predict better than pure transformers?
A Hugging Face blog post from AllenAI investigates token-level prediction differences between hybrid models (which combine attention with state-space or other mechanisms) and standard transformer architectures. The analysis aims to characterize where hybrid architectures gain or lose predictive advantage at the token level. This kind of mechanistic comparison is relevant to ongoing debates about when hybrid designs are worth their added complexity.
DiT-Reward converts text-to-image diffusion transformers into reward models, outperforming HPSv3
DiT-Reward is a new reward modeling approach that repurposes pretrained text-to-image Diffusion Transformers (DiTs) by processing near-clean image latents and aggregating text-conditioned representations across transformer layers. Under matched training data, it outperforms HPSv3 on all four evaluated preference benchmarks, reaching 85.6% on HPDv2 and 77.6% on HPDv3. When used to optimize Stable Diffusion 3.5 Large via Flow-GRPO, it shows clear gains in realism and achieves a 1.65× inference speedup over HPSv3. The work demonstrates that generative DiT representations transfer meaningfully to reward modeling and policy optimization.
Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers
A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
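To make the contrast between static and input-dependent filters concrete, here is a minimal single-channel sketch. It is not the preprint's parameterization, which the summary above does not specify: the filter generator `toy_filter_fn`, its weights, and the scalar-sequence setup are all assumptions chosen for illustration.

```python
def static_short_conv(x, w):
    """Causal short convolution with one fixed filter w of width K:
    y[t] = sum_k w[k] * x[t - k], skipping positions before the sequence start."""
    K = len(w)
    return [sum(w[k] * x[t - k] for k in range(K) if t - k >= 0)
            for t in range(len(x))]

def dynamic_short_conv(x, filter_fn, K):
    """Causal short convolution whose width-K filter is recomputed at every
    position from the current input x[t]."""
    out = []
    for t in range(len(x)):
        w_t = filter_fn(x[t])  # input-dependent filter for position t
        assert len(w_t) == K
        out.append(sum(w_t[k] * x[t - k] for k in range(K) if t - k >= 0))
    return out

def toy_filter_fn(x_t, base=(0.5, 0.3, 0.2), gain=(0.1, -0.1, 0.0)):
    """Hypothetical filter generator: a base filter plus an offset linear in x[t]."""
    return [b + g * x_t for b, g in zip(base, gain)]

x = [1.0, 2.0, 3.0]
print(static_short_conv(x, [0.5, 0.3, 0.2]))
print(dynamic_short_conv(x, toy_filter_fn, K=3))
```

When `filter_fn` ignores its input, the dynamic version reduces to the static one, so the static convolution is a special case of the dynamic one in this sketch.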
Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny
Hugging Face has open-sourced knowledge distillation code and model weights for two compressed variants of Stable Diffusion: SD-Small and SD-Tiny. These distilled models are smaller and faster than the original Stable Diffusion, targeting inference efficiency. The release includes both the trained weights and the distillation training code, enabling the community to reproduce or extend the work.
Decoupled DiLoCo: A new frontier for resilient, distributed AI training
DeepMind has published a blog post introducing Decoupled DiLoCo, a new approach to distributed AI training designed for resilience across heterogeneous or unreliable compute environments. The method appears to extend the original DiLoCo (Distributed Low-Communication) training framework, which enables training across loosely connected compute nodes with infrequent synchronization. The announcement signals continued investment in infrastructure techniques that reduce communication overhead and improve fault tolerance in large-scale model training.
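The idea of infrequent synchronization can be illustrated with a toy: each worker takes several local gradient steps on its own data, and only the resulting parameter change is communicated and averaged. This sketch deliberately simplifies. It uses plain SGD for both the local and the synchronization step on a 1-D quadratic loss, whereas the original DiLoCo uses AdamW as the inner optimizer and Nesterov momentum as the outer one. It does not model whatever Decoupled DiLoCo changes, and all names and hyperparameters are assumptions.

```python
def local_steps(theta, target, num_steps, lr):
    """Run num_steps of gradient descent on the worker loss 0.5 * (theta - target)^2."""
    for _ in range(num_steps):
        grad = theta - target
        theta -= lr * grad
    return theta

def low_comm_round(theta, targets, num_steps, inner_lr, outer_lr=1.0):
    """One round: every worker trains locally from the shared theta, then the
    averaged parameter change is applied once (the only communication step)."""
    deltas = [theta - local_steps(theta, t, num_steps, inner_lr) for t in targets]
    avg_delta = sum(deltas) / len(deltas)
    return theta - outer_lr * avg_delta

def train(theta, targets, rounds, num_steps=10, inner_lr=0.1):
    """Repeat communication rounds; workers sync once per num_steps local steps."""
    for _ in range(rounds):
        theta = low_comm_round(theta, targets, num_steps, inner_lr)
    return theta

# Two workers whose local optima differ; the shared parameter settles between them.
print(train(0.0, targets=[1.0, 3.0], rounds=50))
```

With ten local steps per round, the workers communicate ten times less often than they would under per-step gradient averaging, which is the communication saving this family of methods targets.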
Transformers v5: Simple model definitions powering the AI ecosystem
Hugging Face has announced Transformers v5, a major version update to its flagship open-source library. The release focuses on simplified model definitions and architectural improvements to the codebase. Because Transformers is one of the most widely used ML libraries, the update has broad implications for researchers and practitioners who build on it.
Variable-Width Transformers: X-shaped architecture outperforms uniform-width baselines with 22% fewer FLOPs
Researchers propose the ><former (X-shaped transformer), a decoder-only architecture that uses wider early and late layers with narrower middle layers, implemented via a parameter-free residual resizing mechanism. Evaluated on models from 200M to 2B dense parameters and 3B MoE, the architecture consistently outperforms parameter-matched uniform-width baselines on language modeling loss. The design yields a 22% reduction in FLOPs and 15% reduction in KV cache memory under fitted scaling curves, suggesting nonuniform width allocation is a viable path to more compute-efficient language models.
DistIL: Distributional DAgger for RL from Rich Feedback beyond single-bit rewards
A new arXiv preprint introduces DistIL, a distributional variant of the DAgger imitation learning algorithm designed to exploit rich feedback signals (execution traces, tool outputs, expert corrections) rather than the single-bit correctness reward used in standard RLVR. The method uses a forward cross-entropy objective that provides monotonic policy improvement guarantees, unlike reverse KL or Jensen-Shannon divergence objectives used in prior self-distillation approaches. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and hard math benchmarks.
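The distinction between a forward cross-entropy objective and a reverse KL objective comes down to which distribution the expectation is taken under. The sketch below computes both for small discrete distributions. It illustrates the divergences only and is not the DistIL algorithm: the example distributions are made up and the function names are assumptions.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log q_i, with the expectation taken under p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

expert = [0.45, 0.45, 0.10]   # hypothetical expert distribution over three actions
policy = [0.80, 0.10, 0.10]   # hypothetical learner policy

forward_kl = kl(expert, policy)   # expectation under the expert
reverse_kl = kl(policy, expert)   # expectation under the learner
print("forward KL:", forward_kl)
print("reverse KL:", reverse_kl)

# Forward cross-entropy differs from forward KL only by the expert's entropy,
# which is constant in the policy, so minimizing one minimizes the other.
print(cross_entropy(expert, policy) - cross_entropy(expert, expert))
```

The last printed value matches the forward KL, which is why a forward cross-entropy loss against expert targets is the same optimization problem as forward KL minimization.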


