5 · Hugging Face Blog · 2h ago

Allen AI introduces DiScoFormer: unified transformer for density estimation and score functions

Allen AI presents DiScoFormer, a single transformer architecture that jointly estimates probability densities and score functions across different distributions. The work is published on the Hugging Face blog, which suggests an accompanying model or code release. Estimating both quantities with one model is relevant to generative modeling, diffusion models, and probabilistic inference.
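
For readers new to the terminology, the score function is the gradient of the log-density with respect to the input. The definition below is standard; the Gaussian case is only an illustration chosen here and is not taken from the blog post.

```latex
s(x) = \nabla_x \log p(x)
\qquad \text{e.g. for } p(x) = \mathcal{N}(x;\, \mu, \sigma^2):
\qquad
\log p(x) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log\!\left(2\pi\sigma^2\right),
\qquad
s(x) = -\frac{x-\mu}{\sigma^2}.
```

The normalizing constant drops out of the score, which is why the two quantities are often estimated by separate methods.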

Frontier Model Releases · Evaluation and Benchmarking · DiScoFormer · Hugging Face · Allen Institute for AI

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action


Hugging Face

Hugging Face: The Home of Open-Source AI


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · Hugging Face Blog · 4d ago

AllenAI analysis: which tokens do hybrid models predict better than pure transformers?

A Hugging Face blog post from AllenAI investigates the token-level prediction differences between hybrid models (combining attention and state-space or other mechanisms) and standard transformer architectures. The analysis aims to characterize where hybrid architectures gain or lose predictive advantage at the token level. This kind of mechanistic comparison is relevant to ongoing debates about when hybrid designs are worth their added complexity.

Frontier Model Releases · Evaluation and Benchmarking · Hugging Face · Allen Institute for AI

6 · arXiv · cs.AI · 6d ago

DiT-Reward converts text-to-image diffusion transformers into reward models, outperforming HPSv3

DiT-Reward is a new reward modeling approach that repurposes pretrained text-to-image Diffusion Transformers (DiTs) by processing near-clean image latents and aggregating text-conditioned representations across transformer layers. Under matched training data, it outperforms HPSv3 on all four evaluated preference benchmarks, reaching 85.6% on HPDv2 and 77.6% on HPDv3. When used to optimize Stable Diffusion 3.5 Large via Flow-GRPO, it shows clear gains in realism and achieves a 1.65x inference speedup over HPSv3. The work demonstrates that generative DiT representations transfer meaningfully to reward modeling and policy optimization.

Evaluation and Benchmarking · Alignment and RLHF · Flow-GRPO · HPSv3 · HPDv2 · +4 more

6 · arXiv · cs.CL · 26d ago

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.
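
The idea of an input-dependent short filter can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the preprint's implementation: the function name `dynamic_short_conv`, the projection `w_gen`, the softmax normalization of the taps, and sharing one set of taps across channels are all hypothetical choices made here.

```python
import numpy as np

def dynamic_short_conv(x, w_gen, kernel_size=4):
    """Causal short convolution whose taps depend on the input at each position.

    x:     (seq_len, dim) float array of token features
    w_gen: (dim, kernel_size) hypothetical projection that produces the filter taps
    """
    seq_len, dim = x.shape
    # Input-dependent filter: one set of `kernel_size` taps per position.
    logits = x @ w_gen                                   # (seq_len, kernel_size)
    taps = np.exp(logits - logits.max(axis=1, keepdims=True))
    taps /= taps.sum(axis=1, keepdims=True)              # softmax (an assumption)
    # Left-pad with zeros so position t only sees positions t-k+1 .. t.
    padded = np.vstack([np.zeros((kernel_size - 1, dim)), x])
    out = np.zeros_like(x)
    for j in range(kernel_size):
        # padded[j : j + seq_len] is x shifted back by (kernel_size - 1 - j) steps.
        out += taps[:, j:j + 1] * padded[j:j + seq_len]
    return out
```

A static short convolution would use the same taps at every position; here the taps are recomputed from each token, which is the "dynamic" part. The short causal window supplies the locality bias.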

Training Infrastructure · Frontier Model Releases · Triton · Mamba · Gated DeltaNet-2 · +1 more

5 · Hugging Face Blog · 1mo ago

Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny

Hugging Face has open-sourced knowledge distillation code and model weights for two compressed variants of Stable Diffusion: SD-Small and SD-Tiny. These distilled models are smaller and faster than the original Stable Diffusion, targeting inference efficiency. The release includes both the trained weights and the distillation training code, enabling the community to reproduce or extend the work.

Open Weights Progress · Inference Economics · SD-Tiny · knowledge distillation · SD-Small · +3 more

6 · Google DeepMind Blog · 1mo ago

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

DeepMind has published a blog post introducing Decoupled DiLoCo, a new approach to distributed AI training designed for resilience across heterogeneous or unreliable compute environments. The method appears to extend the original DiLoCo (Distributed Low-Communication) training framework, which enables training across loosely connected compute nodes with infrequent synchronization. The announcement signals continued investment in infrastructure techniques that reduce communication overhead and improve fault tolerance in large-scale model training.
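
The "infrequent synchronization" pattern can be sketched as follows: each worker takes many local steps on its own data, and the workers exchange information only once per round. This is a toy sketch, not the DeepMind method. The least-squares loss, the plain gradient steps, the function names, and all hyperparameters are placeholders chosen for illustration, and nothing here models what Decoupled DiLoCo adds.

```python
import numpy as np

def local_steps(theta, data, lr, num_steps):
    """Run `num_steps` of plain gradient descent on a least-squares loss for one worker."""
    X, y = data
    theta = theta.copy()
    for _ in range(num_steps):
        grad = X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad
    return theta

def low_communication_round(theta_global, shards, inner_lr=0.1, inner_steps=20, outer_lr=1.0):
    """One outer round: workers train independently, then synchronize once."""
    # Each worker starts from the shared parameters and does not communicate
    # with the others during its inner steps.
    local = [local_steps(theta_global, shard, inner_lr, inner_steps) for shard in shards]
    # Average how far each worker moved away from the shared parameters.
    delta = np.mean([theta_global - t for t in local], axis=0)
    # The only communication in the round: apply the averaged update.
    return theta_global - outer_lr * delta
```

With `inner_steps=20`, the workers communicate once for every 20 gradient steps instead of after every step, which is where the reduction in communication overhead comes from.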

Training Infrastructure · Inference Economics · DiLoCo · Decoupled DiLoCo · Google DeepMind

7 · Hugging Face Blog · 1mo ago

Transformers v5: Simple model definitions powering the AI ecosystem

Hugging Face has announced Transformers v5, a major version update to its flagship open-source library. The release focuses on simplified model definitions and architectural improvements to the codebase. As one of the most widely used ML libraries in the ecosystem, this update has broad implications for researchers and practitioners building on top of the Transformers framework.

Open Weights Progress · Inference Economics · Transformers · Hugging Face · +1 more

5 · arXiv · cs.CL · 12d ago

Variable-Width Transformers: X-shaped architecture outperforms uniform-width baselines with 22% fewer FLOPs

Researchers propose the ><former (X-shaped transformer), a decoder-only architecture that uses wider early and late layers with narrower middle layers, implemented via a parameter-free residual resizing mechanism. Evaluated on models from 200M to 2B dense parameters and 3B MoE, the architecture consistently outperforms parameter-matched uniform-width baselines on language modeling loss. The design yields a 22% reduction in FLOPs and 15% reduction in KV cache memory under fitted scaling curves, suggesting nonuniform width allocation is a viable path to more compute-efficient language models.

Frontier Model Releases · Inference Economics · Q-Former · Variable-Width Transformers

6 · arXiv · cs.AI · 25d ago

DistIL: Distributional DAgger for RL from Rich Feedback beyond single-bit rewards

A new arXiv preprint introduces DistIL, a distributional variant of the DAgger imitation learning algorithm designed to exploit rich feedback signals (execution traces, tool outputs, expert corrections) rather than the single-bit correctness reward used in standard RLVR. The method uses a forward cross-entropy objective that provides monotonic policy improvement guarantees, unlike reverse KL or Jensen-Shannon divergence objectives used in prior self-distillation approaches. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and hard math benchmarks.
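
The difference between the objectives named above lies in which distribution the expectation is taken under. The sketch below writes both out for a single categorical distribution over actions. The function names and the toy setting are assumptions made here for illustration; this is not the DistIL training loss.

```python
import numpy as np

def forward_cross_entropy(p_expert, pi_policy):
    """H(p, pi) = -sum_a p(a) log pi(a): the expectation is under the expert distribution."""
    p_expert = np.asarray(p_expert, dtype=float)
    pi_policy = np.asarray(pi_policy, dtype=float)
    return -np.sum(p_expert * np.log(pi_policy))

def reverse_kl(p_expert, pi_policy):
    """KL(pi || p) = sum_a pi(a) log(pi(a) / p(a)): the expectation is under the policy."""
    p_expert = np.asarray(p_expert, dtype=float)
    pi_policy = np.asarray(pi_policy, dtype=float)
    return np.sum(pi_policy * np.log(pi_policy / p_expert))
```

Both functions assume strictly positive probabilities. For a fixed expert distribution, the forward cross-entropy is minimized when the policy matches the expert, and its gradient weights each action by the expert's probability rather than the policy's own.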

Frontier Model Releases · Alignment and RLHF · DAgger · DistIL · Reinforcement Learning with Verifiable Rewards · +1 more