Almanac
← Events
6arXiv cs.CL (Computation and Language)·2d ago

Sumi: First open 7B uniform diffusion language model pretrained from scratch at scale

Researchers introduce Sumi, a fully open 7B uniform diffusion language model (UDLM) pretrained from scratch on 1.5 trillion tokens — the first UDLM at both large parameter scale and large token budget. Sumi performs competitively with autoregressive models on knowledge, reasoning, and coding benchmarks, though underperforms on commonsense tasks, attributed partly to an education-heavy data mixture. Model weights, checkpoints, and full training recipe including data mixture specification are released publicly. The work fills a gap in the diffusion language model landscape, providing a reference point for studying scaling behavior and generation dynamics in uniform diffusion.

Related guides (2)

Related events (8)

8Hugging Face Blog·1mo ago·source ↗

Introducing BLOOM: The World's Largest Open Multilingual Language Model

Hugging Face and the BigScience workshop released BLOOM, a 176-billion parameter open-access multilingual language model trained on 46 natural languages and 13 programming languages. The model was developed collaboratively by over 1,000 researchers and represents a significant milestone in open-weights large language model development. BLOOM was designed to be freely accessible to researchers and practitioners, in contrast to proprietary models of similar scale.

6arXiv · cs.CL·1mo ago·source ↗

RePlaid: Continuous Diffusion Language Models Scale Competitively with Discrete Diffusion

This paper revisits continuous diffusion language models (DLMs) by introducing RePlaid, an updated version of Plaid that aligns its architecture with modern discrete DLMs. RePlaid establishes the first scaling law for continuous DLMs competitive with discrete approaches, achieving a compute gap of only 20× versus autoregressive models and a state-of-the-art perplexity bound of 22.1 on OpenWebText among continuous DLMs. The authors provide theoretical analysis showing that likelihood-based training naturally yields linear cross-entropy over time and creates structured embedding geometries, explaining the performance gains.

6arXiv · cs.AI·18d ago·source ↗

SimSD: Speculative Decoding Adapted for Diffusion Language Models

SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.

8Qwen Research·1mo ago·source ↗

Qwen2.5-LLM: Alibaba releases open-weight language models from 0.5B to 72B

Alibaba's Qwen team releases the Qwen2.5 series of decoder-only dense language models, open-sourcing seven variants spanning 0.5B to 72B parameters. The release targets production use cases in the 10-30B range and mobile deployments at 3B scale. This represents a significant expansion of the open-weights frontier from a Tier 1 Chinese AI lab.

6arXiv · cs.LG·2d ago·source ↗

Diffusion-Proof: First framework applying diffusion LLMs to formal theorem proving

Researchers introduce Diffusion-Proof, the first framework to train and apply diffusion language models (dLLMs) for formal theorem proving, addressing limitations of autoregressive models in long-range coherence. The framework includes dLLM-Prover-7B for whole-proof generation and dLLM-Corrector-7B for local proof correction via bidirectional infilling. Diffusion-Proof achieves absolute improvements of 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test over an AR baseline, and solves one IMO problem that DeepSeek-Prover-V2-7B could not. The result suggests dLLMs may have structural advantages over AR models for tasks requiring long-range logical coherence.

5arXiv · cs.CL·3d ago·source ↗

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

5arXiv · cs.LG·15d ago·source ↗

SARDI: Self-Augmenting Retrieval for Diffusion Language Models using lookahead tokens

Researchers introduce SARDI, a training-free RAG framework for discrete diffusion language models that repurposes discarded low-confidence tokens during denoising as lookahead signals to guide retrieval before output is finalized. The method is retriever-agnostic and applicable to any reasoning-capable discrete diffusion LM. Evaluated across five multi-hop QA benchmarks, SARDI outperforms training-free diffusion and autoregressive retrieval baselines at up to 8x higher throughput.

5arXiv · cs.CL·9d ago·source ↗

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.