MDM-VGB: Theoretically grounded test-time scaling for masked diffusion models via reward-guided remasking
Researchers introduce MDM-VGB, a discrete diffusion sampler for Masked Diffusion Models that augments token unmasking with reward-guided remasking inspired by the Jerrum-Sinclair backtracking Markov chain. The method extends backtracking from a fixed prefix tree to a masked-state graph, enabling tokens to be unmasked and remasked at arbitrary positions to favor higher-reward partial configurations. The authors prove quadratic complexity and robustness to process-verifier noise, contrasting with exponential complexity of best-of-N heuristics, and validate on constraint-satisfaction benchmarks including Sudoku and QM9.
Related guides (2)
Related events (8)
ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models
Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding
This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.
Looped Diffusion Language Models (LoopMDM): Depth Scaling via Layer Looping
LoopMDM introduces selective looping of early-middle transformer layers in masked diffusion language models, achieving a depth-scaling effect without adding parameters. The approach matches same-size MDM performance with up to 3.3× fewer training FLOPs and outperforms deeper non-looped MDMs on reasoning benchmarks, including up to 8.5 points improvement on GSM8K. Inference-time compute scaling is enabled by varying loop counts, with adaptive loop scheduling providing additional efficiency gains. Attention analysis suggests looping works by promoting interactions among masked token positions.
SimSD: Speculative Decoding Adapted for Diffusion Language Models
SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.
GADD: Gibbs-Accelerated Discrete Diffusion Achieves Polylog Sampling Complexity
This paper introduces Gibbs-Accelerated Discrete Diffusion (GADD), a corrector method for uniform-rate discrete diffusion models that constructs Gibbs posterior likelihoods directly from the concrete score function without additional training. GADD achieves O(polylog(ε⁻¹)) sampling complexity, the first such rate for diffusion-based samplers in this setting. Experiments on synthetic data, zero-shot text sampling, and zero-shot conditional music generation show consistent improvements in sample quality and wall-clock efficiency over Euler and CTMC baselines. The work also introduces a novel induction-based theoretical framework for analyzing predictor-corrector methods in discrete diffusion.
Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control
Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows 17% improvement in episode reward and 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.
AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning
Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.
FMLM+ introduces Posterior Refinement for fast non-autoregressive language generation
Researchers introduce FMLM+, a framework combining Flow Map Language Models with masking-style noise schedules to enable joint sequence generation with per-token global consistency scoring. The key contribution is Posterior Refinement, an inference-time self-correction strategy that matches discrete baseline performance with 32x fewer neural function evaluations (NFEs). The approach improves the speed-quality tradeoff over both Masked Diffusion Models and standard FLMMs across multiple benchmarks, addressing longstanding factorization error problems in non-autoregressive generation.

