Almanac
5arXiv cs.LG (Machine Learning)·17h ago

MDM-VGB: Theoretically grounded test-time scaling for masked diffusion models via reward-guided remasking

Researchers introduce MDM-VGB, a discrete diffusion sampler for Masked Diffusion Models that augments token unmasking with reward-guided remasking inspired by the Jerrum-Sinclair backtracking Markov chain. The method extends backtracking from a fixed prefix tree to a masked-state graph, so tokens can be unmasked and remasked at arbitrary positions to favor higher-reward partial configurations. The authors prove quadratic complexity and robustness to process-verifier noise, in contrast to the exponential complexity of best-of-N heuristics, and validate the method on constraint-satisfaction benchmarks including Sudoku and QM9.
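
As a rough illustration of the unmask/remask idea, the toy sketch below runs a Metropolis-style walk over partially masked states. It is not the paper's algorithm: the verifier `partial_value`, the 50/50 move proposal, the acceptance rule, and the "all cells distinct" constraint are assumptions made up for this example.

```python
import random

MASK = None  # marker for a masked position


def partial_value(state):
    """Hypothetical process verifier: 1.0 if all filled cells are distinct, else a small penalty."""
    filled = [v for v in state if v is not MASK]
    return 1.0 if len(filled) == len(set(filled)) else 0.05


def propose(state, vocab, rng):
    """Pick a neighbor in the masked-state graph: unmask one position or remask one."""
    masked = [i for i, v in enumerate(state) if v is MASK]
    filled = [i for i, v in enumerate(state) if v is not MASK]
    new = list(state)
    if masked and (not filled or rng.random() < 0.5):
        new[rng.choice(masked)] = rng.choice(vocab)  # forward move: unmask
    else:
        new[rng.choice(filled)] = MASK  # backward move: remask
    return new


def sample(length, vocab, max_steps=10_000, seed=0):
    """Walk until a fully unmasked, constraint-satisfying state is reached (or give up)."""
    rng = random.Random(seed)
    state = [MASK] * length
    for _ in range(max_steps):
        if MASK not in state and partial_value(state) == 1.0:
            return state
        cand = propose(state, vocab, rng)
        # Accept in proportion to how the verifier rates the candidate vs. the current state.
        accept = min(1.0, partial_value(cand) / partial_value(state))
        if rng.random() < accept:
            state = cand
    return None


result = sample(4, [1, 2, 3, 4])
```

Because remasking can happen at any position, a bad early choice does not have to be undone in reverse order, which is the property the summary attributes to moving from a prefix tree to a masked-state graph.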

Related events (8)

5arXiv · cs.CL·19d ago

ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models

Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
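
The greedy penalization described above can be sketched roughly as follows. The scoring formula, the `lam` weight, and the input layout (`attn[i][j]` as how strongly position i attends to position j) are assumptions for illustration; the summary does not give the exact ADAS rule.

```python
def greedy_select(conf, attn, k, lam=1.0):
    """Greedily pick k positions to commit in one parallel decoding step.

    conf[i]    : model confidence for the top candidate at position i (assumed input)
    attn[i][j] : attention weight from position i to position j (assumed input)
    lam        : hypothetical penalty strength; lam=0 reduces to plain top-k by confidence
    """
    selected = []
    remaining = set(range(len(conf)))
    while remaining and len(selected) < k:
        def score(i):
            # Soft penalty: attending to an already-selected position costs more
            # the less confident that position is.
            penalty = sum(attn[i][j] * (1.0 - conf[j]) for j in selected)
            return conf[i] - lam * penalty

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


conf = [0.7, 0.65, 0.6]
attn = [
    [0.0, 0.0, 0.0],
    [0.9, 0.0, 0.1],  # position 1 leans heavily on position 0
    [0.0, 0.0, 0.0],
]
with_penalty = greedy_select(conf, attn, k=2)
without_penalty = greedy_select(conf, attn, k=2, lam=0.0)
```

In this toy input, plain top-k commits positions 0 and 1 together, while the penalized rule skips position 1 because it depends on the still-uncertain position 0 and commits position 2 instead.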

6arXiv · cs.CL·28d ago

Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding

This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.

6arXiv · cs.LG·1mo ago

Looped Diffusion Language Models (LoopMDM): Depth Scaling via Layer Looping

LoopMDM introduces selective looping of early-middle transformer layers in masked diffusion language models, achieving a depth-scaling effect without adding parameters. The approach matches same-size MDM performance with up to 3.3× fewer training FLOPs and outperforms deeper non-looped MDMs on reasoning benchmarks, including up to 8.5 points improvement on GSM8K. Inference-time compute scaling is enabled by varying loop counts, with adaptive loop scheduling providing additional efficiency gains. Attention analysis suggests looping works by promoting interactions among masked token positions.

6arXiv · cs.AI·27d ago

SimSD: Speculative Decoding Adapted for Diffusion Language Models

SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46× decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.

6arXiv · cs.LG·1mo ago

GADD: Gibbs-Accelerated Discrete Diffusion Achieves Polylog Sampling Complexity

This paper introduces Gibbs-Accelerated Discrete Diffusion (GADD), a corrector method for uniform-rate discrete diffusion models that constructs Gibbs posterior likelihoods directly from the concrete score function without additional training. GADD achieves O(polylog(ε⁻¹)) sampling complexity, the first such rate for diffusion-based samplers in this setting. Experiments on synthetic data, zero-shot text sampling, and zero-shot conditional music generation show consistent improvements in sample quality and wall-clock efficiency over Euler and CTMC baselines. The work also introduces a novel induction-based theoretical framework for analyzing predictor-corrector methods in discrete diffusion.

5arXiv · cs.LG·12d ago

Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control

Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows a 17% improvement in episode reward and a 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and a 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.

5arXiv · cs.CL·18d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.

5arXiv · cs.CL·5d ago

FMLM+ introduces Posterior Refinement for fast non-autoregressive language generation

Researchers introduce FMLM+, a framework combining Flow Map Language Models with masking-style noise schedules to enable joint sequence generation with per-token global consistency scoring. The key contribution is Posterior Refinement, an inference-time self-correction strategy that matches discrete baseline performance with 32× fewer neural function evaluations (NFEs). The approach improves the speed-quality tradeoff over both Masked Diffusion Models and standard FMLMs across multiple benchmarks, addressing longstanding factorization error problems in non-autoregressive generation.