FMLM+ introduces Posterior Refinement for fast non-autoregressive language generation
Researchers introduce FMLM+, a framework combining Flow Map Language Models with masking-style noise schedules to enable joint sequence generation with per-token global consistency scoring. The key contribution is Posterior Refinement, an inference-time self-correction strategy that matches discrete baseline performance with 32x fewer neural function evaluations (NFEs). The approach improves the speed-quality tradeoff over both Masked Diffusion Models and standard FLMMs across multiple benchmarks, addressing longstanding factorization error problems in non-autoregressive generation.
Related guides (2)
Related events (8)
Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding
This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.
FlowPipe: LLM-conditioned Generative Flow Networks for automated data preparation pipeline construction
FlowPipe is a new framework that frames ML data preparation pipeline synthesis as conditional probabilistic flow generation over a directed acyclic graph, using Conditional Generative Flow Networks (C-GFlowNets) with a Trajectory Balance objective. LLM-derived semantic priors are injected into the policy via Feature-wise Linear Modulation (FiLM), and a failure-aware flow objective steers search away from invalid states. Evaluated on 74 real-world datasets across two benchmark suites, FlowPipe improves accuracy by 11.96% on average over SOTA baselines and achieves 12.5x faster training convergence. The work addresses long-standing limitations in automated data pipeline construction including weak credit assignment and inefficient exploration.
Looped Diffusion Language Models (LoopMDM): Depth Scaling via Layer Looping
LoopMDM introduces selective looping of early-middle transformer layers in masked diffusion language models, achieving a depth-scaling effect without adding parameters. The approach matches same-size MDM performance with up to 3.3× fewer training FLOPs and outperforms deeper non-looped MDMs on reasoning benchmarks, including up to 8.5 points improvement on GSM8K. Inference-time compute scaling is enabled by varying loop counts, with adaptive loop scheduling providing additional efficiency gains. Attention analysis suggests looping works by promoting interactions among masked token positions.
Squeezing Capacity from MLLMs for Subject-driven Image Generation via Dual Layer Aggregation
This paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, augmented with VAE-based identity conditioning to address copy-paste artifacts and identity preservation failures in subject-driven image generation. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.
NF-CoT: Latent reasoning with normalizing flows preserves autoregressive LLM advantages
Researchers propose NF-CoT, a latent reasoning framework that replaces discrete chain-of-thought token streams with continuous intermediate states modeled by normalizing flows embedded inside an LLM backbone. The approach uses a TARFlow-style normalizing flow head alongside the standard language model head, enabling exact likelihoods, KV-cache-compatible left-to-right decoding, and policy-gradient optimization in latent space. On code-generation benchmarks, NF-CoT improves pass rates over both explicit CoT and prior latent-reasoning baselines while reducing intermediate reasoning cost. The work addresses a key limitation of existing latent reasoning methods, which typically sacrifice probabilistic tractability or autoregressive compatibility.
Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
NVIDIA's Nemotron-Labs introduces diffusion-based language models targeting extremely fast text generation, published as a Hugging Face blog post. The piece covers the approach of using diffusion processes for language modeling as an alternative to autoregressive generation, with a focus on inference speed. This represents a continued push by NVIDIA's research arm into non-autoregressive generation paradigms.
LangMAP: Language-adaptive tokenization from a shared vocabulary without language identification at inference
LangMAP (Language-adaptive Maximum a Posteriori Tokenization) extends the UnigramLM algorithm to produce language-specific tokenizations from a single shared vocabulary, eliminating the need to retrain models or swap vocabularies for multilingual settings. A key property is that language labels are only required at training time; inference proceeds without language identification. Evaluated across 14 tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and AST-leaf alignment for all coding languages tested. Fine-tuning results are mixed: consistent gains on grammatical acceptability (MultiBLiMP) but less consistent on knowledge tasks (Global-PIQA, Belebele).
SimSD: Speculative Decoding Adapted for Diffusion Language Models
SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.

