5 · arXiv cs.CL (Computation and Language) · 17h ago

FMLM+ introduces Posterior Refinement for fast non-autoregressive language generation

Researchers introduce FMLM+, a framework combining Flow Map Language Models (FMLMs) with masking-style noise schedules to enable joint sequence generation with per-token global consistency scoring. The key contribution is Posterior Refinement, an inference-time self-correction strategy that matches discrete baseline performance with 32x fewer neural function evaluations (NFEs). The approach improves the speed-quality tradeoff over both Masked Diffusion Models and standard FMLMs across multiple benchmarks, addressing the long-standing factorization error problem in non-autoregressive generation.

Frontier Model Releases Inference Economics Posterior Refinement Flow Map Language Models FMLM+ Masked Diffusion Models Posterior Refinement: Fast Language Generation via Any-Order Flow Maps

Related guides (2)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

6 · arXiv · cs.CL · 23d ago

Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding

This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find that MDLMs naturally unmask entities first, then relational and function words, then structural tokens. Supervised fine-tuning disrupts this pattern by prematurely anchoring structural tokens, which causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers 9.4 BLEU-4 points, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.

Frontier Model Releases Evaluation and Benchmarking BLEU-4 Graph Transformer Diffusion Language Models +5 more

4 · arXiv · cs.LG · 11h ago

FlowPipe: LLM-conditioned Generative Flow Networks for automated data preparation pipeline construction

FlowPipe is a new framework that frames ML data preparation pipeline synthesis as conditional probabilistic flow generation over a directed acyclic graph, using Conditional Generative Flow Networks (C-GFlowNets) with a Trajectory Balance objective. LLM-derived semantic priors are injected into the policy via Feature-wise Linear Modulation (FiLM), and a failure-aware flow objective steers search away from invalid states. Evaluated on 74 real-world datasets across two benchmark suites, FlowPipe improves accuracy by 11.96% on average over SOTA baselines and achieves 12.5x faster training convergence. The work addresses long-standing limitations in automated data pipeline construction including weak credit assignment and inefficient exploration.

Agent and Tool Ecosystem Feature-wise Linear Modulation Trajectory Balance Conditional Generative Flow Networks +1 more
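
The two components named in the FlowPipe summary have standard textbook forms, sketched here in plain Python. FiLM scales and shifts each feature with conditioning-derived coefficients, and the Trajectory Balance loss penalizes the squared mismatch between forward and backward log-flow along one complete trajectory. The function names, argument layout, and toy numbers are assumptions made for illustration; this is not FlowPipe's code, and its failure-aware objective is not modeled.

```python
import math

def film(features, gamma, beta):
    """FiLM: per-feature affine modulation, gamma_i * h_i + beta_i.

    In practice gamma and beta would be produced from a conditioning
    signal (here assumed to be given directly).
    """
    return [g * h + b for h, g, b in zip(features, gamma, beta)]

def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, log_reward):
    """Squared Trajectory Balance residual for one complete trajectory:
    (log Z + sum log P_F - log R - sum log P_B) ** 2
    """
    residual = log_z + sum(log_pf_steps) - log_reward - sum(log_pb_steps)
    return residual ** 2

# Toy FiLM call (hypothetical numbers).
modulated = film([1.0, -2.0, 0.5], gamma=[2.0, 0.5, 1.0], beta=[0.0, 1.0, -0.5])

# Toy trajectory where flow is balanced: Z * P_F = 2 * 0.25 equals R * P_B = 0.5 * 1.
loss = trajectory_balance_loss(
    log_z=math.log(2.0),
    log_pf_steps=[math.log(0.5), math.log(0.5)],
    log_pb_steps=[0.0, 0.0],
    log_reward=math.log(0.5),
)
```

When the forward flow `Z * P_F` matches `R * P_B`, the residual vanishes, which is the condition the objective drives toward.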

6 · arXiv · cs.LG · 29d ago

Looped Diffusion Language Models (LoopMDM): Depth Scaling via Layer Looping

LoopMDM introduces selective looping of early-middle transformer layers in masked diffusion language models, achieving a depth-scaling effect without adding parameters. The approach matches same-size MDM performance with up to 3.3× fewer training FLOPs and outperforms deeper non-looped MDMs on reasoning benchmarks, including up to 8.5 points improvement on GSM8K. Inference-time compute scaling is enabled by varying loop counts, with adaptive loop scheduling providing additional efficiency gains. Attention analysis suggests looping works by promoting interactions among masked token positions.

Training Infrastructure Frontier Model Releases Transformers Layer Looping LoopMDM +4 more
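
The depth-scaling idea in the LoopMDM entry can be illustrated with a minimal sketch: a contiguous block of layers is applied several times with the same weights, so effective depth grows while the parameter count stays fixed. Which layers are looped, the loop count, and the helper names below are assumptions for illustration; the paper's adaptive loop scheduling is not modeled.

```python
def looped_forward(x, layers, loop_start, loop_end, n_loops):
    """Apply layers in order, repeating layers[loop_start:loop_end] n_loops times.

    The looped block reuses the same layer objects, so no parameters are added.
    """
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x

def effective_depth(n_layers, loop_start, loop_end, n_loops):
    """Number of layer applications in one forward pass."""
    return n_layers + (n_loops - 1) * (loop_end - loop_start)

# Toy layers that record the order in which they are called (hypothetical).
calls = []

def make_layer(index):
    def layer(x):
        calls.append(index)
        return x + 1  # stand-in for a transformer block
    return layer

layers = [make_layer(i) for i in range(6)]
output = looped_forward(0, layers, loop_start=1, loop_end=3, n_loops=3)
depth = effective_depth(len(layers), loop_start=1, loop_end=3, n_loops=3)
```

Changing `n_loops` at inference time is the knob that trades compute for depth in this sketch, mirroring the inference-time scaling the summary mentions.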

5 · arXiv · cs.LG · 29d ago

Squeezing Capacity from MLLMs for Subject-driven Image Generation via Dual Layer Aggregation

This paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, augmented with VAE-based identity conditioning to address copy-paste artifacts and identity preservation failures in subject-driven image generation. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.

Agent and Tool Ecosystem Multimodal Progress Multimodal Large Language Models Dual Layer Aggregation (DLA) Subject-driven Image Generation +3 more

6 · arXiv · cs.CL · 19d ago

NF-CoT: Latent reasoning with normalizing flows preserves autoregressive LLM advantages

Researchers propose NF-CoT, a latent reasoning framework that replaces discrete chain-of-thought token streams with continuous intermediate states modeled by normalizing flows embedded inside an LLM backbone. The approach uses a TARFlow-style normalizing flow head alongside the standard language model head, enabling exact likelihoods, KV-cache-compatible left-to-right decoding, and policy-gradient optimization in latent space. On code-generation benchmarks, NF-CoT improves pass rates over both explicit CoT and prior latent-reasoning baselines while reducing intermediate reasoning cost. The work addresses a key limitation of existing latent reasoning methods, which typically sacrifice probabilistic tractability or autoregressive compatibility.

Inference Economics Alignment and RLHF TARFlow NF-CoT Latent Reasoning with Normalizing Flows
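
The "exact likelihoods" property in the NF-CoT entry follows from the standard change-of-variables identity for normalizing flows, stated here for reference. The notation is assumed for this note and not taken from the paper: $f_\theta$ is an invertible map from a latent reasoning state $z$ (given context $c$) to a base variable with density $p_U$.

```latex
\log p_\theta(z \mid c)
  = \log p_U\!\big(f_\theta(z; c)\big)
  + \log \left| \det \frac{\partial f_\theta(z; c)}{\partial z} \right|
```

When the flow is autoregressive, the Jacobian is triangular, so the log-determinant reduces to a sum of log-magnitudes of its diagonal entries and can be evaluated cheaply.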

5 · Hugging Face Blog · 1mo ago

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

In a Hugging Face blog post, NVIDIA's Nemotron-Labs introduces diffusion-based language models aimed at extremely fast text generation. The post presents diffusion processes as an alternative to autoregressive language modeling, with a focus on inference speed. It continues the push by NVIDIA's research arm into non-autoregressive generation paradigms.

Frontier Model Releases Inference Economics Diffusion Language Models NVIDIA Hugging Face +3 more

5 · arXiv · cs.CL · 35h ago

LangMAP: Language-adaptive tokenization from a shared vocabulary without language identification at inference

LangMAP (Language-adaptive Maximum a Posteriori Tokenization) extends the UnigramLM algorithm to produce language-specific tokenizations from a single shared vocabulary, eliminating the need to retrain models or swap vocabularies for multilingual settings. A key property is that language labels are only required at training time; inference proceeds without language identification. Evaluated across 14 tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and AST-leaf alignment for all coding languages tested. Fine-tuning results are mixed: consistent gains on grammatical acceptability (MultiBLiMP) but less consistent on knowledge tasks (Global-PIQA, Belebele).

Evaluation and Benchmarking Open Weights Progress UnigramLM Global-PIQA MultiBLiMP +2 more
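
LangMAP is described as extending UnigramLM, whose decoding step selects the maximum-probability segmentation by dynamic programming. The sketch below shows only that baseline Viterbi step, and illustrates how two score tables over one shared vocabulary can yield different segmentations of the same string. The toy vocabulary, scores, and table names are invented for illustration; how LangMAP derives its scores and avoids language identification at inference is not described in this summary and is not modeled here.

```python
import math

def viterbi_tokenize(text, log_probs):
    """Return the highest-scoring segmentation of text under a unigram model.

    log_probs maps each vocabulary piece to its log-probability.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, pos = [], n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]

# One shared vocabulary, two hypothetical score tables.
scores_a = {"ab": -1.0, "cd": -1.0, "a": -3.0, "bcd": -3.0}
scores_b = {"ab": -3.0, "cd": -3.0, "a": -1.0, "bcd": -1.0}

tokens_a = viterbi_tokenize("abcd", scores_a)
tokens_b = viterbi_tokenize("abcd", scores_b)
```

The vocabulary is identical in both calls; only the scores differ, which is enough to change the chosen segmentation.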

6 · arXiv · cs.AI · 22d ago

SimSD: Speculative Decoding Adapted for Diffusion Language Models

SimSD introduces a training-free speculative decoding algorithm for diffusion large language models (dLLMs), which previously could not use standard token-level speculative decoding due to their bidirectional attention and masked language modeling formulation. The method uses a plug-and-play masking strategy that introduces reference tokens from a draft model and a custom attention mask, enabling valid logit computation for drafted tokens in a single forward pass. Evaluated on SDAR-family dLLMs across four benchmarks, SimSD achieves up to 7.46x decoding throughput improvement while maintaining or improving generation quality. The approach is compatible with other acceleration techniques such as KV cache and blockwise decoding.

Frontier Model Releases Inference Economics KV Cache speculative decoding SDAR +4 more
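
For context on the SimSD entry, the sketch below shows the standard token-level draft-and-verify step under greedy decoding, which is the scheme the summary says diffusion LLMs could not use directly. It is a generic illustration, not SimSD's masking strategy or attention mask; the function name and the toy token IDs are assumptions.

```python
def verify_draft(draft_tokens, target_predictions):
    """Greedy verification: keep the longest draft prefix the target agrees with.

    target_predictions[i] is the target model's own choice for position i,
    computed with draft_tokens[:i] as context in one forward pass. It must
    contain one more entry than draft_tokens (the position after the draft).
    """
    if len(target_predictions) != len(draft_tokens) + 1:
        raise ValueError("need one target prediction per draft position, plus one")
    accepted = []
    for i, token in enumerate(draft_tokens):
        if token != target_predictions[i]:
            break
        accepted.append(token)
    # The target's token at the first mismatch (or after a fully accepted
    # draft) is always valid, so every call yields at least one token.
    accepted.append(target_predictions[len(accepted)])
    return accepted

# Hypothetical token IDs: the draft diverges at its third position.
partial = verify_draft([5, 7, 9, 2], [5, 7, 8, 2, 4])
# A fully accepted draft also gains the target's next token.
full = verify_draft([5, 7], [5, 7, 3])
```

Throughput gains come from accepting several drafted tokens per target forward pass; output under greedy decoding matches what the target model alone would produce.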