6 · arXiv cs.CL (Computation and Language) · 5d ago

Weave of Formal Thought: Sound-and-complete constrained decoding with learned latent syntax for code LLMs

The paper introduces Weave of Formal Thought (WoFT), a framework combining a formally sound-and-complete constrained decoder for code generation with a latent-variable fine-tuning method that teaches LLMs to interleave grammar non-terminals during generation. The constrained decoder extends generalized LR (GLR) parsing with speculative lexing to handle context-sensitive lexing and maximal-munch tokenization, addressing gaps in prior constrained-decoding work. A reweighted wake-sleep (RWS) fine-tuning objective on StarCoder2-3B achieves a 14.3% relative reduction in per-token cross-entropy over a text-only SFT baseline on Python, suggesting that explicit structural scaffolding recovers information lost in flat autoregressive training.
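
The summary names maximal-munch tokenization as one of the gaps the decoder addresses. As a minimal sketch of that rule alone (not the paper's GLR decoder or its speculative lexing), the toy lexer below takes the longest matching token at each position. The token set and the `lex` helper are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy token set. The longest match wins; list order only breaks ties.
TOKEN_PATTERNS = [
    ("EQEQ", re.compile(r"==")),
    ("ASSIGN", re.compile(r"=")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
]


def lex(text):
    """Tokenize text by maximal munch: at each position keep the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip whitespace between tokens
            continue
        best = None  # (token name, end offset) of the longest match so far
        for name, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


partial = lex("a=")     # the prefix ends in an ASSIGN token
extended = lex("a==b")  # one more "=" turns that token into EQEQ
```

In this toy, `a=` ends in an `ASSIGN` token, but appending one more `=` changes that token to `EQEQ`. This is presumably the kind of ambiguity that prevents a constrained decoder from committing to a lexing of partial output too early.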

Evaluation and Benchmarking · Agent and Tool Ecosystem · Generalized LR Parsing · Tree-sitter · Weave of Formal Thought · StarCoder2 · Reweighted Wake-Sleep

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 25d ago

NF-CoT: Latent reasoning with normalizing flows preserves autoregressive LLM advantages

Researchers propose NF-CoT, a latent reasoning framework that replaces discrete chain-of-thought token streams with continuous intermediate states modeled by normalizing flows embedded inside an LLM backbone. The approach uses a TARFlow-style normalizing flow head alongside the standard language model head, enabling exact likelihoods, KV-cache-compatible left-to-right decoding, and policy-gradient optimization in latent space. On code-generation benchmarks, NF-CoT improves pass rates over both explicit CoT and prior latent-reasoning baselines while reducing intermediate reasoning cost. The work addresses a key limitation of existing latent reasoning methods, which typically sacrifice probabilistic tractability or autoregressive compatibility.

Inference Economics · Alignment and RLHF · TARFlow · NF-CoT · Latent Reasoning with Normalizing Flows

5 · arXiv · cs.CL · 21d ago

IS-CoT framework addresses long-form generation collapse in LLMs via interleaved structural thinking

Researchers introduce IS-CoT (Interleaved Structural Chain-of-Thought), a framework that embeds a dynamic Plan-Write-Reflect cycle into LLM generation to overcome severe length collapse observed in reasoning-enhanced models for open-ended writing tasks beyond 2,000 words. The authors construct a multi-teacher training dataset of interleaved reasoning traces and train IS-Writer-8B, which achieves state-of-the-art results on LongBench-Write, outperforming DeepSeek-V3.2 by 3.08 points. The work identifies static hierarchical planning as a root cause of long-form degradation and proposes an in-model alternative to external agentic workflows.

Long Context Evolution · Evaluation and Benchmarking · DeepSeek V4 · LongBench-Write · IS-Writer-8B · +1 more

5 · arXiv · cs.CL · 14d ago

ASRD: Training-free anchor-guided revocable decoding for diffusion LLMs improves accuracy and throughput

A new arXiv preprint introduces ASRD (Anchor Supervised Revocable Decoding), a training-free framework for improving decoding quality in diffusion large language models. The method addresses error propagation and local error reinforcement in revocable decoding by separating trusted 'anchor tokens' (identified via temporal consistency) from uncertain candidates, then applying anchor-guided generation and anchor-perturbed verification. Experiments on math and coding benchmarks show up to 6.4% accuracy improvement and 7.2× inference throughput gains over remasking baselines.

Inference Economics · ASRD · Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

4 · arXiv · cs.CL · 7d ago

Variance-Calibrated Modulation (VCM): training-free decoding intervention to address LLM likelihood trap

Researchers propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding method that reshapes LLM probability distributions before truncation to combat repetitive degeneration and vocabulary dullness. VCM combines two mechanisms: Contextual Searchlight via PMI (suppressing stopwords, elevating context-relevant tokens) and Adaptive Self-Debiasing (scale-invariant penalization using real-time logit standard deviation). Evaluated across open-ended generation, factual QA, and mathematical reasoning, VCM improves diversity, coherence, and reasoning accuracy at higher temperatures with negligible overhead. The method is compatible with existing decoding strategies like Top-p and Min-p.
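
The summary does not say how VCM computes or applies its PMI term, so the following is only a sketch of the standard pointwise mutual information score that the "Contextual Searchlight" name refers to. It compares a token's probability in context with its context-free prior. The `pmi` helper and all probabilities are hypothetical illustration values, not figures from the paper.

```python
import math


def pmi(p_token_given_context, p_token_prior):
    """Pointwise mutual information: log p(token | context) - log p(token)."""
    return math.log(p_token_given_context) - math.log(p_token_prior)


# Hypothetical (in-context probability, context-free prior) pairs.
candidates = {
    "the": (0.050, 0.060),      # frequent everywhere, so context adds little
    "parser": (0.020, 0.0001),  # rare in general, likely in this context
}

scores = {tok: pmi(p_ctx, p_prior) for tok, (p_ctx, p_prior) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under these assumed numbers the stopword gets a negative score and the context-specific word a positive one, which matches the suppress-and-elevate behavior the summary describes.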

Evaluation and Benchmarking · Inference Economics · Adaptive Self-Debiasing · Variance-Calibrated Modulation · Contextual Searchlight via PMI

6 · arXiv · cs.CL · 29d ago

Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding

This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.

Frontier Model Releases · Evaluation and Benchmarking · BLEU-4 · Graph Transformer · Diffusion Language Models · +5 more

5 · arXiv · cs.CL · 6d ago

FMLM+ introduces Posterior Refinement for fast non-autoregressive language generation

Researchers introduce FMLM+, a framework combining Flow Map Language Models with masking-style noise schedules to enable joint sequence generation with per-token global consistency scoring. The key contribution is Posterior Refinement, an inference-time self-correction strategy that matches discrete baseline performance with 32x fewer neural function evaluations (NFEs). The approach improves the speed-quality tradeoff over both Masked Diffusion Models and standard FMLMs across multiple benchmarks, addressing longstanding factorization error problems in non-autoregressive generation.

Frontier Model Releases · Inference Economics · Posterior Refinement · Flow Map Language Models · FMLM+ · +2 more

6 · Hugging Face Blog · 1mo ago

StarCoder: A State-of-the-Art LLM for Code

Hugging Face and ServiceNow released StarCoder, a large language model for code trained on permissively licensed data from The Stack dataset. The model targets code generation, completion, and understanding tasks and is positioned as an open-weights alternative to proprietary code models. The release includes model weights, training details, and an associated technical report.

Open Weights Progress · Agent and Tool Ecosystem · ServiceNow AI · BigCode · The Stack v2 · +2 more

5 · arXiv · cs.AI · 14h ago

LeVo 2: Hybrid LLM-Diffusion framework for stable full-length song generation with hierarchical modeling

LeVo 2 is a new hybrid LLM-Diffusion system for controllable full-length song generation that addresses the coherence-vs-acoustics trade-off through hierarchical token prediction: a language model handles semantic planning via mixed tokens, then predicts vocal and accompaniment tracks in parallel, while a diffusion-based codec reconstructs waveforms. A key contribution is an aesthetics-guided progressive post-training schedule combining SFT, offline DPO, and semi-online DPO to separately optimize quality, controllability, and musicality. Expert listening tests show LeVo 2 outperforms open-source baselines across six subjective dimensions and approaches leading commercial systems on several metrics.
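
The post-training schedule above relies on DPO. As a rough sketch, the standard DPO loss for a single preference pair can be written as below. This is the generic objective only, not LeVo 2's offline or semi-online variants or its aesthetics-guided schedule. The function name, argument names, and the `beta=0.1` default are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy favors each response than the frozen reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy matches the reference, so margin = 0.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# The policy raises the chosen response and lowers the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss equals log 2 when the policy and reference agree, and falls as the policy shifts probability toward the preferred response relative to the reference.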

Alignment and RLHF · Multimodal Progress · LeVo 2 · Direct Preference Optimization (DPO)