4 · arXiv cs.LG (Machine Learning) · 3d ago

SDE approximation for TD learning with linear features under Markovian noise

A new arXiv preprint replaces the classical ODE description of linear TD(0) learning with a stochastic differential equation (SDE) approximation that accounts for Markovian sampling noise. The model separates contraction dynamics governed by the projected Bellman operator from the influence of Markovian long-run covariance, providing a theoretical explanation for the constant-stepsize error floor. The work is a theoretical contribution to the foundations of reinforcement learning policy evaluation.
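The constant-stepsize error floor mentioned above can be seen in a small simulation. The sketch below is illustrative only and is not the preprint's SDE model: it runs plain linear TD(0) on a randomly generated 5-state Markov chain with 2 features, starts at the TD fixed point, and measures how far the iterate wanders under Markovian sampling for two step sizes. The chain, the features, and the helper names `td_fixed_point` and `td0_error_floor` are all assumptions made for this example.

```python
import numpy as np

def td_fixed_point(P, Phi, r, gamma):
    """Solve A theta = b for the linear TD(0) fixed point (hypothetical helper)."""
    n = P.shape[0]
    mu = np.full(n, 1.0 / n)
    for _ in range(10_000):          # power iteration for the stationary distribution
        mu = mu @ P
    D = np.diag(mu)
    A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
    b = Phi.T @ D @ r
    return np.linalg.solve(A, b)

def td0_error_floor(P, Phi, r, gamma, alpha, theta_star, steps, seed):
    """Run constant-stepsize TD(0) along one Markov trajectory, starting at
    theta_star, and return the mean squared distance to theta_star over the
    second half of the run (a rough estimate of the error floor)."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cdf = np.cumsum(P, axis=1)
    theta = theta_star.copy()
    s, sq_err, burn_in = 0, 0.0, steps // 2
    for t in range(steps):
        # Sample the next state from row s of P (clamped against rounding).
        s_next = min(int(np.searchsorted(cdf[s], rng.random())), n - 1)
        # TD error and semi-gradient update with linear features.
        delta = r[s] + gamma * (Phi[s_next] @ theta) - Phi[s] @ theta
        theta = theta + alpha * delta * Phi[s]
        if t >= burn_in:
            sq_err += float(np.sum((theta - theta_star) ** 2))
        s = s_next
    return sq_err / (steps - burn_in)

# Assumed toy problem: random chain, unit-norm random features, random rewards.
rng = np.random.default_rng(0)
n, d, gamma = 5, 2, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
Phi = rng.standard_normal((n, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
r = rng.standard_normal(n)

theta_star = td_fixed_point(P, Phi, r, gamma)
floor_large = td0_error_floor(P, Phi, r, gamma, 0.05, theta_star, 60_000, seed=1)
floor_small = td0_error_floor(P, Phi, r, gamma, 0.005, theta_star, 60_000, seed=1)
print(f"alpha=0.05  -> mean squared error {floor_large:.3e}")
print(f"alpha=0.005 -> mean squared error {floor_small:.3e}")
```

Because the run starts at the fixed point, any remaining error comes from sampling noise rather than from the initial transient. In this toy setting the larger step size should leave a visibly larger residual error, which is the qualitative behavior an error-floor analysis is meant to explain.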

Alignment and RLHF · TD(0) · A Diffusion Approximation for Temporal-Difference Learning with Linear Features under Markovian Noise

Related guides (1)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Related events (8)

5 · arXiv · cs.LG · 3d ago

Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control

Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows a 17% improvement in episode reward and a 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and a 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.

Agent and Tool Ecosystem · Hamilton-Jacobi reachability · Kolmogorov Regression for Robust Diffusion Policies · PushT · +1 more

6 · Berkeley AI Research (BAIR) Blog · 1mo ago

RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL

A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids the error accumulation of temporal-difference (TD) learning by making the number of Bellman recursions grow logarithmically rather than linearly with the horizon. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.
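The logarithmic-versus-linear contrast can be illustrated in a tabular setting. The sketch below is not the post's algorithm, which works with learned value functions. It assumes a deterministic 9-state chain with unit step costs and uses shortest-path distances as a stand-in for goal-conditioned values. The hypothetical `transitive_update` combines two sub-paths through every candidate midpoint, so each round doubles the path length that the table covers.

```python
import numpy as np

def transitive_update(dist):
    """One divide-and-conquer round: d(s, g) <- min over w of d(s, w) + d(w, g).
    Because d(s, s) = 0, the update never increases an entry."""
    # Axis 0 is the start s, axis 1 is the midpoint w, axis 2 is the goal g.
    return np.min(dist[:, :, None] + dist[None, :, :], axis=1)

# Assumed toy problem: a one-way chain 0 -> 1 -> ... -> 8 with unit step cost.
n = 9
dist = np.full((n, n), np.inf)
np.fill_diagonal(dist, 0.0)
for i in range(n - 1):
    dist[i, i + 1] = 1.0

rounds = 0
while True:
    new = transitive_update(dist)
    if np.array_equal(new, dist):   # stop once a round changes nothing
        break
    dist, rounds = new, rounds + 1

print("distance from state 0 to state 8:", dist[0, 8])
print("transitive rounds needed:", rounds)
```

On this chain the farthest goal is 8 steps away, so a one-step backup would need 8 sweeps to propagate its value, while the transitive update needs 3 rounds (covering paths of length 2, 4, then 8).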

Evaluation and Benchmarking · Agent and Tool Ecosystem · Leslie Pack Kaelbling · Divide-and-Conquer Value Learning · Berkeley AI Research (BAIR) · +8 more

5 · Hugging Face Blog · 1mo ago

Finetune Stable Diffusion Models with DDPO via TRL

Hugging Face's TRL library adds support for DDPO (Denoising Diffusion Policy Optimization), enabling reinforcement learning-based finetuning of Stable Diffusion models. This extends TRL's RLHF tooling beyond language models to image generation, allowing reward-driven optimization of diffusion models. The post demonstrates practical usage of the new DDPO trainer within the TRL ecosystem.

Agent and Tool Ecosystem · Alignment and RLHF · DDPO · Denoising Diffusion Policy Optimization · Stable Diffusion 3 · +3 more

4 · arXiv · cs.AI · 11d ago

PTL-Diffusion: Diffusion framework with periodic terminal laws for manifold-aware generation

PTL-Diffusion is a new diffusion modeling framework that replaces the standard single Gaussian terminal distribution with a periodic family of Gaussian terminal laws, embedding phase structure directly into the forward noising dynamics rather than only in the denoising network. The authors derive closed-form forward marginals and reverse posteriors for a periodically forced Ornstein-Uhlenbeck process, enabling standard noise-prediction training. Experiments on torus, cylinder, and face datasets show improvements in manifold-level distributional matching over DDPM baselines. The work is a proof-of-concept motivating structured terminal reference laws as a direction for geometry-aware generative modeling.

Evaluation and Benchmarking · Denoising Diffusion Probabilistic Models · Olivetti Faces Dataset · PTL-Diffusion

5 · arXiv · cs.LG · 8d ago

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD), a post-training method combining on-policy student trajectories with dense teacher supervision, across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.

Evaluation and Benchmarking · Alignment and RLHF · on-policy distillation · AdamW · Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

4 · arXiv · cs.LG · 12d ago

Second-order path kernel interpolation formulas extend Domingos' gradient-descent characterization

This paper extends Pedro Domingos' 2020 first-order path-kernel interpolation formula for gradient-descent-trained models to second-order forms. The authors derive curvature-weighted correction terms for standard SGD, an additional sampling-induced component coupling prediction curvature with mini-batch gradient noise covariance, and an extension to SGD with momentum. A concentration estimate for the terminal prediction is also established, quantifying fluctuation around the expected second-order representation.

Pedro Domingos · Second-Order Path Kernel Interpolation Formulas in Machine Learning

5 · arXiv · cs.LG · 26d ago

Perturbation Theory for Spherical Hellinger-Kantorovich Flows with Differential Privacy Guarantees

This paper develops a perturbation theory for Spherical Hellinger-Kantorovich (SHK) gradient flows, which couple transport and reaction dynamics and coincide with birth-death Langevin dynamics. The authors derive dimension-free bounds on log-likelihood ratios and Rényi/KL divergences when two potentials differ, quantifying how perturbations propagate over time. These results are applied to differential privacy: the likelihood-ratio control yields explicit Pure-DP guarantees for SHK-based samplers implementing the exponential mechanism, while KL bounds provide Approximate-DP certificates. A utility bound is also derived that separates intrinsic exponential-mechanism suboptimality from finite-time sampling error.

AI Safety Research · Alignment and RLHF · Differential Privacy · KL Divergence · Spherical Hellinger-Kantorovich geometry · +4 more

5 · arXiv · cs.CL · 3d ago

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

Frontier Model Releases · Alignment and RLHF · d-OPSD · Learning from the Self-future: On-policy Self-distillation for dLLMs