ACTION-BED: Task-driven Bayesian experimental design via singly intractable expected future loss objectives
A new arXiv preprint proposes ACTION-BED, a reformulation of Bayesian experimental design (BED) that replaces the traditional doubly intractable expected information gain objective with an expected future loss (EFL) on downstream actions. The authors show that all such EFLs can be rearranged into singly intractable objectives that are jointly optimizable over design and action policies via stochastic gradients, eliminating the need for explicit posterior or marginal likelihood estimation. The authors claim the method is more efficient and more customizable to downstream tasks than existing BED approaches.
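The difference between a doubly and a singly intractable objective can be written out in a standard textbook form. The notation here is assumed for illustration (θ for parameters, y for outcomes, d for the design, a for an action, ℓ for the loss, π for an action policy) and is not necessarily the preprint's own derivation.

```latex
% Expected information gain: an outer expectation over (theta, y) wraps a
% log-marginal p(y | d) that is itself an intractable integral.
\mathrm{EIG}(d) = \mathbb{E}_{p(\theta)\,p(y \mid \theta, d)}
  \left[ \log p(y \mid \theta, d) - \log p(y \mid d) \right],
\qquad
p(y \mid d) = \mathbb{E}_{p(\theta)}\left[ p(y \mid \theta, d) \right]

% Expected future loss: the per-outcome minimization over actions can be
% pulled outside the expectation as a minimization over policies pi.
\mathrm{EFL}(d) = \mathbb{E}_{p(y \mid d)}
  \left[ \min_{a} \mathbb{E}_{p(\theta \mid y, d)}\left[ \ell(\theta, a) \right] \right]
  = \min_{\pi} \mathbb{E}_{p(\theta)\,p(y \mid \theta, d)}
  \left[ \ell(\theta, \pi(y, d)) \right]
```

The second equality assumes π may be any function of the outcome and that the minimum is attained. The right-hand side then needs only samples from the prior and the likelihood, which is why a single Monte Carlo expectation suffices and no posterior or marginal likelihood has to be estimated.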
Related events (8)
GoBOED: Goal-Driven Bayesian Optimal Experimental Design for Decision-Focused Robustness
GoBOED is a new framework for Bayesian optimal experimental design (BOED) that replaces information-gain maximization with direct optimization for a specified downstream decision objective. It combines an amortized variational posterior surrogate with a differentiable convex decision layer to enable gradient-based, decision-focused design optimization. The authors prove that GoBOED gradients are insensitive to parameter directions irrelevant to the decision goal, formally justifying why goal-driven design achieves equivalent decision quality over a wider range of experimental designs. Empirical results across source localization, epidemic management, and pharmacokinetic control show improved alignment with decision objectives compared to goal-agnostic BOED.
HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies
Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.
AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning
This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.
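One plausible reading of why all-wrong subgroups suppress the learning signal is the group-relative normalization used in GRPO-style training: when every reward in a group is identical, every advantage is zero. The sketch below shows only that general effect. The function name and the reward values are hypothetical, and whether AXPO's subgroups are normalized in exactly this way is an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (GRPO-style baseline)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group with mixed outcomes produces positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout failed produces all-zero advantages,
# so the policy gradient from this group vanishes.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_wrong)
```

Resampling tool calls for such groups, as the summary describes, would give them a chance to contain at least one success and therefore a non-zero advantage.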
Reward uncertainty as a principled mechanism for diverse RL behaviour
A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.
UBP2: Model-based preference RL with uncertainty-balanced exploration achieves sublinear regret
UBP2 (Uncertainty-Balanced Preference Planning) is a model-based reinforcement learning method that improves sample efficiency in preference-based RL by jointly reasoning over uncertainties in reward, dynamics, and value functions. The approach uses ensembles to score candidate trajectories and provides a principled exploitation-exploration tradeoff without ad hoc heuristics. The authors prove sublinear regret guarantees for finite- and infinite-horizon settings and demonstrate substantially better sample efficiency than model-free baselines on the Meta-World benchmark.
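A common way to turn ensemble disagreement into an exploration signal is an optimistic score: the ensemble mean plus a multiple of the ensemble standard deviation. The sketch below illustrates that generic idea with a single ensemble and hypothetical numbers. It is not UBP2's actual scoring rule, which the summary says combines reward, dynamics, and value uncertainty.

```python
import numpy as np

def optimistic_scores(ensemble_returns, beta=1.0):
    """Score each candidate as ensemble mean plus beta times ensemble std."""
    r = np.asarray(ensemble_returns, dtype=float)  # shape: (members, candidates)
    return r.mean(axis=0) + beta * r.std(axis=0)

# Three ensemble members predict returns for two candidate trajectories.
# Candidate 0: all members agree. Candidate 1: lower mean, high disagreement.
preds = np.array([[1.0, 0.2],
                  [1.0, 1.6],
                  [1.0, 0.6]])

greedy_choice = int(np.argmax(optimistic_scores(preds, beta=0.0)))
exploring_choice = int(np.argmax(optimistic_scores(preds, beta=1.0)))
print(greedy_choice, exploring_choice)
```

With `beta=0.0` the score is purely exploitative and picks the candidate with the higher mean; with a larger `beta` the uncertain candidate wins, which is the trade-off the weight controls.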
RevengeBench: Benchmark for Reconstructing Agent Decision Programs from Behavioral Observations
RevengeBench is a new benchmark of 75 LLM-generated, Elo-calibrated policies across five game environments that tests whether a learner can reconstruct a hidden agent's decision program as executable code from behavioral traces alone. The benchmark draws from CodeClash tournament trajectories and allows the learner to design controlled behavioral probes (custom opponent policies) to elicit informative behavior before submitting an executable hypothesis. Evaluated across twelve frontier LLMs, recovery quality ranges from 34% to 72% of the initial action-distance closed, and reconstructed policies provide a measurable competitive advantage, especially for weaker models. The work frames policy reconstruction as a tractable inverse problem in code-space, with implications for opponent modeling and policy interpretability.
RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL
A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids the error accumulation of temporal difference (TD) learning by making the number of Bellman recursions grow logarithmically, rather than linearly, with the horizon. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.
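The logarithmic-versus-linear contrast can be illustrated with exact shortest-path distances on a toy chain, used here as a tabular stand-in for goal-conditioned values. This is an analogy under assumed names and an assumed 17-state problem, not the post's algorithm, which works with learned value functions.

```python
import numpy as np

N = 17  # states 0..16 on a chain; hypothetical toy problem
edge = np.full((N, N), np.inf)
np.fill_diagonal(edge, 0.0)
for i in range(N - 1):
    edge[i, i + 1] = 1.0  # one step to the right costs 1

def one_step_update(d):
    """TD-like backup: extend every known path by a single edge."""
    return np.min(edge[:, :, None] + d[None, :, :], axis=1)

def transitive_update(d):
    """Divide-and-conquer backup: join two known paths at a midpoint w."""
    return np.min(d[:, :, None] + d[None, :, :], axis=1)

def sweeps_until_converged(update, d):
    """Count how many sweeps change the table before it stops changing."""
    count = 0
    while True:
        new = update(d)
        if np.array_equal(new, d):
            return count, d
        d, count = new, count + 1

td_sweeps, td_dist = sweeps_until_converged(one_step_update, edge)
dc_sweeps, dc_dist = sweeps_until_converged(transitive_update, edge)
print(td_sweeps, dc_sweeps)
```

Both rules reach the same distances, but on this chain the one-step rule needs 15 sweeps while the transitive rule needs 4, because each transitive sweep doubles the path length it can cover. Fewer chained backups means fewer opportunities for approximation error to compound when the table is replaced by a learned function.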
PAC-Bayes analysis establishes formal expressivity and alignment floors for prompt-conditioned LLMs
A new arXiv preprint models user-LLM interaction as a bilevel cheap-talk game and derives PAC-Bayes bounds showing two irreducible limitations: an 'expressivity floor' where language's finite channel capacity makes distinct tasks indistinguishable, and an 'objective-misalignment floor' where alignment constraints prevent reaching user-ideal outputs. The authors prove that prompt-conditioned LLMs cannot be universal problem solvers, as correct behavior on certain task families is provably unattainable even with infinite data, optimal training, or model scaling. The work suggests multimodal inputs and external memory as potential mitigations by increasing task-relevant information bandwidth.
