5 · arXiv cs.LG (Machine Learning) · 25d ago

GoBOED: Goal-Driven Bayesian Optimal Experimental Design for Decision-Focused Robustness

GoBOED is a new framework for Bayesian optimal experimental design (BOED) that replaces information-gain maximization with direct optimization of a specified downstream decision objective. It combines an amortized variational posterior surrogate with a differentiable convex decision layer, which enables gradient-based, decision-focused design optimization. The authors prove that GoBOED gradients are insensitive to parameter directions that are irrelevant to the decision goal, which formally justifies why goal-driven design achieves equivalent decision quality over a wider range of experimental designs. Empirical results across source localization, epidemic management, and pharmacokinetic control show improved alignment with decision objectives compared to goal-agnostic BOED.
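As a toy illustration of how a decision objective can separate designs that information gain does not, the sketch below uses a two-parameter linear-Gaussian model. It is not GoBOED's pipeline (there is no amortized posterior and no convex decision layer); the model, the angle-valued design `d`, the goal vector `c`, the noise level `SIGMA`, and all function names are assumptions made for illustration.

```python
import numpy as np

SIGMA = 0.5  # assumed observation noise standard deviation


def information_matrix(d):
    # One measurement y = x(d)^T theta + noise, with x(d) = (cos d, sin d).
    x = np.array([np.cos(d), np.sin(d)])
    # Posterior precision under a standard normal prior on theta.
    return np.eye(2) + np.outer(x, x) / SIGMA**2


def expected_information_gain(d):
    # Goal-agnostic criterion: 0.5 * log det of the posterior precision
    # (the prior precision is the identity, so its log det is zero).
    _, logdet = np.linalg.slogdet(information_matrix(d))
    return 0.5 * logdet


def decision_risk(d, c):
    # Goal-driven criterion: posterior variance of c^T theta, i.e. the Bayes
    # risk of estimating that quantity under squared-error loss.
    cov = np.linalg.inv(information_matrix(d))
    return float(c @ cov @ c)


c = np.array([1.0, 0.0])  # assumed goal: the decision depends only on theta[0]
designs = np.linspace(0.0, np.pi / 2, 5)
eig = [expected_information_gain(d) for d in designs]
risk = [decision_risk(d, c) for d in designs]
```

In this toy setup every design has the same expected information gain, while the decision risk is smallest at `d = 0`, where the measurement direction is aligned with `c`.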

Evaluation and Benchmarking · Agent and Tool Ecosystem · GoBOED · differentiable convex optimization · amortized variational inference · Bayesian Optimal Experimental Design

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 11d ago

GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment

Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly, claiming improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.
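The entropy-monitoring step described above can be sketched as follows. This shows only how token-level entropy might flag high-uncertainty decoding steps; it does not implement GGRO's gradient-guided nudging, and the function names and the threshold value are hypothetical.

```python
import numpy as np


def token_entropy(logits):
    # Entropy (in nats) of the softmax distribution at each decoding step.
    # `logits` has shape (steps, vocab_size).
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)


def high_uncertainty_steps(logits, threshold):
    # Indices of steps whose entropy exceeds an assumed, hand-picked threshold.
    return np.flatnonzero(token_entropy(logits) > threshold)


logits = np.array([
    [0.0, 0.0, 0.0, 0.0],   # uniform over 4 tokens: entropy = log(4)
    [10.0, 0.0, 0.0, 0.0],  # sharply peaked: entropy close to 0
])
flagged = high_uncertainty_steps(logits, threshold=1.0)
```

With these inputs only the first step is flagged, since a uniform distribution over four tokens has entropy log 4 ≈ 1.386 nats.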

Inference Economics · Alignment and RLHF · Best-of-N Sampling · Gradient-Guided Reward Optimization

4 · arXiv · cs.LG · 1mo ago

Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization

This paper addresses miscalibration in Gaussian process predictive distributions used for Bayesian optimization, focusing specifically on the lower tail relevant to minimization objectives. The authors introduce a framework for 'goal-oriented' spatial calibration below a threshold t, defining occurrence calibration and thresholded μ-calibration on sublevel sets. They propose tcGP, a post-hoc calibration method, and prove the resulting EI-based optimizer remains dense in the design space. Experiments on standard benchmarks show tcGP improves both lower-tail calibration and overall BO performance compared to standard and globally calibrated GP models.
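For context on why the lower tail matters, here is a minimal sketch of the standard closed-form Expected Improvement (EI) for minimization under a Gaussian predictive distribution. It is not the tcGP calibration method; `best` denotes the incumbent (lowest observed) value and is an assumed name, distinct from the paper's threshold t.

```python
import math


def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - Y, 0)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # Degenerate predictive distribution: improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf


ei_at_incumbent = expected_improvement(mu=0.0, sigma=1.0, best=0.0)
ei_far_above = expected_improvement(mu=2.0, sigma=1.0, best=0.0)
```

EI depends only on the predictive mass below `best`, so a miscalibrated lower tail distorts the acquisition value even when the rest of the distribution is well calibrated.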

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gaussian Process · Expected Improvement · Bayesian Optimization · +2 more

6 · arXiv · cs.LG · 18d ago

Drifting Preference Optimization (DrPO) for One-Step Text-to-Image Generators

DrPO is a new online preference fine-tuning method designed specifically for deterministic one-step text-to-image generators like SD-Turbo and SDXL-Turbo, which are difficult to align with standard RLHF methods that require policy likelihoods or differentiable reward gradients. The method samples candidates per prompt, ranks them with a target reward, and synthesizes a feature-space update direction via a non-parametric dipole preference field plus a reference drift from the frozen base model. Because the reward is used only for ranking, DrPO supports black-box and non-differentiable reward functions while keeping inference as a single forward pass. Evaluations on HPSv3 and GenEval show improved alignment over reward-gradient-free baselines and a 3.51× reduction in training compute by eliminating reward-model backpropagation.

Inference Economics · Alignment and RLHF · SDXL Turbo · HPSv3 · GenEval · +4 more

4 · arXiv · cs.LG · 12d ago

MG-ADSGD achieves optimal communication complexity for decentralized stochastic strongly convex optimization

Researchers propose Multi-Gossip Accelerated DSGD (MG-ADSGD), a decentralized stochastic optimization algorithm that simultaneously achieves accelerated dependence on both the condition number (√κ) and the network spectral gap (1/√(1-β)), a combination no prior stochastic method had attained. The algorithm couples gossip depth with mini-batch size so that additional communication rounds improve both consensus accuracy and gradient variance reduction. The resulting communication complexity is claimed to be the best currently known for decentralized stochastic strongly convex optimization up to logarithmic factors.
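To make the role of gossip depth concrete, the sketch below runs plain (non-accelerated) gossip averaging on a 4-node ring. It is not MG-ADSGD; the mixing matrix `W`, the local values, and the function names are assumptions chosen so that the effect of extra communication rounds on consensus error is easy to see.

```python
import numpy as np

# Assumed mixing matrix for a 4-node ring: symmetric and doubly stochastic,
# so each round preserves the network-wide average.
W = np.array([
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
])


def gossip(values, rounds):
    # Each round, every node replaces its value with a weighted average of
    # its own value and its neighbours' values.
    x = np.asarray(values, dtype=float)
    for _ in range(rounds):
        x = W @ x
    return x


def consensus_error(x):
    # Distance of the node values from their common average.
    return float(np.linalg.norm(x - x.mean()))


local = np.array([4.0, 0.0, -2.0, 6.0])  # hypothetical per-node quantities
after_1 = gossip(local, 1)
after_5 = gossip(local, 5)
```

More rounds shrink the consensus error geometrically at a rate set by the second-largest eigenvalue magnitude of `W` (0.5 for this matrix), which is the quantity the spectral-gap term in the summary refers to.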

Training Infrastructure · Multi-Gossip Accelerated DSGD · Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

7 · arXiv · cs.CL · 23d ago

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap': under standard RL training (GRPO), tool use is attempted in only ~30% of rollouts, and tool-using subgroups in which every rollout is wrong suppress the learning signal. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by 1.8 pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 with 4× fewer parameters.
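Why an all-wrong group contributes no learning signal can be seen from the group-normalized advantage commonly used to describe GRPO. The sketch below shows only that normalization under assumed binary rewards; it does not implement AXPO's prefix fixing or resampling, and the function name and `eps` value are illustrative.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    # Group-relative advantage: each rollout's reward is centred on the group
    # mean and scaled by the group standard deviation.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])  # every rollout fails
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])      # one rollout succeeds
```

When every rollout in a group receives the same reward, all advantages are zero and the group produces no policy gradient; a single success is enough to give nonzero advantages.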

Frontier Model Releases · Agent and Tool Ecosystem · AXPO · GRPO · Thinking-Acting Gap · +4 more

5 · arXiv · cs.AI · 1mo ago

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

This paper introduces an agentic framework in which an LLM acts as an operations research expert, translating natural-language user prompts into structured updates ('patches') to deployed optimization models and selecting appropriate re-optimization techniques from a toolbox. The toolbox leverages primal information (historical solutions, valid inequalities, solver configurations, and metaheuristics) to accelerate re-optimization while preserving solution quality. Experiments on supply chain re-optimization and university exam scheduling demonstrate computational efficiency gains and improved interpretability through patch-based model modifications. The framework aims to reduce dependence on OR experts for maintaining dynamic decision-support systems.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · LLM-Guided Model Patches · agentic re-optimization framework · supply chain re-optimization · +2 more

6 · arXiv · cs.CL · 2d ago

GraphPO: Graph-based Policy Optimization reduces redundancy in LLM reasoning RL

GraphPO is a new reinforcement learning framework that represents reasoning rollouts as directed acyclic graphs rather than independent chains or trees, merging semantically equivalent reasoning paths into equivalence classes to share suffixes and reduce redundant exploration. The approach assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, deriving process supervision from outcome rewards. Experiments on three LLMs across reasoning and agentic search benchmarks show consistent improvements over chain- and tree-based baselines under equal token or response budgets. The method also provides theoretical guarantees on reduced advantage-estimation variance.

Frontier Model Releases · Alignment and RLHF · GraphPO · GraphPO: Graph-based Policy Optimization for Reasoning Models

5 · arXiv · cs.CL · 9d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.

Alignment and RLHF · AGDO · Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models