Almanac
4 · arXiv cs.LG (Machine Learning) · 11d ago

Unified MAIR framework bridges GP-UCB and DEC approaches in kernel bandits

A new arXiv preprint unifies two major theoretical frameworks for frequentist RKHS bandits — Gaussian-process upper confidence bound (GP-UCB) and decision-estimation-coefficient (DEC) methods — under a common algorithmic-information language called MAIR. The paper generalizes both the GP-UCB analysis and the MAMS algorithm, proposes a safeguarded master algorithm combining their advantages, and demonstrates that algorithmic complexity can be more informative than class-wide minimax certificates in overparameterized models. The work clarifies a foundational distinction between algorithmic information and minimax coefficients in bandit theory.
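For orientation, the selection rule usually associated with GP-UCB in the literature can be written as below. This is the standard textbook form, not notation taken from the preprint; the symbols $\mu_{t-1}$, $\sigma_{t-1}$, $\beta_t$, and $\mathcal{X}$ are assumptions introduced here for illustration.

```latex
x_t \;=\; \arg\max_{x \in \mathcal{X}} \; \mu_{t-1}(x) + \beta_t^{1/2}\, \sigma_{t-1}(x)
```

Here $\mu_{t-1}$ and $\sigma_{t-1}$ denote the posterior mean and standard deviation of the surrogate after $t-1$ observations, and $\beta_t$ is a confidence parameter that controls how strongly uncertainty is rewarded.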

Related events (8)

4 · arXiv · cs.AI · 6d ago

PCMA: Learning coordinated agent-specific preferences for multi-objective multi-agent RL

A new arXiv preprint introduces Preference Coordinated Multi-agent Policy Optimization (PCMA), a method for cooperative multi-objective multi-agent reinforcement learning (MOMARL) that learns agent-specific preferences to enable complementary trade-offs across agents. The authors formulate cooperative MOMARL as a team-optimal game and provide a first-order improvement decomposition showing that preference diversity can induce team improvement. Experiments on cooperative MOMA environments and a traffic-control scenario demonstrate improvements in both performance and trade-off coordination.

6 · arXiv · cs.AI · 5d ago

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

4 · arXiv · cs.AI · 3d ago

UBP2: Model-based preference RL with uncertainty-balanced exploration achieves sublinear regret

UBP2 (Uncertainty-Balanced Preference Planning) is a model-based reinforcement learning method that improves sample efficiency in preference-based RL by jointly reasoning over uncertainties in reward, dynamics, and value functions. The approach uses ensembles to score candidate trajectories and provides a principled exploitation-exploration tradeoff without ad hoc heuristics. The authors prove sublinear regret guarantees for finite- and infinite-horizon settings and demonstrate substantially better sample efficiency than model-free baselines on the Meta-World benchmark.

5 · arXiv · cs.LG · 18d ago

Reward uncertainty as a principled mechanism for diverse RL behaviour

A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.

5 · arXiv · cs.LG · 2d ago

Optimal deterministic multicalibration achieved, resolving open problem on randomization necessity

A new arXiv preprint resolves an open problem in multicalibration theory by constructing a minimax-optimal multicalibration algorithm that outputs a deterministic predictor, achieving the same O(ε⁻³) sample complexity previously only attainable by randomized predictors. The result extends to outcome indistinguishability, deterministic omnipredictors, and panpredictors with optimal sample complexity, resolving multiple open problems from recent works. Multicalibration is a fairness and reliability property requiring calibration to hold across reweighted subgroups, making this relevant to trustworthy ML research.

5 · arXiv · cs.CL · 3d ago

Multi-Agent Fictitious Play (MAFP) applies game-theoretic equilibrium-seeking to LLM decision-making

Researchers propose Multi-Agent Fictitious Play (MAFP), a multi-agent system paradigm that frames LLM-based decision-making as an equilibrium-seeking process borrowed from game theory. Each agent represents a stakeholder stance and iteratively best-responds to the empirical mixture of other agents' past decisions, addressing what the authors call 'stance entanglement' — mutual interdependence among stakeholder decisions that cannot be decomposed into independent subtasks. MAFP is evaluated on competitive strategy tasks and outperforms single-round and multi-round baselines on tournament strength and robustness metrics. The work extends the MAS literature beyond divide-and-conquer execution patterns into interdependent decision scenarios.
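The best-response-to-empirical-mixture loop described above is the core of classical fictitious play. The following is a minimal sketch of that classical procedure for a two-player zero-sum matrix game, not the paper's MAFP system: there are no LLM agents here, and the function name `fictitious_play`, the payoff matrix, and the initial belief are all assumptions chosen for illustration.

```python
def fictitious_play(payoff, rounds=5000):
    """Classical fictitious play on a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column
    plays j; the column player receives -payoff[i][j].
    Returns the empirical action frequencies of both players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Arbitrary initial belief (an assumption): each player has played action 0 once.
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] += 1
    col_counts[0] += 1

    for _ in range(rounds):
        # Payoff of each row action against the column player's past play
        # (unnormalized counts give the same argmax as the empirical mixture).
        row_values = [
            sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        # Payoff of each column action against the row player's past play.
        col_values = [
            -sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Both players best-respond simultaneously; ties go to the lowest index.
        best_row = max(range(n_rows), key=row_values.__getitem__)
        best_col = max(range(n_cols), key=col_values.__getitem__)
        row_counts[best_row] += 1
        col_counts[best_col] += 1

    row_total = sum(row_counts)
    col_total = sum(col_counts)
    return (
        [c / row_total for c in row_counts],
        [c / col_total for c in col_counts],
    )


if __name__ == "__main__":
    matching_pennies = [[1, -1], [-1, 1]]
    print(fictitious_play(matching_pennies, rounds=20000))
```

In matching pennies, the empirical frequencies of both players drift toward the mixed equilibrium of one half on each action, even though every individual move is a deterministic best response.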

4 · arXiv · cs.LG · 1mo ago

Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization

This paper addresses miscalibration in Gaussian process predictive distributions used for Bayesian optimization, focusing specifically on the lower tail relevant to minimization objectives. The authors introduce a framework for 'goal-oriented' spatial calibration below a threshold t, defining occurrence calibration and thresholded μ-calibration on sublevel sets. They propose tcGP, a post-hoc calibration method, and prove the resulting EI-based optimizer remains dense in the design space. Experiments on standard benchmarks show tcGP improves both lower-tail calibration and overall BO performance compared to standard and globally calibrated GP models.
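To see why the lower tail matters for an EI-based optimizer, it may help to look at the standard closed-form expected improvement for minimization under a Gaussian predictive distribution. The sketch below is the textbook EI formula only, not the paper's tcGP method; the function name `expected_improvement_min` and its parameters are assumptions introduced for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def expected_improvement_min(mu, sigma, best_so_far):
    """Expected improvement for minimization at a single candidate point.

    mu, sigma: predictive mean and standard deviation at the candidate.
    best_so_far: lowest objective value observed so far.
    """
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(best_so_far - mu, 0.0)
    z = (best_so_far - mu) / sigma
    # EI = (f_min - mu) * Phi(z) + sigma * phi(z)
    return (best_so_far - mu) * _STD_NORMAL.cdf(z) + sigma * _STD_NORMAL.pdf(z)


if __name__ == "__main__":
    print(expected_improvement_min(mu=0.2, sigma=0.5, best_so_far=0.0))
```

The value depends only on the probability mass the predictive distribution places below `best_so_far`, so a miscalibrated lower tail changes the ranking of candidates directly.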

7 · arXiv · cs.CL · 1mo ago

General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks

GPRL proposes a new alignment framework that replaces scalar reward models with a General Preference Model (GPM) embedding responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes per-dimension group-relative advantages, normalizes across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work directly addresses the gap between verifiable-reward online RL (strong on math/code) and preference optimization (strong on open-ended tasks).