5 · arXiv · cs.AI (Artificial Intelligence) · 21h ago

Solver-dependent Nash equilibrium selection on zero-sum polytopes: regularized methods select max-entropy members

A new arXiv preprint investigates whether different Nash equilibrium solvers systematically select different members of the Nash polytope in two-player zero-sum games. Using six analytically tractable games including Kuhn poker, the authors find that regularized last-iterate methods (R-NaD, magnetic mirror descent) converge to the maximum-entropy Nash equilibrium — interpretable as an information projection — while regret-averaging methods (CFR, CFR+, fictitious play) drift to lower-entropy boundary solutions. The distinction has downstream consequences for performance against sub-optimal opponents in games with sequential or hidden-information structure, with implications for multi-agent AI training and game-solving pipelines.
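To make "selecting a member of the Nash polytope" concrete, here is a minimal sketch on a hypothetical toy game that is not one of the paper's six: matching pennies with the row player's heads action duplicated. The payoff matrix `A` and the helper names are assumptions for illustration only. The code runs none of the solvers named above. It scans the row player's segment of equilibrium strategies, checks that each point is unexploitable, and locates the maximum-entropy member, which is also the point closest in KL divergence to the uniform strategy (one way to read "information projection").

```python
import math

# Hypothetical toy game (not from the paper): row player's payoffs,
# the column player receives the negative.
# Rows: heads, duplicate heads, tails. Columns: heads, tails.
A = [[1.0, -1.0],
     [1.0, -1.0],
     [-1.0, 1.0]]
GAME_VALUE = 0.0  # value of matching pennies

def exploitability(x):
    """Shortfall of row strategy x against a best-responding column player."""
    column_payoffs = [sum(x[i] * A[i][j] for i in range(3)) for j in range(2)]
    return GAME_VALUE - min(column_payoffs)

def entropy(x):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in x if p > 0)

def kl_to_uniform(x):
    """KL divergence from x to the uniform strategy over its actions."""
    n = len(x)
    return sum(p * math.log(p * n) for p in x if p > 0)

# Every row strategy (t, 0.5 - t, 0.5) with 0 <= t <= 0.5 is a Nash strategy:
# only the total weight on "heads" matters, not how it is split.
segment = [(k / 200, 0.5 - k / 200, 0.5) for k in range(101)]

max_entropy_member = max(segment, key=entropy)
projection = min(segment, key=kl_to_uniform)
vertex = segment[0]  # a boundary solution that never plays the first row

print("max-entropy member:", max_entropy_member)
print("entropy at vertex vs interior:", entropy(vertex), entropy(max_entropy_member))
```

All points on the segment earn the same value against a best-responding opponent, so they can differ only against opponents who deviate from equilibrium, which is where the summary locates the downstream consequences.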

Evaluation and Benchmarking · CFR · Kuhn poker · R-NaD · Which Nash Equilibrium? Solver-Dependent Selection on Zero-Sum Nash Polytopes · magnetic mirror descent

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · arXiv · cs.LG · 24d ago

DNQ: Deep Nash Q-Network framework for equilibrium learning in multi-agent bidding games

Researchers propose DNQ (Deep Nash Q-Network), a solver-in-the-loop framework for training agents to reach Nash equilibria in partially observable n-player simultaneous bidding games. The method alternates between trajectory collection, critic-based payoff estimation, external equilibrium computation, and policy imitation via KL divergence minimization. A scalable pairwise payoff formulation is shown to outperform the exact N-player tensor approach in computational cost while maintaining strategic quality, with experiments demonstrating the trade-off between fidelity and scalability as agent count grows.

DNQ · DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

4 · arXiv · cs.LG · 24d ago

Repeated Policy Regret (RP-Regret): Regret minimization against adaptive opponents in repeated games

This arXiv paper introduces Repeated Policy Regret (RP-Regret), a new game-theoretic metric for regret minimization in repeated games where opponents can adapt based on play history — a setting where standard external regret fails. The authors prove necessary conditions for sublinear RP-Regret and propose three algorithms to minimize it, including oracle-based, linearized surrogate, and slow-opponent variants. When all players minimize RP-Regret, certain subgame perfect equilibria can be learned, and experiments show more cooperative outcomes in games like Stag-Hunt.

Evaluation and Benchmarking · Repeated Policy Regret (RP-Regret) · Regret Minimization with Adaptive Opponents in Repeated Games

4 · arXiv · cs.AI · 1mo ago

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

This paper analyzes preference-shaped expected improvement criteria for Bayesian multiobjective optimization, focusing on hypervolume (EHVI) and R2 indicator families. The authors establish which preference transformations preserve exact computation, Pareto compatibility, and monotonicity, and which alter the underlying geometry. A key result is that exact integral R2 improvement is not generally an objective-space weighted hypervolume but is exactly a scalarization-space volume (Tchebycheff shadow measure), enabling new finite-sum and quadrature algorithms for ER2I. The work also provides an achievement-space Gaussian surrogate formulation reducing ER2I to an integral of scalar Gaussian expected improvements.

Evaluation and Benchmarking · Tchebycheff Scalarization · Bayesian Multiobjective Optimization · Expected Hypervolume Improvement (EHVI)

7 · arXiv · cs.LG · 1mo ago

Entropy-Cut Metropolis-Hastings: Sampling-Based Reasoning Without RL Training

This paper introduces Entropy-Cut Metropolis-Hastings (ECMH), an algorithm that samples from a 'power distribution' over base language model outputs to elicit strong reasoning without reinforcement learning posttraining. Rather than cutting reasoning traces at uniformly random positions, ECMH uses next-token entropy as a proxy to identify consequential decision points (e.g., choice of proof strategy), then resamples from those positions. The authors prove that mixing time scales with the number of decisions rather than tokens, and demonstrate consistent improvements over RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.
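The sampling idea can be illustrated on a toy scale. The sketch below assumes that "power distribution" means a target proportional to the base probabilities raised to a power `alpha`, and it uses a plain Metropolis-Hastings independence sampler over three discrete outcomes. It is not ECMH: there are no reasoning traces and no entropy-based cut points, and the names `power_target` and `mh_power_sample` are hypothetical.

```python
import random

def power_target(p, alpha):
    """Normalized distribution proportional to p_i ** alpha (assumed definition)."""
    weights = [q ** alpha for q in p]
    total = sum(weights)
    return [w / total for w in weights]

def mh_power_sample(p, alpha, steps, seed=0):
    """Metropolis-Hastings with proposals drawn from the base distribution p.

    With proposal q = p and target proportional to p ** alpha, the acceptance
    ratio simplifies to (p[proposal] / p[current]) ** (alpha - 1).
    Assumes every entry of p is strictly positive.
    """
    rng = random.Random(seed)
    outcomes = list(range(len(p)))
    current = rng.choices(outcomes, weights=p)[0]
    counts = [0] * len(p)
    for _ in range(steps):
        proposal = rng.choices(outcomes, weights=p)[0]
        accept_prob = min(1.0, (p[proposal] / p[current]) ** (alpha - 1.0))
        if rng.random() < accept_prob:
            current = proposal
        counts[current] += 1
    return [c / steps for c in counts]

base = [0.5, 0.3, 0.2]   # toy "base model" over three outcomes
alpha = 4.0              # sharpening exponent, chosen arbitrarily
empirical = mh_power_sample(base, alpha, steps=100_000)
print("target:   ", power_target(base, alpha))
print("empirical:", empirical)
```

Raising the base probabilities to a power above 1 concentrates mass on the outcomes the base distribution already favors, and the sampler reaches that target using only draws from, and probability ratios under, the base distribution.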

Frontier Model Releases · Evaluation and Benchmarking · power distribution · MATH500 · Entropy-Cut Metropolis-Hastings

5 · arXiv · cs.LG · 26d ago

Reward uncertainty as a principled mechanism for diverse RL behaviour

A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.

Evaluation and Benchmarking · Alignment and RLHF · Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

7 · arXiv · cs.LG · 1mo ago

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

This paper introduces Equilibrium Reasoners (EqR), a framework that formalizes test-time compute scaling through learned task-conditioned attractors in latent space, where stable fixed points correspond to valid solutions. EqR scales along two axes—depth (more iterations) and breadth (aggregating stochastic trajectories)—without requiring external verifiers or task-specific priors. On Sudoku-Extreme, unrolling up to 40,000 equivalent layers boosts accuracy from 2.6% (feedforward baseline) to over 99%. The work provides a mechanistic lens for understanding why iterative latent models generalize beyond memorized patterns.

Long Context Evolution · Evaluation and Benchmarking · task-conditioned attractors · latent dynamical systems · Sudoku-Extreme

5 · arXiv · cs.CL · 11d ago

Multi-Agent Fictitious Play (MAFP) applies game-theoretic equilibrium-seeking to LLM decision-making

Researchers propose Multi-Agent Fictitious Play (MAFP), a multi-agent system paradigm that frames LLM-based decision-making as an equilibrium-seeking process borrowed from game theory. Each agent represents a stakeholder stance and iteratively best-responds to the empirical mixture of other agents' past decisions, addressing what the authors call 'stance entanglement' — mutual interdependence among stakeholder decisions that cannot be decomposed into independent subtasks. MAFP is evaluated on competitive strategy tasks and outperforms single-round and multi-round baselines on tournament strength and robustness metrics. The work extends the MAS literature beyond divide-and-conquer execution patterns into interdependent decision scenarios.
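For readers unfamiliar with the underlying procedure, the sketch below shows classical fictitious play in a two-player zero-sum matrix game: each player best-responds to the empirical mixture of the other's past actions. It is not MAFP itself, since there are no LLM agents or stakeholder stances, and the function name, the matching-pennies payoffs, and the tie-breaking rule (lowest index wins) are assumptions made for illustration.

```python
def fictitious_play(payoff, rounds):
    """Classical simultaneous fictitious play in a zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives
    the negative. Returns each player's empirical action frequencies.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    # Arbitrary opening move: both players start with their first action.
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(rounds - 1):
        # Value of each action against the opponent's empirical counts.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        # Best responses; ties go to the lowest-indexed action.
        best_row = max(range(n_rows), key=row_values.__getitem__)
        best_col = max(range(n_cols), key=col_values.__getitem__)
        row_counts[best_row] += 1
        col_counts[best_col] += 1
    return ([c / rounds for c in row_counts],
            [c / rounds for c in col_counts])

MATCHING_PENNIES = [[1, -1],
                    [-1, 1]]
row_freq, col_freq = fictitious_play(MATCHING_PENNIES, rounds=20_000)
print("row frequencies:", row_freq)
print("column frequencies:", col_freq)
```

In this game the individual actions keep cycling, but the empirical frequencies should drift toward the even mix that is the unique equilibrium. The averaging of past play, rather than the last action, is what carries the equilibrium-seeking behavior.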

Evaluation and Benchmarking · Agent and Tool Ecosystem · Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play · Multi-Agent Fictitious Play

5 · arXiv · cs.AI · 26d ago

VEPO: Vision-anchored token selection improves RL for visual reasoning

A new arXiv paper identifies a failure mode of entropy-based credit assignment in multimodal reinforcement learning: vision-sensitive tokens with naturally low entropy are systematically ignored, causing the mechanism to collapse in visual reasoning tasks. The authors propose VEPO (Vision-Entropy token-selection for Policy Optimization), which couples visual sensitivity with token entropy via a multiplicative scheme to redirect gradient credit toward tokens that are both visually grounded and semantically informative. VEPO outperforms entropy-only baselines by 2.28 points at 7B scale and 3.15 points at 3B scale on visual reasoning benchmarks.

Alignment and RLHF · Multimodal Progress · VEPO · Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection