4arXiv cs.LG (Machine Learning)·Jun 5, 2026

DNQ: Deep Nash Q-Network framework for equilibrium learning in multi-agent bidding games

Researchers propose DNQ (Deep Nash Q-Network), a solver-in-the-loop framework for training agents to reach Nash equilibria in partially observable n-player simultaneous bidding games. The method alternates between trajectory collection, critic-based payoff estimation, external equilibrium computation, and policy imitation via KL divergence minimization. A scalable pairwise payoff formulation is shown to cost less to compute than the exact n-player tensor approach while maintaining strategic quality, and experiments demonstrate the trade-off between fidelity and scalability as the agent count grows.

DNQ DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

Related events (8)

5arXiv · cs.AI·Jun 29, 2026·source ↗

Solver-dependent Nash equilibrium selection on zero-sum polytopes: regularized methods select max-entropy members

A new arXiv preprint investigates whether different Nash equilibrium solvers systematically select different members of the Nash polytope in two-player zero-sum games. Using six analytically tractable games including Kuhn poker, the authors find that regularized last-iterate methods (R-NaD, magnetic mirror descent) converge to the maximum-entropy Nash equilibrium — interpretable as an information projection — while regret-averaging methods (CFR, CFR+, fictitious play) drift to lower-entropy boundary solutions. The distinction has downstream consequences for performance against sub-optimal opponents in games with sequential or hidden-information structure, with implications for multi-agent AI training and game-solving pipelines.

Evaluation and Benchmarking CFR Kuhn poker R-NaD +2 more

4arXiv · cs.LG·Jul 13, 2026·source ↗

Semantic Pareto-DQN: Multi-Objective RL with LLM State Representations for Financial Fraud Detection

Researchers propose Semantic Pareto-DQN, a multi-objective reinforcement learning framework that addresses class imbalance in financial anomaly detection without data resampling. The approach encodes heterogeneous transaction features as natural-language narratives via LLMs to produce scale-invariant state representations, then optimizes a vectorial reward that decouples fraud detection, false-positive friction, and semantic discovery across the Pareto frontier. Empirical results on E-Commerce fraud and UCI Credit datasets show improved minority-class recall over scalarized baselines, avoiding the 'fraud collapse' failure mode.

Evaluation and Benchmarking Semantic Pareto-DQN UCI Credit dataset

3arXiv · cs.AI·Jul 16, 2026·source ↗

DVM-HALL model and Net Human-Agent Score proposed for AI agent loyalty dynamics in autonomous commerce

An arXiv preprint introduces the Dynamic Verifiable Multi-Agent Human Agentic Loyalty Loop (DVM-HALL) model, a theoretical framework for understanding brand loyalty when AI agents autonomously execute purchasing decisions on behalf of humans. The model formalizes brand selection via a softmax formulation incorporating emotional equity, agentic utility, trust, delegated authority, and verifiable execution, with recursive trust-updating mechanisms. It also introduces the Net Human-Agent Score (NHAS), a risk-weighted metric for measuring human-agent alignment using feedback logs and verifiable receipts. The framework extends into DeFi and tokenized loyalty settings, treating execution risks like gas costs and MEV exposure as predictors of agentic brand preference.

Enterprise Deployment Patterns Agent and Tool Ecosystem Net Human-Agent Score Dynamic Verifiable Multi-Agent Human Agentic Loyalty Loop

6arXiv · cs.CL·5d ago·source ↗

Skill Self-Play: Co-evolving LLM capabilities via structured self-play with dynamic skill routing

Researchers introduce Skill Self-Play (Skill-SP), a reinforcement learning framework that addresses the diversity-vs-verifiability dilemma in LLM self-evolution by using agent skills as a middle ground. The system comprises a proposer, solver, and dynamic skill controller that co-evolve in a continuous loop: the proposer generates tasks conditioned on sampled skills, the solver explores solutions, and the skill controller updates an expanding skill library based on execution feedback. Evaluations on tool-use and reasoning benchmarks show consistent performance gains on capable backbones and recovery for initially misaligned models. Code is released under the Qwen-Applications GitHub organization, suggesting Alibaba/Qwen team involvement.

Frontier Model Releases Agent and Tool Ecosystem Skill Self-Play: Pushing the Frontier of LLM Capability with Co-Evolving Skills Alibaba Qwen +2 more

5arXiv · cs.LG·May 25, 2026·source ↗

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

CHRONOS is a three-layer multi-agent architecture addressing temporal degradation in knowledge-graph data marketplaces, combining neural-ODE-based shortcut decay, changepoint-conditioned Shapley pricing, and EXP3-IX-driven differential privacy budget management. The system achieves 0.937 recall@10, 2.74 QPS, and 161 ms latency under a total epsilon of 4.25 (delta = 1e-6) using zCDP composition across four benchmarks. A key limitation noted is that at this privacy level, released valuations remain noise-dominated, with utility primarily derived from public index routing. The work provides formal guarantees including per-query recall-loss bounds and finite-sample Shapley error bounds under distribution shift.

Evaluation and Benchmarking AI Safety Research Differential Privacy CHRONOS Gaussian mechanism +6 more

5arXiv · cs.CL·Jun 18, 2026·source ↗

Multi-Agent Fictitious Play (MAFP) applies game-theoretic equilibrium-seeking to LLM decision-making

Researchers propose Multi-Agent Fictitious Play (MAFP), a multi-agent system paradigm that frames LLM-based decision-making as an equilibrium-seeking process borrowed from game theory. Each agent represents a stakeholder stance and iteratively best-responds to the empirical mixture of other agents' past decisions, addressing what the authors call 'stance entanglement' — mutual interdependence among stakeholder decisions that cannot be decomposed into independent subtasks. MAFP is evaluated on competitive strategy tasks and outperforms single-round and multi-round baselines on tournament strength and robustness metrics. The work extends the MAS literature beyond divide-and-conquer execution patterns into interdependent decision scenarios.

Evaluation and Benchmarking Agent and Tool Ecosystem Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play Multi-Agent Fictitious Play

6arXiv · cs.AI·Jun 12, 2026·source ↗

DoorDash deploys multi-agent RL system for adaptive dispatch objective weights in food-delivery marketplace

Researchers at DoorDash present a deployed reinforcement learning system that adapts dispatch objective weights in a three-sided food-delivery marketplace using delayed operational feedback signals. Rather than replacing the combinatorial optimizer, a store-level policy selects discrete multipliers that shift the optimizer's trade-off between delivery quality and batching efficiency. The system uses centralized offline training with Double Q-learning and a conservative regularizer to handle out-of-distribution overestimation, then runs as a decentralized policy at each store. A production switchback experiment shows increased batching and reduced courier time costs without degrading customer delivery quality.

Enterprise Deployment Patterns Agent and Tool Ecosystem Double Q-learning DoorDash Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

5arXiv · cs.CL·Jul 13, 2026·source ↗

Agora: Auction-based task allocation framework for LLM agent reasoning

Researchers introduce Agora, a framework that uses an incentive-compatible auction mechanism to dynamically route reasoning subtasks to the most capable expert models or tools, rather than relying on coarse-grained function matching. Agents bid based on 'rectified competence' to prevent overconfident solvers from capturing critical logic steps. Evaluations across five benchmarks show improvements over single-model, routing, and cascade baselines, with a controllable cost-quality trade-off via a single auction parameter.

Evaluation and Benchmarking Agent and Tool Ecosystem AGORA
