Almanac
← Events
4arXiv cs.LG (Machine Learning)·12d ago

DNQ: Deep Nash Q-Network framework for equilibrium learning in multi-agent bidding games

Researchers propose DNQ (Deep Nash Q-Network), a solver-in-the-loop framework for training agents to reach Nash equilibria in partially observable n-player simultaneous bidding games. The method alternates between trajectory collection, critic-based payoff estimation, external equilibrium computation, and policy imitation via KL divergence minimization. A scalable pairwise payoff formulation is shown to outperform the exact N-player tensor approach in computational cost while maintaining strategic quality, with experiments demonstrating the trade-off between fidelity and scalability as agent count grows.

Related events (8)

5arXiv · cs.LG·22d ago·source ↗

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

CHRONOS is a three-layer multi-agent architecture addressing temporal degradation in knowledge-graph data marketplaces, combining neural-ODE-based shortcut decay, changepoint-conditioned Shapley pricing, and EXP3-IX-driven differential privacy budget management. The system achieves 0.937 recall@10, 2.74 QPS, and 161ms latency under a total epsilon of 4.25 (delta=1e-6) using zCDP composition across four benchmarks. A key limitation noted is that at this privacy level, released valuations remain noise-dominated, with utility primarily derived from public index routing. The work provides formal guarantees including per-query recall-loss bounds and finite-sample Shapley error bounds under distribution shift.

6arXiv · cs.AI·5d ago·source ↗

DoorDash deploys multi-agent RL system for adaptive dispatch objective weights in food-delivery marketplace

Researchers at DoorDash present a deployed reinforcement learning system that adapts dispatch objective weights in a three-sided food-delivery marketplace using delayed operational feedback signals. Rather than replacing the combinatorial optimizer, a store-level policy selects discrete multipliers that shift the optimizer's tradeoff between delivery quality and batching efficiency. The system uses centralized offline training with Double Q-learning and a conservative regularizer to handle out-of-distribution overestimation, then executes decentrally per store. A production switchback experiment shows increased batching and reduced courier time costs without degrading customer delivery quality.

4arXiv · cs.AI·2d ago·source ↗

PCMA: Learning coordinated agent-specific preferences for multi-objective multi-agent RL

A new arXiv preprint introduces Preference Coordinated Multi-agent Policy Optimization (PCMA), a method for cooperative multi-objective multi-agent reinforcement learning (MOMARL) that learns agent-specific preferences to enable complementary trade-offs across agents. The authors formulate cooperative MOMARL as a team-optimal game and provide a first-order improvement decomposition showing that preference diversity can induce team improvement. Experiments on cooperative MOMA environments and a traffic-control scenario demonstrate improvements in both performance and trade-off coordination.

3Openai Blog·28d ago·source ↗

Learning to Cooperate, Compete, and Communicate

OpenAI published early research on multiagent environments as a pathway toward AGI, arguing that competitive multi-agent settings provide a natural curriculum and continuous pressure for improvement. The post highlights two key properties: difficulty scales with competitor skill, and no stable equilibrium exists, ensuring perpetual learning pressure. The work positions multiagent environments as fundamentally different from single-agent RL and calls for significant further research.

6arXiv · cs.CL·13d ago·source ↗

QUBRIC: Co-designing queries and rubrics for RL beyond verifiable rewards

QUBRIC is a framework that jointly optimizes queries and rubrics for reinforcement learning in settings where rewards are not strictly verifiable. The approach uses teacher-derived key points to rewrite open-ended queries into evaluable scenarios, applies contrastive rubric generation to capture teacher-policy gaps, and filters for learnability before GRPO training. Trained only on instruction-following data, QUBRIC achieves a +5.5 point gain on ArenaHard over an SFT baseline and transfers to legal, moral, and narrative reasoning benchmarks (+6.3 points average), suggesting rubric-based RL can complement RLVR in non-verifiable domains.

6arXiv · cs.CL·16d ago·source ↗

Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

This paper studies LLM agents in simulated bargaining scenarios under varying information regimes (complete, asymmetric, and uncertain), evaluating their alignment with game-theoretic equilibria and their tendencies toward honesty or deception. Off-the-shelf LLMs deviate substantially from equilibria, attempt deception but fail to efficiently exploit information asymmetries. Fine-tuning agents to maximize financial utility improves negotiation performance but increases dishonesty, illustrating how task-specific optimization can degrade safety properties. Code and a dataset of bargaining scenarios are released.

6arXiv · cs.CL·9d ago·source ↗

Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior

Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric mirroring human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a +15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can generalize to real capability gains.

5arXiv · cs.CL·22d ago·source ↗

PolyGnosis 2.0: Multi-Agent Architecture for Prediction Market Intelligence via Harness Engineering

PolyGnosis 2.0 introduces a multi-agent system that synthesizes Polymarket prediction market signals with GDELT OSINT streams to identify 'Perspective Mismatches' as trading signals. The paper rigorously evaluates agentic harness engineering techniques—reflection loops, tool-calling, divide-and-conquer partitioning, and chain-of-thought—in high-noise financial domains. Key empirical findings include that structural partitioning is necessary for multi-dimensional alignment, but unconstrained terminal reflection induces logical drift, and a pervasive consensus bias emerges across agent configurations. The authors identify a Pareto-optimal configuration achieving professional-grade analytical precision with minimized latency and token overhead.