5 · arXiv cs.LG (Machine Learning) · 26d ago

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

CHRONOS is a three-layer multi-agent architecture addressing temporal degradation in knowledge-graph data marketplaces, combining neural-ODE-based shortcut decay, changepoint-conditioned Shapley pricing, and EXP3-IX-driven differential privacy budget management. The system achieves 0.937 recall@10, 2.74 QPS, and 161ms latency under a total epsilon of 4.25 (delta=1e-6) using zCDP composition across four benchmarks. A key limitation noted is that at this privacy level, released valuations remain noise-dominated, with utility primarily derived from public index routing. The work provides formal guarantees including per-query recall-loss bounds and finite-sample Shapley error bounds under distribution shift.
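
The privacy accounting mentioned above can be illustrated with the standard zCDP identities: a Gaussian release with sensitivity Δ and noise σ is ρ-zCDP with ρ = Δ²/(2σ²), ρ values add under composition, and ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP. The sketch below is a minimal illustration of those identities only, not the paper's accountant; the 100-release split and unit sensitivity are hypothetical values chosen for the example.

```python
import math

def gaussian_rho(sensitivity: float, sigma: float) -> float:
    """zCDP parameter of one Gaussian release: rho = sensitivity^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)

def compose(rhos) -> float:
    """zCDP composes additively across releases."""
    return sum(rhos)

def zcdp_to_eps(rho: float, delta: float) -> float:
    """Standard conversion: rho-zCDP implies (eps, delta)-DP with this eps."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def eps_to_rho(eps: float, delta: float) -> float:
    """Invert the conversion above (solve the quadratic in sqrt(rho))."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(log_term + eps) - math.sqrt(log_term)) ** 2

# Total budget taken from the summary: epsilon = 4.25, delta = 1e-6.
total_rho = eps_to_rho(4.25, 1e-6)

# Hypothetical split: 100 equal releases, each with sensitivity 1.
num_releases = 100
per_release_rho = total_rho / num_releases
sigma = 1.0 / math.sqrt(2.0 * per_release_rho)  # noise std per release

print(f"total rho = {total_rho:.4f}, per-release sigma = {sigma:.2f}")
```

Under these assumed numbers the per-release noise scale is large relative to a unit-sensitivity signal, which is one way to read the summary's remark that released valuations stay noise-dominated at this budget.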

Evaluation and Benchmarking · AI Safety Research · Agent and Tool Ecosystem · Differential Privacy · CHRONOS · Gaussian mechanism · zCDP composition · temporal knowledge graph · neural-ODE temporal decay · EXP3-IX · Shapley valuation

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 25d ago

PolyGnosis 2.0: Multi-Agent Architecture for Prediction Market Intelligence via Harness Engineering

PolyGnosis 2.0 introduces a multi-agent system that synthesizes Polymarket prediction market signals with GDELT OSINT streams to identify 'Perspective Mismatches' as trading signals. The paper rigorously evaluates agentic harness engineering techniques—reflection loops, tool-calling, divide-and-conquer partitioning, and chain-of-thought—in high-noise financial domains. Key empirical findings include that structural partitioning is necessary for multi-dimensional alignment, but unconstrained terminal reflection induces logical drift, and a pervasive consensus bias emerges across agent configurations. The authors identify a Pareto-optimal configuration achieving professional-grade analytical precision with minimized latency and token overhead.

Evaluation and Benchmarking · Agent and Tool Ecosystem · PolyGnosis 2.0 · Divide-and-Conquer Partitioning · Harness Engineering

6 · arXiv · cs.CL · 11d ago

CHAP: Collaborative Human-Agent Protocol for structured human-AI accountability in multi-agent deployments

Researchers from BrightbeamAI introduce CHAP (Collaborative Human-Agent Protocol), a protocol specification for formalizing human-agent collaboration in production multi-agent systems. CHAP defines shared workspaces, structured override events with diffs and rationales, non-repudiable signed approvals, and an append-only evidence log, filling a gap left by MCP (tool access) and A2A (agent-to-agent interoperability). The protocol ships with a reference implementation, conformance suite, and worked examples. It targets high-stakes deployments in domains like clinical decisions, contracts, and code where human judgment must be auditable and replayable.

AI Safety Research · Agent and Tool Ecosystem · BrightbeamAI · Collaborative Human-Agent Protocol · MCP

4 · arXiv · cs.LG · 15d ago

DNQ: Deep Nash Q-Network framework for equilibrium learning in multi-agent bidding games

Researchers propose DNQ (Deep Nash Q-Network), a solver-in-the-loop framework for training agents to reach Nash equilibria in partially observable n-player simultaneous bidding games. The method alternates between trajectory collection, critic-based payoff estimation, external equilibrium computation, and policy imitation via KL divergence minimization. A scalable pairwise payoff formulation is shown to cost less to compute than the exact n-player tensor approach while maintaining strategic quality, and experiments demonstrate the trade-off between fidelity and scalability as the agent count grows.

DNQ · DNQ: Deep Nash Q-Network for Partially Observable n-Player Games

6 · arXiv · cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs

7 · arXiv · cs.AI · 23d ago

Calibrated Collective Oversight (CCO): Scalable Oversight with Finite-Time Statistical Guarantees

This paper introduces Calibrated Collective Oversight (CCO), a framework for maintaining human oversight of agentic AI systems that may exceed human capabilities. CCO aggregates diverse scoring functions into a conservatism penalty inspired by Attainable Utility Preservation, then calibrates this penalty online via Conformal Decision Theory to ensure undesirable outcomes stay below a user-specified threshold with finite-time bounds and no distributional assumptions. Evaluated on a modified SWE-bench (adversarially misaligned agent) and MACHIAVELLI (ethical violations), CCO allows weaker overseers to constrain stronger agents while preserving reward, with empirical violation rates closely matching specified targets.
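
The online calibration idea can be illustrated with a generic feedback update of the kind used in conformal-style methods: raise the penalty level after a violation, lower it slightly otherwise, so that the long-run violation rate tracks the target. The sketch below is an assumption-laden toy, not CCO's actual update rule; `calibrate_online`, the uniform random risk scores, and the step size are all hypothetical.

```python
import random

def calibrate_online(scores, target_rate, step=0.05, penalty=0.5):
    """Adjust a penalty level online so the violation rate tracks target_rate.

    A violation is counted whenever a risk score exceeds the current penalty
    level (a stand-in for "the undesirable outcome slipped through").
    """
    violations = 0
    for score in scores:
        violated = 1 if score > penalty else 0
        violations += violated
        # Feedback step: move up by step*(1 - target) after a violation,
        # down by step*target otherwise.
        penalty += step * (violated - target_rate)
    return violations / len(scores), penalty

# Hypothetical stand-in data: risk scores bounded in [0, 1].
rng = random.Random(0)
scores = [rng.random() for _ in range(5000)]

rate, final_penalty = calibrate_online(scores, target_rate=0.1)
print(f"empirical violation rate = {rate:.3f}, final penalty = {final_penalty:.3f}")
```

For bounded scores, the gap between the empirical rate and the target in this toy is at most the total drift of the penalty divided by `step * len(scores)`, whatever distribution the scores come from. That is the flavor of finite-time, distribution-free guarantee the summary describes.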

Evaluation and Benchmarking · AI Safety Research · Calibrated Collective Oversight (CCO) · Attainable Utility Preservation · Conformal Decision Theory

6 · arXiv · cs.CL · 29d ago

ChronoMedKG: Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

ChronoMedKG is a new biomedical knowledge graph containing 460,497 evidence-linked triples across 13,431 diseases, each annotated with temporal components such as onset window and progression stage. It is constructed via a multi-agent pipeline using multiple frontier LLMs extracting from PubMed/PMC, with multi-model consensus and credibility filtering. The accompanying ChronoTQA benchmark (3,341 questions) reveals frontier LLMs lose ~30 points on temporal vs. static clinical questions, while ChronoMedKG-based retrieval recovers 47–65% of long-tail failures compared to 17–29% for HPOA-RAG. The work addresses a significant gap in existing KGs (PrimeKG, Hetionet, iKraph) that treat disease associations as static facts.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Phenopackets · PubMed · ChronoTQA

5 · arXiv · cs.CL · 8d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking · Agent and Tool Ecosystem · EvoArena · GAIA · LoCoMo

6 · arXiv · cs.AI · 8d ago

DoorDash deploys multi-agent RL system for adaptive dispatch objective weights in food-delivery marketplace

Researchers at DoorDash present a deployed reinforcement learning system that adapts dispatch objective weights in a three-sided food-delivery marketplace using delayed operational feedback signals. Rather than replacing the combinatorial optimizer, a store-level policy selects discrete multipliers that shift the optimizer's tradeoff between delivery quality and batching efficiency. The system uses centralized offline training with Double Q-learning and a conservative regularizer to handle out-of-distribution overestimation, then executes decentrally per store. A production switchback experiment shows increased batching and reduced courier time costs without degrading customer delivery quality.
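
The two training ingredients named above can be sketched in a few lines. Double Q-learning selects the next action with one value estimate and evaluates it with another, which reduces overestimation. A conservative regularizer penalizes high values on actions that the logged data does not support. The sketch below assumes a CQL-style log-sum-exp penalty, which the summary does not specify, and the multiplier values and Q-values are hypothetical.

```python
import math

def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Bootstrapped target: pick the action with one estimate, score it with the other."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]

def conservative_penalty(q_values, logged_action):
    """Assumed CQL-style term: logsumexp over actions minus Q of the logged action."""
    q_max = max(q_values)
    log_sum_exp = q_max + math.log(sum(math.exp(q - q_max) for q in q_values))
    return log_sum_exp - q_values[logged_action]

# Hypothetical discrete action set: multipliers applied to the optimizer's weights.
multipliers = [0.8, 1.0, 1.2]

# Hypothetical Q-values for the next state, one entry per multiplier.
next_q_online = [0.2, 0.9, 0.5]   # used only to select the action
next_q_target = [0.4, 0.3, 0.8]   # used only to evaluate it

target = double_q_target(1.0, next_q_online, next_q_target, gamma=0.9)
penalty = conservative_penalty(next_q_online, logged_action=1)
print(f"target = {target:.2f}, conservative penalty = {penalty:.3f}")
```

Note that the target uses the second estimate's value for the selected action (0.3) rather than that estimate's own maximum (0.8), which is where the reduction in overestimation comes from.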

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Double Q-learning · DoorDash · Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch