Entity · paper

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

paper · active

multi-agent-reinforcement-learning-from-delayed-marketplace-feedback-for-objective-weight-adaptation-in-three-sided-dispatch-32986327

1 event · first seen Jun 12, 2026


Co-occurring entities

Double Q-learning · DoorDash

More like this

Active Offline-to-Online Reinforcement Learning
Reward Modeling for Multi-Agent Orchestration
Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
Preference Coordinated Multi-agent Policy Optimization
Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
Reinforcement Learning from Rich Feedback with Distributional DAgger
multi-turn agent benchmarks
UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
Physics-EnhAnced Reinforcement Learning
QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Recent events (1)

arXiv · cs.AI · Jun 12, 2026

DoorDash deploys multi-agent RL system for adaptive dispatch objective weights in food-delivery marketplace

Researchers at DoorDash present a deployed reinforcement learning system that adapts dispatch objective weights in a three-sided food-delivery marketplace using delayed operational feedback signals. Rather than replacing the combinatorial optimizer, a store-level policy selects discrete multipliers that shift the optimizer's tradeoff between delivery quality and batching efficiency. The system uses centralized offline training with Double Q-learning and a conservative regularizer to handle out-of-distribution overestimation, then executes decentrally per store. A production switchback experiment shows increased batching and reduced courier time costs without degrading customer delivery quality.
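The training recipe summarized above (offline Double Q-learning, a conservative penalty, then per-store action selection) can be illustrated with a small tabular sketch. This is not the paper's implementation. The multiplier values, the three-state discretization, the hyperparameters, and the function names are all hypothetical, and the penalty is a CQL-style log-sum-exp term chosen here as one plausible form of "conservative regularizer", since the summary does not say which one is used. Rewards are assumed to be the delayed marketplace signal, already attributed to each logged decision.

```python
import math
import random

MULTIPLIERS = [0.8, 1.0, 1.2]  # hypothetical discrete objective-weight multipliers
N_STATES = 3                   # hypothetical discretized store conditions


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def train_offline(dataset, epochs=200, lr=0.1, gamma=0.9, alpha=0.5, seed=0):
    """Centralized offline training on logged (s, a, r, s_next, done) tuples."""
    rng = random.Random(seed)
    n_actions = len(MULTIPLIERS)
    q_a = [[0.0] * n_actions for _ in range(N_STATES)]
    q_b = [[0.0] * n_actions for _ in range(N_STATES)]
    for _ in range(epochs):
        for s, a, r, s_next, done in dataset:
            # Double Q-learning: pick one table to update; that table selects
            # the greedy next action and the other table evaluates it.
            upd, other = (q_a, q_b) if rng.random() < 0.5 else (q_b, q_a)
            if done:
                target = r
            else:
                best = max(range(n_actions), key=lambda b: upd[s_next][b])
                target = r + gamma * other[s_next][best]
            upd[s][a] += lr * (target - upd[s][a])
            # Assumed CQL-style penalty: descend on logsumexp(Q[s, :]) - Q[s, a],
            # which lowers values of actions the log does not support.
            probs = softmax(upd[s])
            for b in range(n_actions):
                grad = probs[b] - (1.0 if b == a else 0.0)
                upd[s][b] -= lr * alpha * grad
    return q_a, q_b


def select_multiplier(q_a, q_b, state):
    """Decentralized execution: each store acts greedily on the averaged tables."""
    avg = [(x + y) / 2 for x, y in zip(q_a[state], q_b[state])]
    return MULTIPLIERS[max(range(len(avg)), key=avg.__getitem__)]
```

As a usage illustration, training on a log such as `[(0, 0, 0.2, 0, True), (0, 1, 1.0, 0, True)]` never observes the third multiplier in state 0, so its value is updated only by the penalty and is pushed below zero. That is the intended effect of a conservative term: an action absent from the logged data cannot look attractive through overestimation alone.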

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Double Q-learning · DoorDash · Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch