MAS-PromptBench: Systematic study of prompt optimization in multi-agent LLM systems
A new arXiv preprint introduces MAS-PromptBench, a benchmark and study examining when, and by how much, system-prompt optimization improves multi-agent LLM systems (MAS). The authors evaluate two prompt optimizers across MAS configurations that vary in task, workflow, communication protocol, and team size. The results show that prompt optimization can unlock significant gains but also expose open challenges, particularly the search space that grows exponentially with agent count.
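As a rough illustration of the exponential growth mentioned above, the sketch below counts joint prompt assignments under a simplifying assumption that is not taken from the paper: each agent independently picks one system prompt from a fixed pool of candidates. The function name and the pool size of 10 are hypothetical.

```python
def joint_prompt_configs(candidates_per_agent: int, num_agents: int) -> int:
    """Count joint prompt assignments when every agent independently
    picks one prompt from a pool of the same size (an assumption)."""
    return candidates_per_agent ** num_agents


if __name__ == "__main__":
    # With a hypothetical pool of 10 candidate prompts per agent,
    # the number of joint configurations is 10 ** num_agents.
    for num_agents in (1, 2, 4, 8):
        print(num_agents, joint_prompt_configs(10, num_agents))
```

Under this assumption, each added agent multiplies the number of joint configurations by the pool size, which is why exhaustive search quickly becomes impractical for larger teams.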
Related events (8)
EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents
EEVEE is a new framework that enables LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap left by prior methods that handled only single-dataset settings. The system uses a router to partition inputs into task clusters and assign them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming the prior state-of-the-art methods GEPA and ACE by up to 48.2%.
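The routing idea described above can be sketched in a few lines. This is only a toy illustration, not EEVEE's implementation: the keyword-based router, the cluster names, and the prompt strings are all hypothetical, and the router-prompt co-evolution step is not modeled.

```python
from collections import defaultdict

# Hypothetical cluster -> prompt configuration mapping (not from the paper).
PROMPTS = {
    "math": "Solve the problem step by step.",
    "code": "Write correct, tested code.",
    "general": "Answer concisely.",
}

# Hypothetical keyword lists standing in for a learned router.
KEYWORDS = {
    "math": ("integral", "equation", "prove"),
    "code": ("function", "bug", "python"),
}


def route(text: str) -> str:
    """Return the first cluster whose keywords appear in the input."""
    lowered = text.lower()
    for cluster, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return cluster
    return "general"


def partition(stream):
    """Group a mixed task stream into clusters keyed by routed label."""
    clusters = defaultdict(list)
    for item in stream:
        clusters[route(item)].append(item)
    return dict(clusters)


if __name__ == "__main__":
    stream = ["Fix this Python bug", "Solve the equation x + 1 = 2"]
    for cluster, items in partition(stream).items():
        print(cluster, "->", PROMPTS[cluster], items)
```

A learned router would replace the keyword lookup, and in a co-evolution setting both the routing rule and the per-cluster prompts would be updated as new inputs arrive.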
Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
This paper investigates multi-objective prompt optimization for LLM-as-judge systems, testing five decomposition modes of textual gradient optimizers across varying levels of cross-task information sharing. In 6 of 10 configurations, optimization fails to improve over the initial prompt, with gradient specificity dropping 59% when multiple criteria are processed jointly. The authors identify two separable failure modes: gradient dilution at optimization time and instruction interference at inference time. These findings constrain the design space for customizing LLM judges via textual feedback across multiple evaluation criteria simultaneously.
Structured Prompt Checklists Outperform Raw and Clarifying-Question Prompts Across LLMs
This paper compares three prompt design strategies—raw prompts, checklist-improved prompts, and clarifying-question prompts—across four task types and three LLM systems (ChatGPT, Claude, Grok). Checklist-improved prompts achieved the highest mean rubric score (7.50/8) versus 5.67 for raw and 6.67 for clarifying-question prompts. Checklist prompts also used fewer tokens on average, suggesting a favorable quality-effort tradeoff. The study provides empirical grounding for structured prompt engineering as a practical technique to reduce multi-turn interaction overhead.
PlanBench-XL: Benchmark for LLM Agent Planning in Large-Scale Tool Ecosystems
Researchers introduce PlanBench-XL, an interactive benchmark of 327 retail tasks spanning 1,665 tools designed to evaluate LLM agents on long-horizon planning under retrieval-limited tool visibility. The benchmark includes a blocking mechanism simulating real-world disruptions such as missing or failing tools, forcing agents to detect and recover from broken execution paths. Experiments on ten leading LLMs reveal severe performance degradation: GPT-5.4 drops from 51.90% accuracy in unblocked settings to 11.36% under the most severe blocking condition, highlighting fragility in adaptive planning for large, imperfect tool environments.
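To put the reported GPT-5.4 figures in perspective, the short calculation below converts the two accuracies quoted above into an absolute and a relative drop. The helper name is hypothetical; only the two input numbers come from the summary.

```python
def accuracy_drop(unblocked: float, blocked: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = unblocked - blocked
    relative = absolute / unblocked
    return absolute, relative


if __name__ == "__main__":
    absolute, relative = accuracy_drop(51.90, 11.36)
    print(f"absolute drop: {absolute:.2f} points")
    print(f"relative drop: {relative:.1%}")
```

In other words, the drop is roughly 40.5 percentage points, which is a little over three quarters of the unblocked accuracy.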
MA²P: A Meta-Cognitive Multi-Agent Framework for Complex Persuasion
The paper introduces MA²P, a multi-agent framework designed for complex persuasion tasks where the persuadee's internal states are latent. The system coordinates perception management, mental-state inference, strategy execution, memory, and evaluation modules, and adds a meta-cognitive configurator that selects domain-appropriate strategies from a structured knowledge base to reduce cross-domain performance variance. Experiments show higher persuasion success rates compared to baselines. The work addresses a known weakness of LLMs in producing generic or weakly grounded persuasive responses.
How Long Prompts Block Other Requests - Optimizing LLM Performance
This Hugging Face blog post from TNG Technology Consulting examines how long prompts create head-of-line blocking in LLM serving systems, degrading latency for concurrent requests. The post analyzes the mechanics of prompt processing in inference pipelines and discusses optimization strategies to mitigate throughput bottlenecks caused by lengthy context inputs. It is framed as a practical guide for teams deploying LLMs in production environments where mixed prompt-length workloads are common.
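Head-of-line blocking can be illustrated with a toy model. The sketch below is not taken from the blog post: it assumes a single server that processes prompts strictly one at a time in arrival order, that all requests arrive at time zero, and a hypothetical constant prefill speed of 1,000 tokens per second.

```python
def fifo_latencies(prompt_tokens, tokens_per_second=1000.0):
    """Completion time (seconds) of each request when prompts are
    processed one at a time, in arrival order, all arriving at t=0."""
    clock = 0.0
    latencies = []
    for tokens in prompt_tokens:
        clock += tokens / tokens_per_second  # time spent on this prompt
        latencies.append(clock)              # request finishes at `clock`
    return latencies


if __name__ == "__main__":
    # Three short prompts on their own.
    print(fifo_latencies([100, 100, 100]))
    # The same short prompts queued behind one long prompt.
    print(fifo_latencies([20000, 100, 100]))
```

In this simplified model, the two short requests queued behind the 20,000-token prompt finish after about 20 seconds instead of a fraction of a second, even though their own processing time is unchanged. Real serving systems batch and interleave work, so actual numbers will differ.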
TriggerBench: A benchmark for evaluating prospective memory in LLMs
Researchers introduce TriggerBench, a benchmark evaluating prospective memory (PM) in LLMs — the ability to spontaneously recall and act on latent constraints without explicit prompting. The benchmark spans five dimensions across daily assistant and professional workflow scenarios, and reveals that PM is substantially harder than retrospective memory, decaying sharply with context length while retrospective memory near-saturates at 100K tokens. Key findings include a precision-recall trade-off in PM, attentional fragility under concurrent requests, and a novel result that PM accuracy correlates with spare reasoning capacity as measured against AIME-2025 math performance.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

