6arXiv cs.AI (Artificial Intelligence)·3d ago

MESA framework proactively ranks vulnerable communication channels in multi-agent systems

Researchers introduce MESA, a label-free framework for prioritizing security-critical communication edges in multi-agent systems (MAS) before attacks are observed. The framework combines six graph-theoretic metrics with two dynamic probes (ablation and masking) to rank edges by compromise risk, without requiring attack traces. Evaluated across three MAS scenarios, eight network topologies, and five open-source LLMs, MESA achieves mean Spearman ρ=+0.60 correlation with empirical per-edge attack success, and monitoring the top 10% of ranked edges intercepts roughly 3x more successful attacks than random allocation. The work highlights that attack impact in MAS is highly concentrated — a single compromised edge can account for up to 75% of total attack success.

AI Safety Research Agent and Tool Ecosystem MESA MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems Gemma LangGraph Qwen Llama

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Principles to Real-World Flashpoints

Read asBeginner In-depth

Qwen

Qwen: Alibaba's Open-Weight AI Lab Pushing the Frontier

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

5Hugging Face Blog·14d ago·source ↗

MosaicLeaks: Benchmark for evaluating secret-keeping in research agents

ServiceNow published a post on Hugging Face introducing MosaicLeaks, an evaluation focused on whether research agents can maintain confidentiality of sensitive information during task execution. The work targets a specific safety and alignment concern for agentic systems: information leakage during multi-step research workflows. This is relevant to the growing body of work on agent safety and trustworthiness in enterprise contexts.

Evaluation and Benchmarking AI Safety Research ServiceNow AI MosaicLeaks Hugging Face +1 more

5arXiv · cs.AI·3d ago·source ↗

ANTAP: Geometry-based routing defense against malicious agents in multi-agent systems

Researchers introduce ANTAP (Automatic Non-Textual Agent Picker), a routing architecture for multi-agent LLM systems that replaces text-based agent self-descriptions with empirical capability testing and algebraic projection in a shared semantic space. The approach creates a 'linguistic firewall' that makes metadata-based injection attacks inexpressible at inference time. Against description-based injection attacks, ANTAP achieves near-zero attack success rate compared to 67.3%+ for baseline routers, and reduces embedding-based attack success by 20%.

AI Safety Research Agent and Tool Ecosystem ANTAP Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing

7arXiv · cs.AI·1mo ago·source ↗

Stateful Online Monitoring Catches Distributed Agent Attacks via Cross-Account Clustering

Researchers demonstrate the first known distributed agent attack, a multi-agent scaffold that splits harmful cybersecurity tasks across many user accounts to evade per-transcript safety monitors, reducing detection rates to roughly one-fifth of standard attacks. As a defense, they develop a stateful online monitor that clusters weak suspiciousness signals across many agent transcripts in real time, escalating only rarely to a full LM-based review. In large-scale simulated datacenter traffic evaluations, the monitor Pareto-dominates standard monitors by catching distributed attacks 30% earlier with negligible latency overhead for ~99% of traffic. The system also incidentally catches standard jailbreaks, since adaptive attackers tend to reuse attack variants across accounts.

Evaluation and Benchmarking Inference Economics Real-Time Clustering Stateful Online Monitor Distributed Agent Attack +5 more

6arXiv · cs.AI·14d ago·source ↗

Contagion Networks: formal framework for measuring evaluator bias propagation in multi-agent LLM systems

A new arXiv preprint introduces Contagion Networks, a formal framework for quantifying how systematic evaluation biases spread across interacting LLM agents in multi-agent systems. Using a controlled 3-agent experiment with DeepSeek-chat, the authors measure a Cross-Agent Contagion Matrix and find that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model settings. A key practical finding is that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, offering a concrete mitigation strategy. The authors release an open-source experimental framework alongside the paper.

Evaluation and Benchmarking AI Safety Research Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems MM-EPC deepseek-chat +1 more

8Anthropic News·1mo ago·source ↗

Anthropic maps 832 AI-enabled cyberattacks, finds MITRE ATT&CK framework inadequate for agentic threats

Anthropic's Frontier Red Team analyzed 832 accounts banned for malicious cyber activity between March 2025 and March 2026, mapping their techniques against the MITRE ATT&CK framework. Key findings: medium-or-higher-risk actors grew from 33% to 56% across the study period; AI use is shifting from initial-access techniques toward post-compromise operations like lateral movement and privilege escalation; and traditional risk signals (technique count, platform used) no longer reliably distinguish threat levels. The report concludes that MITRE ATT&CK lacks coverage for agentic orchestration behaviors—where AI chains attack stages autonomously with minimal human input—which characterize the highest-risk actors, including a state-sponsored espionage operation disrupted in November 2025.

AI Safety Research Regulatory Developments Anthropic Policy Frontier Red Team MITRE ATT&CK Claude Code +3 more

7arXiv · cs.CL·17d ago·source ↗

SearchGEO framework measures LLM search agent vulnerability to web content manipulation

Researchers introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a manipulation pipeline, five-mode attack taxonomy, and multiple output metrics. Evaluating 13 LLM backends on 308 cases each, they find attack success rates ranging from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, with model-family-specific vulnerability patterns. An auxiliary probe escalating endorsement to install commands reveals a behavioral split: Claude over-rejects while GPT over-trusts. The findings argue for treating adversarial search content robustness as a first-class safety evaluation dimension for deployed agents.

Evaluation and Benchmarking AI Safety Research Claude Sonnet 4 Google Gemini 3 Flash +4 more

6arXiv · cs.CL·1mo ago·source ↗

HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs

This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research that focuses on single-turn evaluations rather than extended interaction dynamics.

Evaluation and Benchmarking AI Safety Research large language models HarmAmp TrajSafe +1 more

7Meta Ai Blog·1mo ago·source ↗

Meta Publishes Advanced AI Scaling Framework and Safety & Preparedness Report for Muse Spark

Meta has released an updated Advanced AI Scaling Framework that expands risk evaluation categories—including chemical/biological threats, cybersecurity, and loss-of-control risks—and introduces formal Safety & Preparedness Reports tied to specific model deployments. The first such report covers Muse Spark, Meta's advanced reasoning model, detailing pre- and post-safeguard evaluations across severe risk categories and ideological balance. Meta also describes a shift in safety methodology: rather than scenario-specific refusal training, Muse Spark is trained on the reasoning behind safety principles, enabling more generalizable behavior in novel situations. The framework applies across open, API, and closed deployments.

Frontier Model Releases Evaluation and Benchmarking Advanced AI Scaling Framework Meta AI Frontier AI Framework +6 more