5arXiv cs.AI (Artificial Intelligence)·3d ago

ANTAP: Geometry-based routing defense against malicious agents in multi-agent systems

Researchers introduce ANTAP (Automatic Non-Textual Agent Picker), a routing architecture for multi-agent LLM systems that replaces text-based agent self-descriptions with empirical capability testing and algebraic projection in a shared semantic space. The approach creates a 'linguistic firewall' that makes metadata-based injection attacks inexpressible at inference time. Against description-based injection attacks, ANTAP achieves near-zero attack success rate compared to 67.3%+ for baseline routers, and reduces embedding-based attack success by 20%.

AI Safety Research Agent and Tool Ecosystem ANTAP Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing

Related guides (2)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Principles to Real-World Flashpoints

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

5arXiv · cs.AI·4d ago·source ↗

Agent-Native Immune System (ANIS): Biologically-inspired runtime defense architecture for autonomous AI agents

A new arXiv preprint introduces the Agent-Native Immune System (ANIS), a defense architecture embedded within an agent's cognitive loop rather than applied externally at training time or perimeter. The framework proposes a six-layer Immune Tower, a taxonomy of Agent Viruses and Agent Vaccines distinguishing parametric from non-parametric defenses, and a self-monitoring Harness Triad enabling Continual Immune Learning. The paper draws a formal distinction between static training-time alignment and dynamic runtime immunity, arguing current defenses leave agents vulnerable to memory poisoning, tool-chain manipulation, and multi-agent protocol attacks.

AI Safety Research Agent and Tool Ecosystem Agent-Native Immune System Agent-Native Immune System: Architecture, Taxonomy, and Engineering

6arXiv · cs.AI·3d ago·source ↗

MESA framework proactively ranks vulnerable communication channels in multi-agent systems

Researchers introduce MESA, a label-free framework for prioritizing security-critical communication edges in multi-agent systems (MAS) before attacks are observed. The framework combines six graph-theoretic metrics with two dynamic probes (ablation and masking) to rank edges by compromise risk, without requiring attack traces. Evaluated across three MAS scenarios, eight network topologies, and five open-source LLMs, MESA achieves mean Spearman ρ=+0.60 correlation with empirical per-edge attack success, and monitoring the top 10% of ranked edges intercepts roughly 3x more successful attacks than random allocation. The work highlights that attack impact in MAS is highly concentrated — a single compromised edge can account for up to 75% of total attack success.

AI Safety Research Agent and Tool Ecosystem MESA MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems Gemma +3 more

7arXiv · cs.AI·1mo ago·source ↗

Stateful Online Monitoring Catches Distributed Agent Attacks via Cross-Account Clustering

Researchers demonstrate the first known distributed agent attack, a multi-agent scaffold that splits harmful cybersecurity tasks across many user accounts to evade per-transcript safety monitors, reducing detection rates to roughly one-fifth of standard attacks. As a defense, they develop a stateful online monitor that clusters weak suspiciousness signals across many agent transcripts in real time, escalating only rarely to a full LM-based review. In large-scale simulated datacenter traffic evaluations, the monitor Pareto-dominates standard monitors by catching distributed attacks 30% earlier with negligible latency overhead for ~99% of traffic. The system also incidentally catches standard jailbreaks, since adaptive attackers tend to reuse attack variants across accounts.

Evaluation and Benchmarking Inference Economics Real-Time Clustering Stateful Online Monitor Distributed Agent Attack +5 more

5Hugging Face Blog·1mo ago·source ↗

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

ServiceNow AI has released AprielGuard, a guardrail system designed to improve safety and adversarial robustness in LLM deployments. The system targets prompt injection, jailbreaks, and other adversarial inputs that bypass standard safety measures. It is presented as a component for enterprise LLM pipelines seeking more robust content moderation and safety filtering.

AI Safety Research Enterprise Deployment Patterns ServiceNow AI AprielGuard Hugging Face +1 more

7arXiv · cs.CL·1mo ago·source ↗

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.

Frontier Model Releases Agent and Tool Ecosystem AXPO GRPO Thinking-Acting Gap +4 more

5arXiv · cs.CL·3d ago·source ↗

PRP: Proactive Routing Paradigm for efficient visual reasoning via early difficulty estimation

Researchers introduce PRP (Proactive Routing Paradigm), a method for routing multimodal queries to either a small draft model or a large target model before generating a full output, based on jointly estimated difficulty signals. The approach uses Draft Rating Learning to give the draft model an internal confidence estimator and Joint Rating Learning to predict target model competence, enabling fine-grained instance-level routing. Experiments across multimodal reasoning benchmarks show inference acceleration without accuracy loss. The work addresses a gap in speculative/cooperative inference for vision-language models, where existing language-model routing methods fail.

Inference Economics Multimodal Progress Draft Rating Learning PRP (Proactive Routing Paradigm)Joint Rating Learning

6arXiv · cs.LG·22d ago·source ↗

APPO: Fine-grained branching and credit assignment for agentic RL in LLMs

Researchers introduce Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that shifts branching and credit assignment from coarse tool-call boundaries to fine-grained decision points within generated sequences. APPO uses a Branching Score combining token uncertainty with policy-induced likelihood gains to select exploration points, plus procedure-level advantage scaling for credit distribution. Evaluated on 13 benchmarks, APPO improves strong agentic RL baselines by nearly 4 points while maintaining efficient tool use and interpretability. The work addresses a known weakness in multi-turn agentic RL: that influential decisions are distributed throughout sequences, not concentrated at tool-call boundaries.

Agent and Tool Ecosystem Alignment and RLHF APPO: Agentic Procedural Policy Optimization

4arXiv · cs.CL·1mo ago·source ↗

MA²P: A Meta-Cognitive Multi-Agent Framework for Complex Persuasion

The paper introduces MA²P, a multi-agent framework designed for complex persuasion tasks where the persuadee's internal states are latent. The system coordinates perception management, mental-state inference, strategy execution, memory, and evaluation modules, and adds a meta-cognitive configurator that selects domain-appropriate strategies from a structured knowledge base to reduce cross-domain performance variance. Experiments show higher persuasion success rates compared to baselines. The work addresses a known weakness of LLMs in producing generic or weakly grounded persuasive responses.

Agent and Tool Ecosystem Alignment and RLHF large language models meta-cognitive configurator MA²P +1 more