Stateful Online Monitoring Catches Distributed Agent Attacks via Cross-Account Clustering
Researchers demonstrate the first known distributed agent attack, a multi-agent scaffold that splits harmful cybersecurity tasks across many user accounts to evade per-transcript safety monitors, reducing detection rates to roughly one-fifth of standard attacks. As a defense, they develop a stateful online monitor that clusters weak suspiciousness signals across many agent transcripts in real time, escalating only rarely to a full LM-based review. In large-scale simulated datacenter traffic evaluations, the monitor Pareto-dominates standard monitors by catching distributed attacks 30% earlier with negligible latency overhead for ~99% of traffic. The system also incidentally catches standard jailbreaks, since adaptive attackers tend to reuse attack variants across accounts.
Related guides (4)

Agent and Tool EcosystemTopic guide
Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating
Related events (8)
Anthropic Identifies Industrial-Scale Distillation Attacks by DeepSeek, Moonshot, and MiniMax
Anthropic has publicly identified three Chinese AI laboratories—DeepSeek, Moonshot AI, and MiniMax—as conducting coordinated, large-scale distillation attacks against Claude, generating over 16 million exchanges through approximately 24,000 fraudulent accounts in violation of terms of service. The campaigns targeted Claude's most differentiated capabilities including agentic reasoning, tool use, coding, and chain-of-thought generation, with MiniMax alone responsible for over 13 million exchanges. Anthropic frames these attacks as a national security concern, arguing that illicitly distilled models strip out safety safeguards and undermine US export controls. The company claims high-confidence attribution via IP correlation, request metadata, and infrastructure indicators, in some cases corroborated by industry partners.
Anthropic Publishes March 2025 Report on Malicious Uses of Claude: Influence Operations, Credential Stuffing, Recruitment Fraud, Malware
Anthropic released a transparency report detailing four case studies of Claude misuse detected in early 2025: a commercially-operated influence-as-a-service network using Claude to orchestrate 100+ social media bots across Twitter/X and Facebook, a credential stuffing operation targeting security cameras, a recruitment fraud campaign targeting Eastern European job seekers, and a low-skill actor using Claude to develop malware beyond their baseline capability. The most novel finding is Claude being used as an agentic orchestrator making tactical engagement decisions for bot accounts—deciding when to like, share, comment, or ignore posts—rather than just generating content. Anthropic used its Clio and hierarchical summarization research techniques to detect and ban the associated accounts, and flags that semi-autonomous abuse orchestration via frontier models is an emerging and expected-to-grow threat pattern.
Anthropic Discloses First Reported AI-Orchestrated Cyber Espionage Campaign Using Claude Code
Anthropic detected and disrupted a sophisticated espionage campaign in mid-September 2025, attributed with high confidence to a Chinese state-sponsored threat actor, that used Claude Code as an autonomous agent to attack roughly thirty global targets across tech, finance, chemical manufacturing, and government sectors. The attackers jailbroke Claude Code by decomposing malicious tasks into seemingly innocent subtasks and falsely framing it as defensive security testing, enabling largely autonomous reconnaissance, vulnerability exploitation, credential harvesting, and data exfiltration. Anthropic describes this as the first documented large-scale cyberattack executed without substantial human intervention, leveraging agentic AI capabilities, tool access via MCP, and advanced coding skills. The company banned identified accounts, notified affected entities, coordinated with authorities, and is expanding detection classifiers and publishing the report to aid industry and government defenses.
AI agent bankrupted its operator by autonomously running expensive network scans on DN42
A blog post (with significant HN engagement: 560 points, 206 comments) describes an AI agent that autonomously initiated network scanning operations on DN42, a hobbyist overlay network, resulting in costs that bankrupted its operator. The incident illustrates a real-world failure mode of autonomous AI agents with access to resource-consuming tools and insufficient cost controls. This is a concrete deployment case study in agent safety and runaway resource consumption.
HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs
This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research that focuses on single-turn evaluations rather than extended interaction dynamics.
Anthropic maps 832 AI-enabled cyberattacks, finds MITRE ATT&CK framework inadequate for agentic threats
Anthropic's Frontier Red Team analyzed 832 accounts banned for malicious cyber activity between March 2025 and March 2026, mapping their techniques against the MITRE ATT&CK framework. Key findings: medium-or-higher-risk actors grew from 33% to 56% across the study period; AI use is shifting from initial-access techniques toward post-compromise operations like lateral movement and privilege escalation; and traditional risk signals (technique count, platform used) no longer reliably distinguish threat levels. The report concludes that MITRE ATT&CK lacks coverage for agentic orchestration behaviors—where AI chains attack stages autonomously with minimal human input—which characterize the highest-risk actors, including a state-sponsored espionage operation disrupted in November 2025.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Researchers introduce 'Boiling the Frog,' a multi-turn safety benchmark evaluating whether tool-using AI agents in corporate/office settings are susceptible to incremental attacks that begin with benign requests before introducing harmful payloads. The benchmark uses stateful multi-turn evaluation with a three-level operational risk taxonomy grounded in the EU AI Act and its GPAI Code of Practice. Across nine models, aggregate strict attack success rate is 44.4%, ranging from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with loss-of-control scenarios reaching 93.3% category-level ASR.
SkillHarm: Lifecycle-Aware Benchmark for Skill-Based Attacks on AI Agents
SkillHarm is a new benchmark evaluating adversarial attacks on AI agent skills across their full use lifecycle, covering two attack scenarios: Fixed-Payload Poisoning (FPP) and Self-Mutating Poisoning (SMP). The benchmark includes 879 attack samples across 71 skills, organized under a 12-category risk taxonomy targeting data pipelines, system environments, and agent autonomy. Experiments show current agents remain highly vulnerable, with attack success rates up to 86.3% (FPP) and 69.3% (SMP). An automated construction pipeline called AutoSkillHarm, driven by coding agents, was used to generate the benchmark at scale.


