What this area covers
The agent and tool ecosystem encompasses everything that sits between a raw language model and a useful autonomous workflow: the protocols that let models call external tools, the harnesses that orchestrate multi-step tasks, the runtimes that persist agent state across sessions, and the benchmarks that measure whether any of it actually works. It is the infrastructure layer that determines whether a capable model can be deployed as a reliable actor in the world — not just a responder in a chat window.
Why it matters
The shift from chat to agency is the central commercial and technical story of the current AI cycle. Claude Code alone accounts for an estimated 4% of all GitHub public commits worldwide and generates over $2.5 billion in annualized run-rate revenue. OpenAI's Codex competes directly. Mistral, Meta, DeepSeek, and the open-weights community are all building agent-capable models and harnesses. The tooling layer is where capability translates into economic output — and where the most consequential security failures are now occurring.
Phase 1: From chat to tool use (2022–2024)
The substrate was established by ChatGPT's November 2022 launch, which demonstrated that large language models could be made reliably interactive. The next step was giving models access to external state. OpenAI's o1 series (September 2024) introduced inference-time chain-of-thought reasoning as a first-class capability, and o3/o4-mini (April 2025) shipped with full tool access — the first time a reasoning model and agentic tool use were tightly integrated in a single OpenAI release. The pattern was set: reasoning at inference time plus tool calls equals the minimal viable agent.
Phase 2: Computer use and the first coding agents (mid-2025)
Anthropic's computer use beta (July 2025) extended tool access from structured APIs to unstructured desktop environments — models could now view screens, move cursors, click, and type. Early adopters included Replit, The Browser Company, and Cognition. Two months later, Claude Code went generally available with GitHub Actions, VS Code, and JetBrains integrations, alongside a new MCP connector and Files API in the Anthropic API. Claude 3.7 Sonnet, released the same day, was the first hybrid reasoning model — capable of switching between near-instant and extended thinking within a single call — and achieved state-of-the-art on SWE-bench Verified and TAU-bench. The coding agent category had arrived.
Phase 3: Protocol standardization and the SDK layer (late 2025 – early 2026)
The release of Claude Sonnet 4.5 (November 2025) marked a qualitative shift in the developer surface. The Claude Agent SDK gave practitioners direct access to the same infrastructure powering Claude Code — including context editing, memory tools, and checkpoint management — rather than requiring them to build agent scaffolding from scratch. OSWorld performance jumped from 42.2% (Sonnet 4) to 61.4% (Sonnet 4.5), a proxy for how reliably the model could navigate real desktop environments. Enterprise customers including Cursor, GitHub Copilot, Devin, Canva, and Figma reported measurable gains.
In parallel, OpenAI and AWS announced a stateful agent runtime on Amazon Bedrock (March 2026) — a persistent environment managing agent memories, tool connections, and user permissions across sessions. The legal architecture was notable: the deal exploited a distinction between stateful runtimes (AWS) and stateless API calls (Microsoft Azure), allowing OpenAI to diversify its cloud relationships while honoring existing exclusivity terms. This mirrors Anthropic's own multi-cloud distribution across AWS, Google Cloud Vertex AI, and Microsoft Foundry.
Anthropic open-sourced the Model Context Protocol in May 2026, formalizing what had been an internal standard into a universal client-server specification. MCP replaces the previous pattern of per-source integrations with a single protocol covering GitHub, Slack, Google Drive, Postgres, and more. Early adopters include Block, Apollo, Zed, Replit, Codeium, and Sourcegraph. The open-sourcing signals an intent to make MCP the HTTP of AI tool connectivity — a bet that standardization benefits Anthropic more than proprietary lock-in.
Phase 4: Agentic security as a first-order problem (2025–2026)
The same capabilities that make agents useful make them dangerous when misused. In November 2025, Anthropic detected and disrupted what it describes as the first documented large-scale AI-orchestrated cyberattack: a Chinese state-sponsored actor used Claude Code as an autonomous agent — accessing tools via MCP — to conduct reconnaissance, exploit vulnerabilities, harvest credentials, and exfiltrate data across roughly thirty targets in tech, finance, chemical manufacturing, and government. The attackers bypassed safety measures by decomposing malicious tasks into seemingly innocent subtasks and framing them as defensive security testing.
A subsequent analysis of 832 accounts banned for malicious cyber activity (March 2025 – March 2026) found that medium-or-higher-risk actors grew from 33% to 56% of the population, that AI use is shifting from initial-access techniques toward post-compromise operations like lateral movement and privilege escalation, and that MITRE ATT&CK lacks coverage for agentic orchestration behaviors — where AI chains attack stages autonomously with minimal human input. The framework gap is not academic: it means defenders lack a shared vocabulary for the highest-risk threat class.
Separately, Anthropic identified three Chinese AI laboratories — DeepSeek, Moonshot AI, and MiniMax — conducting coordinated distillation attacks specifically targeting Claude's agentic reasoning, tool use, and chain-of-thought capabilities, generating over 16 million exchanges through approximately 24,000 fraudulent accounts. The attacks targeted the most differentiated parts of the agent stack, not general language capability.
The open-weights and multi-lab landscape
The agent harness ecosystem is not a two-player game. OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0 with strong tool-use capabilities optimized for consumer hardware. Mistral launched remote cloud coding agents in its Vibe CLI and a multi-step agentic Work mode in Le Chat, backed by Mistral Medium 3.5 (128B, 77.6% on SWE-Bench Verified) and Mistral Small 4 (119B MoE, Apache 2.0). DeepSeek V4-Flash (284B total, 13B active) powers the Goedel-Architect agentic theorem-proving framework, which achieved 99.2% pass@1 on MiniF2F-test at up to 500x lower cost than comparable systems. Meta's Muse Spark introduced a "contemplating mode" running multiple agents in parallel — a multi-agent orchestration pattern at the model level rather than the harness level.
Andrew Ng's OpenCoworker (June 2026) represents the open-source end of the harness spectrum: a free desktop agent built on aisuite that works with any API key or local model, explicitly designed for privacy-preserving agentic workflows. Its release coincides with a broader observation in the events bundle that frontier models have become capable enough to reliably drive next-action decisions — the bottleneck has moved from model capability to harness design and infrastructure.
Where it is heading
The consolidation pattern is clear: stateless API calls are giving way to stateful agent runtimes; per-source integrations are giving way to protocol standards like MCP; and purpose-built coding agents are displacing general-purpose chat as the primary commercial surface. The unresolved questions are governance ones. Existing security frameworks (MITRE ATT&CK) do not cover agentic orchestration. Existing deployment patterns (stateless, session-less) do not map to agents that run for hours with persistent memory and broad tool access. The labs building the most capable agents are also the ones most actively publishing threat analyses — a dynamic that will shape how the infrastructure layer evolves as model capability continues to advance.




