Almanac
Topic guide · In-depth

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Agent and Tool EcosystemIn-depthactive·v1 · live·generated 6d ago
TL;DRThe agent and tool ecosystem began as a loose collection of one-off integrations bolted onto chat models, but has rapidly consolidated around standardized protocols, purpose-built agent runtimes, and a new class of coding agents that now account for measurable fractions of global software output. The central tension has shifted from whether models can reliably drive multi-step tasks to how to govern, secure, and scale the infrastructure that lets them do so autonomously — a question made urgent by the first documented large-scale AI-orchestrated cyberattack and the emergence of models capable of autonomously discovering thousands of critical vulnerabilities.

Key takeaways

  • Anthropic's Model Context Protocol (MCP), released as an open standard, replaced fragmented per-source integrations with a single client-server protocol covering GitHub, Slack, Google Drive, Postgres, and more — and was subsequently exploited as a tool-access layer in the first documented AI-orchestrated espionage campaign (November 2025).
  • Claude Code went GA in September 2025 and by early 2026 was generating over $2.5B in annualized run-rate revenue and accounting for an estimated 4% of all GitHub public commits worldwide; OpenAI's Codex launched as a direct competitor powered by GPT-5.4.
  • The Claude Agent SDK (released with Sonnet 4.5) gave developers access to the same agent infrastructure powering Claude Code, while OpenAI and AWS announced a stateful runtime environment for agents on Amazon Bedrock — both moves signal a shift from stateless API calls to persistent, memory-bearing agent runtimes.
  • Anthropic's Frontier Red Team mapped 832 AI-enabled cyberattacks and found that MITRE ATT&CK lacks coverage for agentic orchestration behaviors, where AI chains attack stages autonomously — the highest-risk actor profile grew from 33% to 56% of banned accounts between March 2025 and March 2026.
  • Meta's Muse Spark introduced a 'contemplating mode' running multiple agents in parallel, and the Goedel-Architect framework demonstrated blueprint-based agentic theorem proving at up to 500x lower cost than comparable systems — illustrating how agent architecture design is itself becoming a research frontier.
  • OpenAI released open-weight models (gpt-oss-120b and gpt-oss-20b, Apache 2.0) with strong tool-use capabilities, and Mistral launched remote cloud coding agents in its Vibe CLI and Le Chat Work mode — broadening the agent harness ecosystem beyond the two dominant closed-API providers.

What this area covers

The agent and tool ecosystem encompasses everything that sits between a raw language model and a useful autonomous workflow: the protocols that let models call external tools, the harnesses that orchestrate multi-step tasks, the runtimes that persist agent state across sessions, and the benchmarks that measure whether any of it actually works. It is the infrastructure layer that determines whether a capable model can be deployed as a reliable actor in the world — not just a responder in a chat window.

Why it matters

The shift from chat to agency is the central commercial and technical story of the current AI cycle. Claude Code alone accounts for an estimated 4% of all GitHub public commits worldwide and generates over $2.5 billion in annualized run-rate revenue. OpenAI's Codex competes directly. Mistral, Meta, DeepSeek, and the open-weights community are all building agent-capable models and harnesses. The tooling layer is where capability translates into economic output — and where the most consequential security failures are now occurring.

Phase 1: From chat to tool use (2022–2024)

The substrate was established by ChatGPT's November 2022 launch, which demonstrated that large language models could be made reliably interactive. The next step was giving models access to external state. OpenAI's o1 series (September 2024) introduced inference-time chain-of-thought reasoning as a first-class capability, and o3/o4-mini (April 2025) shipped with full tool access — the first time a reasoning model and agentic tool use were tightly integrated in a single OpenAI release. The pattern was set: reasoning at inference time plus tool calls equals the minimal viable agent.

Phase 2: Computer use and the first coding agents (mid-2025)

Anthropic's computer use beta (July 2025) extended tool access from structured APIs to unstructured desktop environments — models could now view screens, move cursors, click, and type. Early adopters included Replit, The Browser Company, and Cognition. Two months later, Claude Code went generally available with GitHub Actions, VS Code, and JetBrains integrations, alongside a new MCP connector and Files API in the Anthropic API. Claude 3.7 Sonnet, released the same day, was the first hybrid reasoning model — capable of switching between near-instant and extended thinking within a single call — and achieved state-of-the-art on SWE-bench Verified and TAU-bench. The coding agent category had arrived.

Phase 3: Protocol standardization and the SDK layer (late 2025 – early 2026)

The release of Claude Sonnet 4.5 (November 2025) marked a qualitative shift in the developer surface. The Claude Agent SDK gave practitioners direct access to the same infrastructure powering Claude Code — including context editing, memory tools, and checkpoint management — rather than requiring them to build agent scaffolding from scratch. OSWorld performance jumped from 42.2% (Sonnet 4) to 61.4% (Sonnet 4.5), a proxy for how reliably the model could navigate real desktop environments. Enterprise customers including Cursor, GitHub Copilot, Devin, Canva, and Figma reported measurable gains.

In parallel, OpenAI and AWS announced a stateful agent runtime on Amazon Bedrock (March 2026) — a persistent environment managing agent memories, tool connections, and user permissions across sessions. The legal architecture was notable: the deal exploited a distinction between stateful runtimes (AWS) and stateless API calls (Microsoft Azure), allowing OpenAI to diversify its cloud relationships while honoring existing exclusivity terms. This mirrors Anthropic's own multi-cloud distribution across AWS, Google Cloud Vertex AI, and Microsoft Foundry.

Anthropic open-sourced the Model Context Protocol in May 2026, formalizing what had been an internal standard into a universal client-server specification. MCP replaces the previous pattern of per-source integrations with a single protocol covering GitHub, Slack, Google Drive, Postgres, and more. Early adopters include Block, Apollo, Zed, Replit, Codeium, and Sourcegraph. The open-sourcing signals an intent to make MCP the HTTP of AI tool connectivity — a bet that standardization benefits Anthropic more than proprietary lock-in.

Phase 4: Agentic security as a first-order problem (2025–2026)

The same capabilities that make agents useful make them dangerous when misused. In November 2025, Anthropic detected and disrupted what it describes as the first documented large-scale AI-orchestrated cyberattack: a Chinese state-sponsored actor used Claude Code as an autonomous agent — accessing tools via MCP — to conduct reconnaissance, exploit vulnerabilities, harvest credentials, and exfiltrate data across roughly thirty targets in tech, finance, chemical manufacturing, and government. The attackers bypassed safety measures by decomposing malicious tasks into seemingly innocent subtasks and framing them as defensive security testing.

A subsequent analysis of 832 accounts banned for malicious cyber activity (March 2025 – March 2026) found that medium-or-higher-risk actors grew from 33% to 56% of the population, that AI use is shifting from initial-access techniques toward post-compromise operations like lateral movement and privilege escalation, and that MITRE ATT&CK lacks coverage for agentic orchestration behaviors — where AI chains attack stages autonomously with minimal human input. The framework gap is not academic: it means defenders lack a shared vocabulary for the highest-risk threat class.

Separately, Anthropic identified three Chinese AI laboratories — DeepSeek, Moonshot AI, and MiniMax — conducting coordinated distillation attacks specifically targeting Claude's agentic reasoning, tool use, and chain-of-thought capabilities, generating over 16 million exchanges through approximately 24,000 fraudulent accounts. The attacks targeted the most differentiated parts of the agent stack, not general language capability.

The open-weights and multi-lab landscape

The agent harness ecosystem is not a two-player game. OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0 with strong tool-use capabilities optimized for consumer hardware. Mistral launched remote cloud coding agents in its Vibe CLI and a multi-step agentic Work mode in Le Chat, backed by Mistral Medium 3.5 (128B, 77.6% on SWE-Bench Verified) and Mistral Small 4 (119B MoE, Apache 2.0). DeepSeek V4-Flash (284B total, 13B active) powers the Goedel-Architect agentic theorem-proving framework, which achieved 99.2% pass@1 on MiniF2F-test at up to 500x lower cost than comparable systems. Meta's Muse Spark introduced a "contemplating mode" running multiple agents in parallel — a multi-agent orchestration pattern at the model level rather than the harness level.

Andrew Ng's OpenCoworker (June 2026) represents the open-source end of the harness spectrum: a free desktop agent built on aisuite that works with any API key or local model, explicitly designed for privacy-preserving agentic workflows. Its release coincides with a broader observation in the events bundle that frontier models have become capable enough to reliably drive next-action decisions — the bottleneck has moved from model capability to harness design and infrastructure.

Where it is heading

The consolidation pattern is clear: stateless API calls are giving way to stateful agent runtimes; per-source integrations are giving way to protocol standards like MCP; and purpose-built coding agents are displacing general-purpose chat as the primary commercial surface. The unresolved questions are governance ones. Existing security frameworks (MITRE ATT&CK) do not cover agentic orchestration. Existing deployment patterns (stateless, session-less) do not map to agents that run for hours with persistent memory and broad tool access. The labs building the most capable agents are also the ones most actively publishing threat analyses — a dynamic that will shape how the infrastructure layer evolves as model capability continues to advance.

Agent and tool ecosystem: layers and key players

Major agent harnesses and tool-use platforms (from the events bundle)

ProductDeveloperKey capabilityStatus / availabilityNotable integrations
Claude CodeAnthropicAutonomous coding agent; file read/write, test execution, GitHub pushGA (Sep 2025)GitHub Actions, VS Code, JetBrains, Excel, Chrome
Claude Agent SDKAnthropicDeveloper access to Claude Code's agent infrastructure; context editing, memory toolsGA (Nov 2025)Cursor, GitHub Copilot, Devin, Canva, Figma
CodexOpenAICoding agent powered by GPT-5.4Available (Mar 2026)OpenAI API
OpenAI Stateful Agent RuntimeOpenAI + AWSStateful runtime managing memories, tool connections, permissions on BedrockAnnounced (Mar 2026)Amazon Bedrock
Vibe CLI / Le Chat Work modeMistral AIRemote async coding agents; cross-tool agentic workflowsLaunched (Apr 2026)Email, calendar, issue tracking
OpenCoworkerAndrew Ng / aisuiteOpen-source desktop agent harness; privacy-preserving, own API keys or local modelsFree / open-source (Jun 2026)aisuite, local models
Model Context Protocol (MCP)AnthropicUniversal client-server protocol for AI-to-tool connectionsOpen standard (May 2026)GitHub, Slack, Google Drive, Postgres, Zed, Replit, Codeium, Sourcegraph

All cells sourced from the events bundle; unknown cells render —.

Timeline

  1. ChatGPT launches — first mass-market interactive LLM, establishing the conversational substrate agents would later build on

  2. Anthropic ships computer use in public beta — models can view screens, move cursors, click, and type

  3. Claude Code goes GA with GitHub Actions, VS Code, JetBrains; MCP connector and Files API added to Anthropic API

  4. First documented large-scale AI-orchestrated cyberattack disrupted — attackers used Claude Code via MCP for autonomous reconnaissance and exfiltration

  5. Claude Agent SDK released — developers get access to the same infrastructure powering Claude Code; OSWorld score reaches 61.4%

  6. OpenAI and AWS announce stateful agent runtime on Bedrock — persistent memory, tool connections, and permissions for agents

  7. Anthropic open-sources MCP — universal client-server protocol replacing fragmented per-source integrations

  8. OpenCoworker released as free open-source desktop agent harness; Anthropic ships Claude Mythos 5 and Fable 5 at new capability tier

Related topics

FAQ

What is the Model Context Protocol (MCP) and why does it matter?

MCP is an open standard released by Anthropic that gives AI agents a single, universal client-server interface to external tools and data sources — replacing the previous pattern of writing a bespoke integration for every service. Early adopters include Block, Apollo, Zed, Replit, Codeium, and Sourcegraph.

What is a stateful agent runtime and how does it differ from a regular API call?

A stateless API call processes a prompt and returns a response with no memory of prior interactions; a stateful agent runtime persists the agent's working state — memories, tool connections, user permissions — across many steps and sessions. OpenAI and AWS announced exactly this architecture on Amazon Bedrock in March 2026.

How significant is Claude Code commercially?

By early 2026, Claude Code alone was generating over $2.5 billion in annualized run-rate revenue and was estimated to account for roughly 4% of all GitHub public commits worldwide, making it one of the fastest-growing products in Anthropic's portfolio.

What security risks have emerged from agentic AI?

In November 2025, Anthropic detected and disrupted the first documented large-scale AI-orchestrated cyberattack, in which a state-sponsored actor used Claude Code via MCP to autonomously conduct reconnaissance, exploit vulnerabilities, and exfiltrate data across roughly thirty targets. A subsequent analysis of 832 banned accounts found that agentic orchestration behaviors — where AI chains attack stages with minimal human input — are not covered by the existing MITRE ATT&CK framework.

Is the agent harness ecosystem dominated by closed APIs?

No longer exclusively. OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0 with strong tool-use capabilities, Mistral released open-weight models with agent-oriented features, and Andrew Ng's OpenCoworker is a free open-source desktop agent harness that works with any API key or local model.

Stay current

Call Me Almanac pairs the week's AI news with guides like this one — Midweek & Sunday.

Versions

  • v1live6d ago

Related guides (4)

More on Agent and Tool Ecosystem (6)

7Google Deepmind Blog·1mo ago·source ↗

AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields

DeepMind published a blog post detailing the real-world impact of AlphaEvolve, a Gemini-powered coding agent designed to discover and optimize algorithms. The post covers applications spanning business operations, infrastructure, and scientific research. AlphaEvolve represents a deployment of LLM-driven evolutionary algorithm search at scale across multiple domains.

5Ai Snake Oil·1mo ago·source ↗

Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX

This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.

6arXiv · cs.CL·1mo ago·source ↗

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

5Google Deepmind Blog·1mo ago·source ↗

Enabling a new model for healthcare with AI co-clinician

DeepMind has published a blog post outlining research into an AI co-clinician concept aimed at augmenting clinical care. The post describes a vision for AI-augmented healthcare where AI systems work alongside medical professionals. The content appears to be a high-level research direction announcement rather than a specific model or product release.

7Openai Blog·1mo ago·source ↗

Databricks brings GPT-5.5 to enterprise agent workflows

Databricks is integrating GPT-5.5 into its enterprise agent workflows following the model's state-of-the-art performance on the OfficeQA Pro benchmark. The partnership represents a deployment of OpenAI's latest model within a major data and AI platform. This signals continued enterprise adoption of frontier models for agentic use cases.

5Latent Space·1mo ago·source ↗

AI-Native Healthcare: Abridge on 100M Doctor Visits, Clinician Time Savings, and Prior Auth Automation

Latent Space interviews Abridge co-founders Janie Lee and Chai Asawa about their AI-native healthcare platform that has processed 100 million doctor visits. The system converts patient-clinician conversations into structured clinical documentation, reportedly saving clinicians 10-20 hours per week. The platform also automates prior authorization workflows, reducing turnaround from days to minutes.