From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper argues that the next major bottleneck in agentic AI is system-level design—what the authors call 'scaling the harness'—rather than continued model scaling alone. The agent harness encompasses memory substrates, context constructors, skill-routing layers, orchestration loops, and verification/governance components that together translate model capability into long-horizon behavior. The authors identify three core bottlenecks (context governance, trustworthy memory, dynamic skill routing) and propose harness-level benchmarks measuring trajectory quality, memory hygiene, and verification cost. They introduce CheetahClaws, a Python-native reference harness, and compare it against Claude Code and OpenClaw.
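As a rough illustration of how the harness components named above might fit together, the following is a minimal sketch, not CheetahClaws' actual design. Every name in it (`Harness`, `build_context`, `route`, `scripted_model`, the `noop` fallback skill) is hypothetical, and the model is replaced by a deterministic stub so the loop can be run without any API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    """Toy harness; all names are hypothetical, not taken from the paper."""
    model: Callable[[str], str]               # stand-in for an LLM call
    skills: Dict[str, Callable[[str], str]]   # skill name -> handler; must include "noop"
    verify: Callable[[str], bool]             # verification/governance gate
    memory: List[str] = field(default_factory=list)
    context_budget: int = 3                   # max memory entries placed in a prompt

    def build_context(self, task: str) -> str:
        # Context constructor: keep only the most recent memory entries.
        recent = self.memory[-self.context_budget:]
        return "\n".join(recent + [f"TASK: {task}"])

    def route(self, decision: str) -> Callable[[str], str]:
        # Skill routing: the text before the first ":" names the skill to run.
        name = decision.split(":", 1)[0].strip()
        return self.skills.get(name, self.skills["noop"])

    def run(self, task: str, max_steps: int = 5) -> List[str]:
        # Orchestration loop: decide, act, verify, remember.
        trajectory: List[str] = []
        for _ in range(max_steps):
            decision = self.model(self.build_context(task))
            result = self.route(decision)(decision)
            if not self.verify(result):
                # Rejected results are logged but never written to memory.
                trajectory.append(f"rejected: {result}")
                continue
            self.memory.append(result)
            trajectory.append(result)
            if result.startswith("done"):
                break
        return trajectory


def scripted_model(context: str) -> str:
    # Deterministic stub: search first, finish once a search result is in context.
    return "finish: wrap up" if "found" in context else "search: docs"


harness = Harness(
    model=scripted_model,
    skills={
        "search": lambda decision: "found 2 documents",
        "finish": lambda decision: "done",
        "noop": lambda decision: "",
    },
    verify=lambda result: bool(result),
)
trajectory = harness.run("summarize the docs")
```

The point of the sketch is that memory writes pass through the verifier, which is one simple way to read "trustworthy memory"; a real harness would need far richer policies for each component.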
Related events (8)
Awesome Harness Engineering: Curated List for AI Agent Infrastructure
A GitHub repository aggregating resources on AI agent harness engineering, covering tools, patterns, evaluations, memory systems, MCP (Model Context Protocol), permissions, observability, and orchestration. The list has accumulated 1,318 stars with 39 added today, indicating moderate community traction. It serves as a reference index rather than original research or tooling.
Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems
This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.
SIA: Self-Improving AI via Joint Harness and Weight Updates
SIA proposes a self-improving loop in which a Feedback-Agent simultaneously updates both the scaffold (harness) and the model weights of a task-specific agent, unifying two previously disjoint research lines: meta-agent scaffold rewriting and test-time training. The system is evaluated on three diverse benchmarks (Chinese legal charge classification, GPU kernel optimization, and single-cell RNA denoising), with reported improvements over baselines of 56.6%, a 91.9% runtime reduction, and 502%, respectively. The paper argues that harness updates shape agentic behavior while weight updates instill domain intuition that prompting alone cannot provide, and that combining both levers consistently outperforms either alone.
Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks
Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone (a 54.3-point gap), while harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.
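To make the two evaluation axes concrete, here is a small sketch of how Pass@1 and a cost metric could be computed side by side. It does not reproduce the benchmark's adapter protocol or its actual cost formula, which the summary does not describe; `RunResult`, `pass_at_1`, `cost_per_resolved`, and the sample rows are all hypothetical. Only the 19.1% and 73.4% figures come from the summary above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    """One attempt at one benchmark instance (hypothetical record layout)."""
    instance_id: str
    resolved: bool    # did the single attempt resolve the issue?
    cost_usd: float   # spend attributed to this attempt


def pass_at_1(results: List[RunResult]) -> float:
    # With exactly one attempt per instance, Pass@1 is the resolved fraction.
    return sum(r.resolved for r in results) / len(results)


def cost_per_resolved(results: List[RunResult]) -> float:
    # Total spend divided by the number of resolved instances.
    resolved = sum(r.resolved for r in results)
    if resolved == 0:
        return float("inf")
    return sum(r.cost_usd for r in results) / resolved


# Made-up sample data, for illustration only.
sample = [
    RunResult("repo-a#1", True, 0.5),
    RunResult("repo-a#2", False, 0.25),
    RunResult("repo-b#7", True, 1.0),
    RunResult("repo-c#3", False, 0.25),
]
accuracy = pass_at_1(sample)
unit_cost = cost_per_resolved(sample)

# Gap between the full and minimal adapters reported in the summary.
adapter_gap = round(73.4 - 19.1, 1)
```

Reporting cost per resolved instance rather than raw total spend is one design choice among several; it penalizes harnesses that spend heavily on instances they fail to resolve.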
AI Scaling Myths
A commentary piece from normaltech.ai argues that AI scaling will eventually hit limits, framing the debate as a question of timing rather than whether limits exist. The piece appears to challenge prevailing optimism around continued scaling returns. Given the minimal body text, the depth of argument is unclear, but the topic directly engages the scaling laws debate central to frontier AI development.
Anthropic publishes Responsible Scaling Policy with AI Safety Level framework
Anthropic released its Responsible Scaling Policy (RSP), a formal framework of technical and organizational protocols for managing catastrophic risks from increasingly capable AI systems. The policy introduces AI Safety Levels (ASL-1 through ASL-5+), modeled on US biosafety level standards, requiring progressively stricter safety, security, and operational standards as models become more capable. Current Claude models are classified as ASL-2; ASL-3 triggers stricter deployment constraints including adversarial red-teaming requirements. The policy has been approved by Anthropic's board and is intended as a template for industry-wide adoption.
Recursive Agent Harnesses (RAH): harness recursion extends model recursion for long-context coding agents
A new arXiv preprint introduces the Recursive Agent Harness (RAH), a pattern where a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. The authors frame this as 'harness recursion', a code-first extension of model recursion from recursive language models. Evaluated on the Oolong-Synthetic long-context benchmark, RAH improves over the Codex coding-agent baseline from 71.75% to 81.36% with GPT-5 as backbone, and reaches 89.77% with Claude Sonnet 4.5. The work connects emerging production patterns (e.g., Anthropic's dynamic workflows) to a formal architectural concept.
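The recursion pattern described above can be sketched in a few lines, under heavy simplification. In this sketch the parent calls child harnesses as in-process functions on a thread pool rather than generating executable scripts, and the "agent" is a plain counting function instead of a model; `recursive_harness`, `leaf_agent`, `combine`, `max_leaf_lines`, and `fanout` are hypothetical names, not the paper's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def recursive_harness(
    lines: List[str],
    leaf_agent: Callable[[List[str]], int],
    combine: Callable[[List[int]], int],
    max_leaf_lines: int = 4,
    fanout: int = 2,
) -> int:
    """Split the context until each piece fits a leaf agent, then merge results.

    Assumes max_leaf_lines >= 1 and fanout >= 2 so every split shrinks the input.
    """
    if len(lines) <= max_leaf_lines:
        return leaf_agent(lines)
    # Parent step: partition the context into at most `fanout` chunks.
    size = -(-len(lines) // fanout)  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Spawn one child harness per chunk; each child may recurse further.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(
            lambda chunk: recursive_harness(
                chunk, leaf_agent, combine, max_leaf_lines, fanout
            ),
            chunks,
        ))
    return combine(partials)


# Toy long-context task: count records whose status is "error".
records = [
    f"record {i}: status={'error' if i % 3 == 0 else 'ok'}" for i in range(20)
]
total_errors = recursive_harness(
    records,
    leaf_agent=lambda chunk: sum("status=error" in line for line in chunk),
    combine=sum,
)
```

Each level creates its own pool, so a parent waiting on its children does not starve them of worker threads. A production harness would also need limits on recursion depth and total spawned agents.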
Anthropic Releases Responsible Scaling Policy Version 3.0
Anthropic has published the third version of its Responsible Scaling Policy (RSP), a voluntary framework for mitigating catastrophic risks from increasingly capable AI systems. The update reflects two-plus years of experience with the original RSP, reinforcing what worked (ASL-3 safeguards activated in May 2025, industry adoption by OpenAI and Google DeepMind, informing early AI policy) while addressing shortcomings in accountability and transparency. The new version refines the AI Safety Level (ASL) framework and introduces new measures for decision-making transparency. Anthropic acknowledges that some elements of its original theory of change—particularly multilateral coordination and government action at higher capability thresholds—have not fully materialized as hoped.



