6arXiv cs.CL (Computation and Language)·1mo ago

Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems

This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.

Evaluation and Benchmarking AI Safety Research Enterprise Deployment Patterns Agent and Tool Ecosystem embodied agents large language models Code as Agent Harness multi-agent coordination multi-agent systems GUI/OS automation execution-based verification

Related guides (4)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Read asIn-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

3Github Trending·22d ago·source ↗

Awesome Harness Engineering: Curated List for AI Agent Infrastructure

A GitHub repository aggregating resources on AI agent harness engineering, covering tools, patterns, evaluations, memory systems, MCP (Model Context Protocol), permissions, observability, and orchestration. The list has accumulated 1,318 stars with 39 added today, indicating moderate community traction. It serves as a reference index rather than original research or tooling.

Evaluation and Benchmarking Agent and Tool Ecosystem ai-boost/awesome-harness-engineering Model Context Protocol

6arXiv · cs.LG·25d ago·source ↗

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper argues that the next major bottleneck in agentic AI is system-level design—what the authors call 'scaling the harness'—rather than continued model scaling alone. The agent harness encompasses memory substrates, context constructors, skill-routing layers, orchestration loops, and verification/governance components that together translate model capability into long-horizon behavior. The authors identify three core bottlenecks (context governance, trustworthy memory, dynamic skill routing) and propose harness-level benchmarks measuring trajectory quality, memory hygiene, and verification cost. They introduce CheetahClaws, a Python-native reference harness, and compare it against Claude Code and OpenClaw.

Evaluation and Benchmarking Inference Economics SafeRL-Lab dynamic skill routing Scaling the Harness (paper)+8 more

7arXiv · cs.CL·8d ago·source ↗

Recursive Agent Harnesses (RAH): harness recursion extends model recursion for long-context coding agents

A new arXiv preprint introduces the Recursive Agent Harness (RAH), a pattern where a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. The authors frame this as 'harness recursion', a code-first extension of model recursion from recursive language models. Evaluated on the Oolong-Synthetic long-context benchmark, RAH improves over the Codex coding-agent baseline from 71.75% to 81.36% with GPT-5 as backbone, and reaches 89.77% with Claude Sonnet 4.5. The work connects emerging production patterns (e.g., Anthropic's dynamic workflows) to a formal architectural concept.

Long Context Evolution Evaluation and Benchmarking Claude Sonnet 4.5 Oolong-Synthetic Recursive Agent Harnesses +4 more

4One Useful Thing·1mo ago·source ↗

Claude Code and What Comes Next

A commentary piece from One Useful Thing examining Claude Code and its implications for AI-assisted software development. The author reflects on what agentic coding tools can accomplish with the right scaffolding and considers near-term trajectories. Published in early January 2026, this represents a tier-2 analyst perspective on Anthropic's coding agent product.

Enterprise Deployment Patterns Agent and Tool Ecosystem Ethan Mollick Claude Code Anthropic

4Don'T Worry About The Vase·1mo ago·source ↗

Claude Code, Codex and Agentic Coding #8

Zvi Mowshowitz's eighth installment in his ongoing series tracking the agentic coding landscape, covering developments around Claude Code and OpenAI Codex. As a tier-2 commentary source, the piece synthesizes recent progress and trends in coding agents. The series has been running since the initial wave of excitement around coding agents.

Frontier Model Releases Agent and Tool Ecosystem Claude Code OpenAI OpenAI Codex +2 more

5Hugging Face Blog·1mo ago·source ↗

CodeAgents + Structure: A Better Way to Execute Actions

Hugging Face published a blog post exploring the combination of code-based agents with structured outputs to improve action execution reliability. The post examines how enforcing structured generation can reduce errors and improve the robustness of agentic code execution pipelines. This represents a practical engineering approach to making code agents more dependable in production settings.

Inference Economics Agent and Tool Ecosystem structured output generation Hugging Face CodeAgents +1 more

4Github Trending·25d ago·source ↗

shareAI-lab/learn-claude-code: Minimal Claude Code-style Agent Harness in Python

A GitHub repository implementing a minimal 'nano' version of a Claude Code-style agent harness built from scratch in Python, using Bash as the primary tool interface. The project has accumulated 62,802 stars with 262 added today, indicating significant community interest. It serves as an educational resource for understanding how agentic coding assistants like Claude Code are structured at a low level.

Agent and Tool Ecosystem shareAI-lab Claude Code shareAI-lab/learn-claude-code +1 more

5arXiv · cs.AI·24d ago·source ↗

Governed Evolution of Agent Runtimes through Executable Operational Cognition

This paper proposes a framework for governed runtime evolution in multi-agent systems, formalizing agent-generated code artifacts as persistent runtime capabilities rather than transient outputs. It introduces HarnessMutation, a lifecycle-aware mechanism for runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. The framework models agent self-modification as a bounded, observable, and auditable process over persistent operational memory, building on prior 'Code as Agent Harness' work.

AI Safety Research Agent and Tool Ecosystem Executable Operational Cognition Code as Agent Harness multi-agent systems +1 more