Recursive Agent Harnesses (RAH): harness recursion extends model recursion for long-context coding agents
A new arXiv preprint introduces the Recursive Agent Harness (RAH), a pattern in which a parent agent generates executable scripts that spawn parallel subagent harnesses, each equipped with filesystem tools, code execution, and planning capabilities. The authors frame this as 'harness recursion', a code-first extension of the model recursion used in recursive language models. On the Oolong-Synthetic long-context benchmark with GPT-5 as the backbone, RAH scores 81.36% against 71.75% for the Codex coding-agent baseline, and it reaches 89.77% with Claude Sonnet 4.5. The work connects emerging production patterns (e.g., Anthropic's dynamic workflows) to a formal architectural concept.
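The fan-out step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the preprint's implementation: `split_into_chunks`, `run_subagent`, and `parent_harness` are hypothetical names, and the subagent is a stub that counts matching lines instead of running a model with tools.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_into_chunks(lines, chunk_size):
    """Split a list of lines into consecutive chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def run_subagent(workdir, task):
    """Hypothetical stand-in for a subagent harness.

    A real subagent would run a model with filesystem and code-execution
    tools; this stub only reads its input file and counts matching lines.
    """
    text = (Path(workdir) / "input.txt").read_text()
    return sum(1 for line in text.splitlines() if task["label"] in line)


def parent_harness(lines, label, chunk_size=2, max_workers=4):
    """Give each chunk its own directory, run subagents in parallel, sum the results."""
    with tempfile.TemporaryDirectory() as root:
        jobs = []
        for i, chunk in enumerate(split_into_chunks(lines, chunk_size)):
            workdir = Path(root) / f"sub_{i}"
            workdir.mkdir()
            (workdir / "input.txt").write_text("\n".join(chunk))
            jobs.append((workdir, {"label": label}))
        # Subagents run while the temporary directories still exist.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            partials = list(pool.map(lambda job: run_subagent(*job), jobs))
    return sum(partials)


if __name__ == "__main__":
    records = ["id=1 label=spam", "id=2 label=ham", "id=3 label=spam",
               "id=4 label=ham", "id=5 label=spam"]
    print(parent_harness(records, "label=spam"))  # prints 3
```

In this sketch the recursion is only one level deep. A subagent that itself called `parent_harness` on an oversized chunk would give the nested structure the summary describes.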
Related events (8)
Recursive Language Models Offer Path To Dramatically Expand Beyond the Context Window
MIT researchers Alex L. Zhang, Tim Kraska, and Omar Khattab propose Recursive Language Models (RLMs), a framework that offloads long-context processing to an external Python REPL environment, allowing models to programmatically fetch and manage text chunks via code generation. The root model spawns submodel instances to handle subtasks and aggregates their outputs recursively. Evaluated on benchmarks requiring reasoning over documents up to 11 million tokens, RLMs substantially outperform both base models and competing agentic strategies such as retrieval and summarization agents. For example, RLM-GPT-5 achieved 91.3% on BrowseComp+, where base GPT-5 could not produce an answer, and roughly 50% accuracy on OOLONG-PAIRS at 1 million tokens, where baseline approaches scored near zero.
Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems
This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.
Governed Evolution of Agent Runtimes through Executable Operational Cognition
This paper proposes a framework for governed runtime evolution in multi-agent systems, formalizing agent-generated code artifacts as persistent runtime capabilities rather than transient outputs. It introduces HarnessMutation, a lifecycle-aware mechanism for runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. The framework models agent self-modification as a bounded, observable, and auditable process over persistent operational memory, building on prior 'Code as Agent Harness' work.
Awesome Harness Engineering: Curated List for AI Agent Infrastructure
A GitHub repository aggregating resources on AI agent harness engineering, covering tools, patterns, evaluations, memory systems, MCP (Model Context Protocol), permissions, observability, and orchestration. The list has accumulated 1,318 stars with 39 added today, indicating moderate community traction. It serves as a reference index rather than original research or tooling.
Unrolling the Codex Agent Loop
OpenAI published a technical deep dive into the Codex CLI agent loop, detailing how it orchestrates models, tools, and prompts via the Responses API. The post explains the internal architecture of the agentic coding system, including how the loop manages state, tool calls, and performance. This provides concrete implementation detail on how OpenAI structures production agent workflows on top of its API primitives.
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper argues that the next major bottleneck in agentic AI is system-level design—what the authors call 'scaling the harness'—rather than continued model scaling alone. The agent harness encompasses memory substrates, context constructors, skill-routing layers, orchestration loops, and verification/governance components that together translate model capability into long-horizon behavior. The authors identify three core bottlenecks (context governance, trustworthy memory, dynamic skill routing) and propose harness-level benchmarks measuring trajectory quality, memory hygiene, and verification cost. They introduce CheetahClaws, a Python-native reference harness, and compare it against Claude Code and OpenClaw.
Reversa: A Multi-Agent Framework for Reverse Engineering Legacy Software into AI-Readable Operational Specifications
Reversa is a multi-agent pipeline framework that converts legacy software systems into traceable operational specifications suitable for use by AI coding agents. The framework employs specialized agents for surface mapping, module analysis, implicit rule extraction, architecture synthesis, and specification review, with mechanisms for traceability, confidence marking, and gap preservation. An exploratory case study on migrating an ATM system from COBOL to Go produced 517 confidence-indexed claims, 53 Gherkin parity scenarios, and a partial reconstruction plan, though final validation was not completed. The system is distributed as a Node.js CLI and is positioned relative to literature on reverse engineering, LLM-based documentation, and software agents.
Frontier coding agents use metaprogramming to handle esoteric programming languages
A new arXiv paper evaluates six LLM-based coding agents on four esoteric programming languages (including Brainfuck and Befunge-98), finding that the strongest agents—Claude Opus 4.6 and GPT-5.4 xhigh—often avoid writing the target language directly, instead generating it via Python metaprograms. Forbidding this strategy causes large performance drops, and text guidance alone does not transfer the capability to weaker models, though sharing Opus-derived Python helper code does sharply improve mid-tier agents. The study reveals capability stratification that mainstream benchmarks like SWE-Bench Verified compress into narrow bands, suggesting frontier agents succeed by constructing and debugging working models of unfamiliar environments rather than pattern-matching to training data.
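To make the metaprogramming strategy concrete, here is a small Python sketch that generates a Brainfuck program instead of writing one by hand, then checks it with a minimal interpreter. It is an illustration only, not code from the paper: `bf_for_text` and `run_bf` are hypothetical helpers, the generator uses a single cell with no loops, and it assumes every character has a code point below 256.

```python
def bf_for_text(text):
    """Generate Brainfuck that prints text using one cell and relative moves."""
    program = []
    current = 0  # value the cell holds after the previous character
    for ch in text:
        target = ord(ch)
        delta = target - current
        program.append(("+" if delta >= 0 else "-") * abs(delta))
        program.append(".")
        current = target
    return "".join(program)


def run_bf(program, tape_len=30000):
    """Minimal Brainfuck interpreter (no input command) used to check the output."""
    # Precompute matching bracket positions for loops.
    stack, jumps = [], {}
    for i, c in enumerate(program):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape, ptr, pc, out = [0] * tape_len, 0, 0, []
    while pc < len(program):
        c = program[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
    return "".join(out)


if __name__ == "__main__":
    code = bf_for_text("Hi")
    print(run_bf(code))  # prints Hi
```

The generator and the checker live in the same host language, so an agent working this way can debug the target program by rerunning the interpreter rather than reasoning about the esoteric language directly.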


