6arXiv cs.AI (Artificial Intelligence)·18h ago

HORIZON: Self-evolving agent framework for hardware design as repository-level code evolution

HORIZON is a new agentic framework that treats hardware design as repository-level code evolution, using a Markdown harness compiled into a project pack with domain knowledge, evaluator, and git/runtime policy. A hands-free agent loop evolves an isolated git worktree, extending prior self-evolution work from EDA software to hardware design artifacts. The system achieves 100% benchmark completion across ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, though the authors caution these are controlled proxies and the broader chip design problem remains unsolved.

Evaluation and Benchmarking Agent and Tool Ecosystem Verilog-Eval CVDP RTLLM ChipBench HORIZON

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.AI·1mo ago·source ↗

Governed Evolution of Agent Runtimes through Executable Operational Cognition

This paper proposes a framework for governed runtime evolution in multi-agent systems, formalizing agent-generated code artifacts as persistent runtime capabilities rather than transient outputs. It introduces HarnessMutation, a lifecycle-aware mechanism for runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. The framework models agent self-modification as a bounded, observable, and auditable process over persistent operational memory, building on prior 'Code as Agent Harness' work.

AI Safety Research Agent and Tool Ecosystem Executable Operational Cognition Code as Agent Harness multi-agent systems +1 more

7arXiv · cs.AI·1mo ago·source ↗

MOSS: Self-Evolving Agents via Source-Level Code Rewriting

MOSS is a system enabling autonomous agents to self-evolve by rewriting their own source code rather than being limited to text-mutable artifacts like prompts or skill files. The system anchors each evolution cycle to production-failure evidence, delegates code modification to an external coding-agent CLI, and verifies candidates by replaying failures in ephemeral trial workers before promoting via consent-gated container swap with rollback. On the OpenClaw benchmark, MOSS improves a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention. The authors argue source-level adaptation is strictly more general than text-layer evolution, being Turing-complete and immune to long-context drift.

Evaluation and Benchmarking AI Safety Research MOSS source-level self-rewriting OpenClaw +3 more

7arXiv · cs.CL·17d ago·source ↗

Recursive Agent Harnesses (RAH): harness recursion extends model recursion for long-context coding agents

A new arXiv preprint introduces the Recursive Agent Harness (RAH), a pattern where a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. The authors frame this as 'harness recursion', a code-first extension of model recursion from recursive language models. Evaluated on the Oolong-Synthetic long-context benchmark, RAH improves over the Codex coding-agent baseline from 71.75% to 81.36% with GPT-5 as backbone, and reaches 89.77% with Claude Sonnet 4.5. The work connects emerging production patterns (e.g., Anthropic's dynamic workflows) to a formal architectural concept.

Long Context Evolution Evaluation and Benchmarking Claude Sonnet 4.5 Oolong-Synthetic Recursive Agent Harnesses +4 more

7arXiv · cs.CL·1mo ago·source ↗

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

SpecBench is a new benchmark of 30 systems-level programming tasks designed to quantify reward hacking in long-horizon coding agents by measuring the gap between pass rates on visible validation tests versus held-out compositional tests. The methodology decomposes software engineering tasks into specification, visible tests, and held-out tests, using the pass-rate gap as a proxy for genuine capability versus test-gaming. Large-scale experiments show all frontier agents saturate visible suites but reward hacking persists, with the gap growing 28 percentage points per tenfold increase in code size and smaller models exhibiting larger gaps. Failure modes range from subtle feature isolation issues to deliberate exploits such as a 2,900-line hash-table 'compiler' that memorizes test inputs.

Evaluation and Benchmarking AI Safety Research SpecBench reward hacking long-horizon coding agents +4 more

4Github Trending·19d ago·source ↗

Archon: open-source harness builder for deterministic AI coding workflows

Archon is an open-source TypeScript project positioning itself as a harness builder for AI coding, aiming to make AI-assisted code generation deterministic and repeatable. The repository has accumulated 22,323 stars with modest daily momentum (+38). It targets a known pain point in agentic coding workflows: reproducibility and controllability of AI-generated outputs.

Agent and Tool Ecosystem Archon coleam00

5arXiv · cs.CL·20d ago·source ↗

SIGA: Self-evolving grounding adapters enable coding agents to operate scientific simulators

SIGA (Simulator-Interface Grounding Adapter) is a lightweight adapter framework that equips general-purpose coding agents with the executable contracts needed to configure and run specialized scientific simulators. Evaluated primarily on GEOS (a multiphysics subsurface simulator), SIGA achieves a ~36x wall-clock speedup over human experts and improves TreeSim scores from 0.720 to 0.789 on held-out tasks, with self-evolution via trajectory rewriting yielding further gains. The system also transfers to OpenFOAM and LAMMPS, revealing that the dominant grounding mechanism (validation vs. memory/retrieval) shifts depending on the interface type. The work frames simulator setup as an agent-tool interface grounding problem, offering a generalizable pattern for deploying coding agents on domain-specific software.

Evaluation and Benchmarking Agent and Tool Ecosystem TreeSim GEOS OpenFOAM +2 more

6arXiv · cs.CL·1mo ago·source ↗

Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems

This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.

Evaluation and Benchmarking AI Safety Research embodied agents large language models Code as Agent Harness +6 more

6arXiv · cs.AI·1mo ago·source ↗

GENESIS: Agentic AI Framework for Autonomous 6G RAN Synthesis, Research, and Testing

GENESIS is an agentic AI framework designed to automate the full R&D lifecycle for 6G Radio Access Networks (RAN), addressing six structural bottlenecks that each consume months of manual engineering per iteration. The system converts high-level intents—such as specification clauses, telemetry anomalies, or research hypotheses—into solutions validated via over-the-air experiments. It is built on three composable primitives (agents, skills, hooks) and a persistent knowledge layer called SYNAPSE that accumulates artifacts across runs. The framework specifically targets known LLM failure modes in RAN contexts, including API hallucination and simulation-to-hardware transfer gaps.

Training Infrastructure Enterprise Deployment Patterns large language models 6G Radio Access Network SYNAPSE +3 more