7arXiv cs.CL (Computation and Language)·24d ago

SIA: Self-Improving AI via Joint Harness and Weight Updates

SIA proposes a self-improving loop in which a Feedback-Agent simultaneously updates both the scaffold (harness) and model weights of a task-specific agent, unifying two previously disjoint research lines: meta-agent scaffold rewriting and test-time training. The system is evaluated on three diverse benchmarks—Chinese legal charge classification, GPU kernel optimization, and single-cell RNA denoising—achieving gains of 56.6%, 91.9% runtime reduction, and 502% respectively over baselines. The paper argues that harness updates shape agentic behavior while weight updates instill domain intuition that prompting alone cannot provide, and that combining both levers consistently outperforms either alone.

Frontier Model Releases Evaluation and Benchmarking Agent and Tool Ecosystem Alignment and RLHF LawBench SIA (Self Improving AI)harness update test-time training Feedback-Agent

Related guides (4)

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Read asIn-depth

Related events (8)

4Github Trending·9d ago·source ↗

SIA: Self-Improving AI framework for autonomous benchmark performance improvement

SIA (Self Improving AI) is an open-source Python framework from hexo-ai designed to autonomously improve the performance of AI models or agents on benchmark tasks. The repository is trending on GitHub with 1,228 total stars and 177 new stars today. The framework targets a core challenge in AI development: automated self-improvement loops without human intervention.

Evaluation and Benchmarking Agent and Tool Ecosystem SIA hexo-ai

6arXiv · cs.LG·25d ago·source ↗

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper argues that the next major bottleneck in agentic AI is system-level design—what the authors call 'scaling the harness'—rather than continued model scaling alone. The agent harness encompasses memory substrates, context constructors, skill-routing layers, orchestration loops, and verification/governance components that together translate model capability into long-horizon behavior. The authors identify three core bottlenecks (context governance, trustworthy memory, dynamic skill routing) and propose harness-level benchmarks measuring trajectory quality, memory hygiene, and verification cost. They introduce CheetahClaws, a Python-native reference harness, and compare it against Claude Code and OpenClaw.

Evaluation and Benchmarking Inference Economics SafeRL-Lab dynamic skill routing Scaling the Harness (paper)+8 more

5Interconnects·1mo ago·source ↗

Lossy self-improvement

This commentary from Interconnects argues that AI self-improvement is a real phenomenon but that inherent lossiness in the process prevents it from leading to fast takeoff scenarios. The piece appears to engage with the debate over recursive self-improvement and its implications for AI risk timelines. It offers a nuanced middle-ground position: acknowledging self-improvement capability while contesting the discontinuous-growth narrative common in AI safety discourse.

Frontier Model Releases AI Safety Research Interconnects Recursive Self-Improvement fast takeoff

3Github Trending·22d ago·source ↗

Awesome Harness Engineering: Curated List for AI Agent Infrastructure

A GitHub repository aggregating resources on AI agent harness engineering, covering tools, patterns, evaluations, memory systems, MCP (Model Context Protocol), permissions, observability, and orchestration. The list has accumulated 1,318 stars with 39 added today, indicating moderate community traction. It serves as a reference index rather than original research or tooling.

Evaluation and Benchmarking Agent and Tool Ecosystem ai-boost/awesome-harness-engineering Model Context Protocol

5arXiv · cs.CL·9d ago·source ↗

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.

Evaluation and Benchmarking Inference Economics SWE-Bench Multilingual OpenClaw SWE-Bench Verified +4 more

6arXiv · cs.CL·1mo ago·source ↗

Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems

This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.

Evaluation and Benchmarking AI Safety Research embodied agents large language models Code as Agent Harness +6 more

5arXiv · cs.CL·11d ago·source ↗

SIGA: Self-evolving grounding adapters enable coding agents to operate scientific simulators

SIGA (Simulator-Interface Grounding Adapter) is a lightweight adapter framework that equips general-purpose coding agents with the executable contracts needed to configure and run specialized scientific simulators. Evaluated primarily on GEOS (a multiphysics subsurface simulator), SIGA achieves a ~36x wall-clock speedup over human experts and improves TreeSim scores from 0.720 to 0.789 on held-out tasks, with self-evolution via trajectory rewriting yielding further gains. The system also transfers to OpenFOAM and LAMMPS, revealing that the dominant grounding mechanism (validation vs. memory/retrieval) shifts depending on the interface type. The work frames simulator setup as an agent-tool interface grounding problem, offering a generalizable pattern for deploying coding agents on domain-specific software.

Evaluation and Benchmarking Agent and Tool Ecosystem TreeSim GEOS OpenFOAM +2 more

7arXiv · cs.CL·25d ago·source ↗

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA)SWE-Bench Verified +2 more