5 · arXiv cs.CL (Computation and Language) · 11d ago

SIGA: Self-evolving grounding adapters enable coding agents to operate scientific simulators

SIGA (Simulator-Interface Grounding Adapter) is a lightweight adapter framework that equips general-purpose coding agents with the executable contracts needed to configure and run specialized scientific simulators. Evaluated primarily on GEOS (a multiphysics subsurface simulator), SIGA achieves a ~36x wall-clock speedup over human experts and improves TreeSim scores from 0.720 to 0.789 on held-out tasks, with self-evolution via trajectory rewriting yielding further gains. The system also transfers to OpenFOAM and LAMMPS, revealing that the dominant grounding mechanism (validation vs. memory/retrieval) shifts depending on the interface type. The work frames simulator setup as an agent-tool interface grounding problem, offering a generalizable pattern for deploying coding agents on domain-specific software.

Evaluation and Benchmarking · Agent and Tool Ecosystem · TreeSim · GEOS · OpenFOAM · SIGA · LAMMPS

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv cs.CL · 2d ago

Decoupled Search Grounding (DSG): vendor-agnostic MCP-compatible architecture for LLM agent retrieval

Researchers introduce Decoupled Search Grounding (DSG), an architecture that moves real-time search grounding outside the reasoning model via an MCP-compatible gateway, exposing provider routing, caching, and retrieval-depth as explicit controls. Evaluated across five frontier models on SimpleQA, FreshQA, and HotpotQA, DSG nearly matches native search accuracy on SimpleQA (86.1% vs. 87.7%) while achieving 91% lower search cost and 68% lower latency via a 99.4% warm-cache hit rate. In a production e-commerce deployment, DSG cuts search cost by over 98% while matching or slightly exceeding native-search accuracy. The work frames real-time grounding as an optimizable interface boundary rather than a fixed model feature, with direct relevance to MCP-based agent infrastructure.

Inference Economics · Enterprise Deployment Patterns · FreshQA · HotpotQA · Decoupled Search Grounding
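
As a rough illustration of the idea in the DSG summary above (search grounding handled by a gateway outside the model, with caching and retrieval depth as explicit controls), here is a minimal Python sketch. It is not the paper's implementation and does not speak MCP. The names `CachedSearchGateway`, `fake_provider`, `ttl_seconds`, and `depth` are hypothetical, chosen only for this example.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedSearchGateway:
    """Hypothetical gateway: a TTL cache in front of any search provider."""

    def __init__(
        self,
        provider: Callable[[str, int], List[str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._provider = provider
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps (normalized query, depth) -> (time stored, results).
        self._cache: Dict[Tuple[str, int], Tuple[float, List[str]]] = {}
        self.hits = 0
        self.misses = 0

    def search(self, query: str, depth: int = 3) -> List[str]:
        # Retrieval depth is part of the cache key, so it stays an explicit control.
        key = (query.strip().lower(), depth)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1  # warm-cache hit: no provider call, no provider cost
            return entry[1]
        self.misses += 1
        results = self._provider(query, depth)
        self._cache[key] = (now, results)
        return results

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_provider(query: str, depth: int) -> List[str]:
    """Stand-in for a real search backend (assumption for this sketch)."""
    return [f"{query} result {i}" for i in range(depth)]
```

Because the provider is just a callable, swapping search backends or changing the cache lifetime does not touch the reasoning model, which is the separation the summary describes.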

6 · arXiv cs.AI · 24d ago

GENESIS: Agentic AI Framework for Autonomous 6G RAN Synthesis, Research, and Testing

GENESIS is an agentic AI framework designed to automate the full R&D lifecycle for 6G Radio Access Networks (RAN), addressing six structural bottlenecks that each consume months of manual engineering per iteration. The system converts high-level intents—such as specification clauses, telemetry anomalies, or research hypotheses—into solutions validated via over-the-air experiments. It is built on three composable primitives (agents, skills, hooks) and a persistent knowledge layer called SYNAPSE that accumulates artifacts across runs. The framework specifically targets known LLM failure modes in RAN contexts, including API hallucination and simulation-to-hardware transfer gaps.

Training Infrastructure · Enterprise Deployment Patterns · large language models · 6G Radio Access Network · SYNAPSE

5 · arXiv cs.CL · 11d ago

AGENTSERVESIM: Hardware-aware simulator for multi-turn LLM agent serving policies

Researchers introduce AGENTSERVESIM, a simulation framework designed to evaluate serving policies for multi-turn LLM agents without requiring dedicated accelerator hardware. The simulator models program-level execution including turn dependencies, tool-induced gaps, and KV-cache residency across HBM, host DRAM, and CXL memory hierarchies. It reproduces real-system behavior within 6% error on key performance metrics while running on commodity CPUs, enabling cost-effective exploration of scheduling, routing, and cache management policies for agentic workloads.

Training Infrastructure · Inference Economics · AGENTSERVESIM

7 · arXiv cs.CL · 24d ago

SIA: Self-Improving AI via Joint Harness and Weight Updates

SIA proposes a self-improving loop in which a Feedback-Agent simultaneously updates both the scaffold (harness) and the model weights of a task-specific agent, unifying two previously disjoint research lines: meta-agent scaffold rewriting and test-time training. The system is evaluated on three diverse benchmarks (Chinese legal charge classification, GPU kernel optimization, and single-cell RNA denoising), with reported improvements over baselines of 56.6%, a 91.9% runtime reduction, and 502%, respectively. The paper argues that harness updates shape agentic behavior while weight updates instill domain intuition that prompting alone cannot provide, and that combining both levers consistently outperforms either alone.

Frontier Model Releases · Evaluation and Benchmarking · LawBench · SIA (Self Improving AI) · harness update

7 · arXiv cs.CL · 25d ago

MobileGym: Verifiable Parallel Simulation Platform for Mobile GUI Agent Training

MobileGym is a browser-hosted simulation environment for mobile GUI agent research that enables deterministic outcome verification via structured JSON state and scalable online RL through hundreds of parallel instances (~400 MB/instance, ~3s cold start). The accompanying MobileGym-Bench provides 416 parameterized task templates across 28 apps with deterministic judges. A sim-to-real case study using GRPO on Qwen3-VL-4B-Instruct achieves +12.8 percentage points on the 256-task test set, with real-device execution retaining 95.1% of simulation-side training gains.

Evaluation and Benchmarking · Inference Economics · MobileGym-Bench · GRPO · MobileGym
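
To make "deterministic outcome verification via structured JSON state" from the MobileGym summary above more concrete, here is a minimal Python sketch of a judge that checks a final app state against expected values. The state layout, the dotted-path convention, and the function name `judge` are assumptions for illustration, not MobileGym's actual schema or API.

```python
import json
from typing import Any, Dict


def judge(state_json: str, expected: Dict[str, Any]) -> bool:
    """Return True only if every expected dotted path holds the expected value.

    The result depends solely on the serialized state, so repeated runs on the
    same state always agree (no model-based or heuristic grading involved).
    """
    state = json.loads(state_json)
    for dotted_path, want in expected.items():
        node: Any = state
        for key in dotted_path.split("."):
            # A missing key or a non-object along the path means the task failed.
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        if node != want:
            return False
    return True
```

For example, a hypothetical "set an alarm for 7:00" task could be judged with `judge(final_state, {"alarm.hour": 7, "alarm.enabled": True})`, where `final_state` is the JSON string exported by the simulated app.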

4 · GitHub Trending · 1mo ago

Agent-S: Open Agentic Framework for Human-Like Computer Use

Agent-S is an open-source Python framework by Simular AI designed to enable AI agents to interact with computers in a human-like manner. The project has accumulated 11,388 GitHub stars with modest daily growth of 29 stars. It represents an entry in the growing space of computer-use agent frameworks targeting GUI and desktop automation tasks.

Open Weights Progress · Agent and Tool Ecosystem · Agent-S · Simular AI

7 · Google DeepMind Blog · 1mo ago

SIMA 2: An Agent that Plays, Reasons, and Learns With You in Virtual 3D Worlds

DeepMind has announced SIMA 2, a successor to its Scalable Instructable Multiworld Agent, powered by Gemini and designed to think, reason, and act within interactive 3D virtual environments. The agent is an advance in embodied AI, able to operate across diverse game and simulation worlds. It builds on DeepMind's earlier SIMA work, which demonstrated generalist instruction-following agents in video game environments.

Frontier Model Releases · Agent and Tool Ecosystem · SIMA 2 · SIMA · Google DeepMind

6 · arXiv cs.CL · 5d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeliveryBench · AgentSpec · MiniGrid
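
The "typed compositions of reusable policy components with standardized interfaces" described in the AgentSpec summary above can be pictured with a small Python sketch using `typing.Protocol`. This is only an illustration of the general pattern under assumed interfaces. The module names and method signatures (`observe`, `remember`, `act`) are hypothetical and are not AgentSpec's actual API.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Perception(Protocol):
    def observe(self, raw: str) -> str: ...


class Memory(Protocol):
    def remember(self, observation: str) -> List[str]: ...


class Policy(Protocol):
    def act(self, observation: str, history: List[str]) -> str: ...


@dataclass
class Agent:
    """An agent is a composition of components that satisfy the interfaces."""

    perception: Perception
    memory: Memory
    policy: Policy

    def step(self, raw: str) -> str:
        obs = self.perception.observe(raw)
        history = self.memory.remember(obs)
        return self.policy.act(obs, history)


class Lowercase:
    def observe(self, raw: str) -> str:
        return raw.strip().lower()


class WindowMemory:
    """Keeps only the most recent `size` observations."""

    def __init__(self, size: int = 2) -> None:
        self.size = size
        self.items: List[str] = []

    def remember(self, observation: str) -> List[str]:
        self.items.append(observation)
        self.items = self.items[-self.size:]
        return list(self.items)


class NoMemory:
    """Drop-in replacement that forgets everything but the current observation."""

    def remember(self, observation: str) -> List[str]:
        return [observation]


class CountPolicy:
    def act(self, observation: str, history: List[str]) -> str:
        return f"seen {len(history)}: {observation}"
```

Swapping `WindowMemory` for `NoMemory` changes the agent's behavior without touching the other modules, which is the kind of controlled substitution that makes interaction effects between components measurable.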