5 · arXiv cs.CL (Computation and Language) · 47h ago

H-RePlan: Hierarchical recovery framework for multi-device computer-use agents

Researchers introduce H-RePlan, a hierarchical replanning framework for agents operating across multiple devices (Linux and Android) with unified API-CLI-GUI execution. The system separates device-local strategy recovery from orchestrator-level global replanning via a cross-layer failure abstraction, enabling finer-grained fault handling than existing retry or reassignment approaches. A companion benchmark, HeraBench, injects strategy- and device-level failures into cross-device workflows to evaluate recovery capability. Experiments show H-RePlan outperforms single-strategy and coarse-grained baselines on completion, instruction adherence, and token efficiency.
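
The two-layer split described above can be illustrated with a minimal sketch. It is not the paper's implementation: `StrategyError`, `DeviceFailure`, `run_on_device`, and `orchestrate` are hypothetical names, and global replanning is reduced here to trying the next device, which is much coarser than what the summary describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


class StrategyError(Exception):
    """Raised when one execution strategy (API, CLI, or GUI) fails on a device."""


@dataclass
class DeviceFailure:
    """Hypothetical cross-layer abstraction: the only thing the orchestrator sees."""
    device: str
    reasons: List[str]


Strategy = Callable[[str], str]


def run_on_device(device: str, step: str,
                  strategies: Dict[str, Strategy]) -> Union[str, DeviceFailure]:
    """Device-local recovery: try strategies in order, escalate only if all fail."""
    reasons: List[str] = []
    for name, strategy in strategies.items():
        try:
            return strategy(step)
        except StrategyError as exc:
            reasons.append(f"{name}: {exc}")
    return DeviceFailure(device, reasons)


def orchestrate(step: str, devices: Dict[str, Dict[str, Strategy]]) -> str:
    """Global layer: reacts only to escalated DeviceFailure objects."""
    failures: List[DeviceFailure] = []
    for device, strategies in devices.items():
        outcome = run_on_device(device, step, strategies)
        if isinstance(outcome, str):
            return outcome
        failures.append(outcome)  # stand-in for real global replanning
    raise RuntimeError(f"step {step!r} failed on all devices: {failures}")


def broken(step: str) -> str:
    """Assumed failing strategy used only for the demo below."""
    raise StrategyError("unavailable")


devices = {
    "linux": {"api": broken, "cli": lambda s: f"linux/cli did {s}"},
    "android": {"gui": lambda s: f"android/gui did {s}"},
}
# The API strategy fails, the CLI strategy recovers locally on the same device.
local_result = orchestrate("sync notes", devices)

# Every Linux strategy fails, so the failure escalates to the orchestrator.
devices["linux"]["cli"] = broken
global_result = orchestrate("sync notes", devices)
```

In the first call the orchestrator never learns that the API strategy failed. In the second, it receives a single `DeviceFailure` and moves the step to the Android device.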

Evaluation and Benchmarking · Agent and Tool Ecosystem · HeraBench · H-RePlan · Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.AI · 1mo ago

Reversa: A Multi-Agent Framework for Reverse Engineering Legacy Software into AI-Readable Operational Specifications

Reversa is a multi-agent pipeline framework that converts legacy software systems into traceable operational specifications suitable for use by AI coding agents. The framework employs specialized agents for surface mapping, module analysis, implicit rule extraction, architecture synthesis, and specification review, with mechanisms for traceability, confidence marking, and gap preservation. An exploratory case study on migrating an ATM system from COBOL to Go produced 517 confidence-indexed claims, 53 Gherkin parity scenarios, and a partial reconstruction plan, though final validation was not completed. The system is distributed as a Node.js CLI and is positioned relative to literature on reverse engineering, LLM-based documentation, and software agents.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · SHA-256 · Go (programming language) · Gherkin

7 · arXiv · cs.CL · 8d ago

Recursive Agent Harnesses (RAH): harness recursion extends model recursion for long-context coding agents

A new arXiv preprint introduces the Recursive Agent Harness (RAH), a pattern where a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. The authors frame this as 'harness recursion', a code-first extension of model recursion from recursive language models. Evaluated on the Oolong-Synthetic long-context benchmark, RAH improves over the Codex coding-agent baseline from 71.75% to 81.36% with GPT-5 as backbone, and reaches 89.77% with Claude Sonnet 4.5. The work connects emerging production patterns (e.g., Anthropic's dynamic workflows) to a formal architectural concept.

Long Context Evolution · Evaluation and Benchmarking · Claude Sonnet 4.5 · Oolong-Synthetic · Recursive Agent Harnesses

5 · arXiv · cs.CL · 5d ago

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem · Alignment and RLHF · ALFWorld · Sokoban · RePro

7 · arXiv · cs.CL · 25d ago

MobileGym: Verifiable Parallel Simulation Platform for Mobile GUI Agent Training

MobileGym is a browser-hosted simulation environment for mobile GUI agent research that enables deterministic outcome verification via structured JSON state and scalable online RL through hundreds of parallel instances (~400 MB/instance, ~3s cold start). The accompanying MobileGym-Bench provides 416 parameterized task templates across 28 apps with deterministic judges. A sim-to-real case study using GRPO on Qwen3-VL-4B-Instruct achieves +12.8 percentage points on the 256-task test set, with real-device execution retaining 95.1% of simulation-side training gains.
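
A deterministic judge over structured state, as described above, might look like the following minimal sketch. The state schema (`clock.alarm.*`), the template text, and the helper names `get_path` and `judge` are assumptions made for illustration; they are not taken from MobileGym or MobileGym-Bench.

```python
import json
from typing import Any, Dict


def get_path(state: Dict[str, Any], path: str) -> Any:
    """Follow a dotted key path such as 'clock.alarm.hour' through nested dicts."""
    node: Any = state
    for key in path.split("."):
        node = node[key]
    return node


def judge(state_json: str, expected: Dict[str, Any]) -> bool:
    """Pass only if every expected path holds the expected value in the final state."""
    state = json.loads(state_json)
    try:
        return all(get_path(state, p) == v for p, v in expected.items())
    except (KeyError, TypeError):
        return False  # a missing or malformed path counts as a failed task


# Hypothetical parameterized task template and the state it should produce.
template = "Set an alarm for {hour}:{minute:02d}"
params = {"hour": 7, "minute": 5}
task = template.format(**params)
expected = {
    "clock.alarm.hour": params["hour"],
    "clock.alarm.minute": params["minute"],
    "clock.alarm.enabled": True,
}

final_state = json.dumps(
    {"clock": {"alarm": {"hour": 7, "minute": 5, "enabled": True}}}
)
passed = judge(final_state, expected)
```

Because the verdict depends only on the serialized state and not on screenshots or a model-based grader, the same final state always yields the same result.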

Evaluation and Benchmarking · Inference Economics · MobileGym-Bench · GRPO · MobileGym

6 · arXiv · cs.CL · 25d ago

ProAct: Proactive Agent Architecture Using Idle-Time Compute to Anticipate User Needs

ProAct is a proactive agent architecture that uses idle time between user interactions to predict upcoming needs, pre-fetch information, and resolve knowledge gaps before queries are issued. The system analyzes dialogue history and persistent memory to iteratively acquire relevant information in advance. Evaluated on the new ProActEval benchmark (200 scenarios, 40 domains), ProAct reduces required turns by 14.8%, user effort by 11.7%, and hallucination rates by 28.1% compared to reactive baselines. The work also achieves state-of-the-art reflective accuracy on MemBench.

Evaluation and Benchmarking · Inference Economics · ProActEval · idle-time compute · ProAct

6 · arXiv · cs.AI · 10d ago

Piper: Programmable distributed training system decoupling parallelism strategy from runtime

Researchers present Piper, a distributed training system that separates parallelism strategy specification from low-level runtime execution via an intermediate representation (IR), a unified global training DAG. Users declare strategies through model annotations and scheduling directives, which Piper compiles into per-device execution plans. The system matches performance on standard strategies like ZeRO while enabling additional gains through joint compute-communication scheduling in composed strategies such as DeepSeek-V3's DualPipe.

Training Infrastructure · Frontier Model Releases · DeepSeek V4 · Piper · DualPipe

4 · arXiv · cs.AI · 4d ago

PACT: Hybrid SLM deliberation architecture improves reactive RL policies in unfamiliar environments

Researchers propose PACT (Plan, Align, Commit, Think), a hybrid architecture pairing a fast reactive RL policy with an asynchronous small language model planner for deliberation. The SLM generates and validates candidate action plans via simulation before committing to execution, bypassing the RL policy without retraining. Evaluated on FrozenLake configurations of increasing difficulty, PACT outperforms baselines using only a 2B-parameter SLM, suggesting complementary strengths between deliberative planning and reactive execution.
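
The validate-before-commit step can be sketched as follows. This is an illustrative toy under stated assumptions, not PACT itself: the 3x3 grid, the deterministic (non-slippery) transition model, and the functions `simulate` and `choose_plan` are all hypothetical, and the candidate plans are hard-coded rather than generated by a language model.

```python
from typing import List, Optional

# Assumed FrozenLake-style map: S = start, F = frozen, H = hole, G = goal.
GRID = ["SFF",
        "FHF",
        "FFG"]
MOVES = {"L": (0, -1), "R": (0, 1), "U": (-1, 0), "D": (1, 0)}


def simulate(plan: str) -> bool:
    """Roll a plan forward in the model; True only if it reaches G without a hole."""
    r, c = 0, 0
    for action in plan:
        dr, dc = MOVES[action]
        # Moves off the edge leave the agent in place, as in FrozenLake.
        r = min(max(r + dr, 0), len(GRID) - 1)
        c = min(max(c + dc, 0), len(GRID[0]) - 1)
        if GRID[r][c] == "H":
            return False
        if GRID[r][c] == "G":
            return True
    return False  # plan ended before reaching the goal


def choose_plan(candidates: List[str]) -> Optional[str]:
    """Commit to the first candidate that survives simulation, else None."""
    for plan in candidates:
        if simulate(plan):
            return plan
    return None


# Stand-ins for planner proposals; the first one walks into the hole.
candidates = ["DR", "RRDD", "DDRR"]
committed = choose_plan(candidates)
```

When `choose_plan` returns `None`, a caller would keep following the reactive policy, so the policy itself never needs retraining.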

Agent and Tool Ecosystem · PACT

4 · OpenAI Blog · 1mo ago

OpenAI Develops Hierarchical Reinforcement Learning Algorithm for Long-Horizon Tasks

OpenAI published research on a hierarchical reinforcement learning (HRL) algorithm that learns reusable high-level actions to solve tasks requiring thousands of timesteps. Applied to navigation problems, the algorithm discovers locomotion primitives (walking, crawling in various directions) that enable rapid mastery of new tasks. The approach addresses a core challenge in RL: efficient exploration and transfer across long-horizon tasks.

Agent and Tool Ecosystem · OpenAI · Hierarchical Reinforcement Learning