5arXiv cs.AI (Artificial Intelligence)·1mo ago

Reversa: A Multi-Agent Framework for Reverse Engineering Legacy Software into AI-Readable Operational Specifications

Reversa is a multi-agent pipeline framework that converts legacy software systems into traceable operational specifications suitable for use by AI coding agents. The framework employs specialized agents for surface mapping, module analysis, implicit rule extraction, architecture synthesis, and specification review, with mechanisms for traceability, confidence marking, and gap preservation. An exploratory case study on migrating an ATM system from COBOL to Go produced 517 confidence-indexed claims, 53 Gherkin parity scenarios, and a partial reconstruction plan, though final validation was not completed. The system is distributed as a Node.js CLI and is positioned relative to literature on reverse engineering, LLM-based documentation, and software agents.

Enterprise Deployment Patterns Agent and Tool Ecosystem SHA-256 Go (programming language)Gherkin COBOL Node.js Reversa

Related guides (2)

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

5arXiv · cs.CL·11d ago·source ↗

RedAct framework protects procedural skills in agent execution traces via selective redaction and watermarking

Researchers introduce RedAct, a framework for releasing agent execution traces without exposing proprietary procedural skills (tool invocations, decision logic, error-recovery strategies). The system localizes sensitive information, rewrites traces while preserving audit-critical evidence, and embeds behavioral watermarks for provenance tracking. To evaluate the approach, the authors construct CapTraceBench, a benchmark of 75 long-horizon tasks and 154 skills across seven domains. RedAct reduces normalized skill transfer from 44.7–67.1% on raw traces to below the no-skill baseline, while watermark detection achieves 93.6–100% true positive rate with under 2% false alarms.

Evaluation and Benchmarking AI Safety Research RedAct CapTraceBench Xu Shuwen +1 more

5arXiv · cs.CL·6d ago·source ↗

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem Alignment and RLHF ALFWorld Sokoban RePro +2 more

5arXiv · cs.LG·4d ago·source ↗

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.

Evaluation and Benchmarking Agent and Tool Ecosystem OpenAI ReproRepo Codex +1 more

7arXiv · cs.CL·9d ago·source ↗

Recursive Agent Harnesses (RAH): harness recursion extends model recursion for long-context coding agents

A new arXiv preprint introduces the Recursive Agent Harness (RAH), a pattern where a parent agent generates executable scripts that spawn parallel subagent harnesses with filesystem tools, code execution, and planning capabilities. The authors frame this as 'harness recursion', a code-first extension of model recursion from recursive language models. Evaluated on the Oolong-Synthetic long-context benchmark, RAH improves over the Codex coding-agent baseline from 71.75% to 81.36% with GPT-5 as backbone, and reaches 89.77% with Claude Sonnet 4.5. The work connects emerging production patterns (e.g., Anthropic's dynamic workflows) to a formal architectural concept.

Long Context Evolution Evaluation and Benchmarking Claude Sonnet 4.5 Oolong-Synthetic Recursive Agent Harnesses +4 more

6arXiv · cs.AI·26d ago·source ↗

VeriTrace: Cognitive-Graph Framework with Explicit Regulatory Loops for Deep Research Agents

VeriTrace introduces a cognitive-graph framework for deep research agents that replaces implicit LLM reasoning over intermediate representations with three explicit regulatory loops: interpretive update, deviation feedback, and schema revision. The system addresses contamination and error propagation in evolving mental models during complex multi-step research tasks. Using Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench Insight and 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DeepResearch Bench.

Frontier Model Releases Evaluation and Benchmarking DeepSeek V4 cognitive-graph DeepResearch Bench +4 more

7Mistral Ai News·1mo ago·source ↗

Mistral Releases Leanstral: First Open-Source Code Agent for Lean 4 Formal Verification

Mistral AI has released Leanstral, an open-source code agent built on a sparse 120B/6B-active-parameter architecture, designed specifically for formal proof engineering in Lean 4. The model targets realistic proof engineering workflows rather than isolated math competition problems, and is benchmarked on FLTEval, a new evaluation suite tied to the Fermat's Last Theorem formalization project. Leanstral is released under Apache 2.0 with a free API endpoint and MCP support, and demonstrates competitive performance against Claude Sonnet 4.6 at roughly 1/15th the cost. The release positions formal verification as a scalable alternative to human code review for high-stakes software and mathematics.

Evaluation and Benchmarking Open Weights Progress Mistral AI Claude Sonnet 4 Claude Opus 4.6 +11 more

5arXiv · cs.CL·2d ago·source ↗

H-RePlan: Hierarchical recovery framework for multi-device computer-use agents

Researchers introduce H-RePlan, a hierarchical replanning framework for agents operating across multiple devices (Linux and Android) with unified API-CLI-GUI execution. The system separates device-local strategy recovery from orchestrator-level global replanning via a cross-layer failure abstraction, enabling finer-grained fault handling than existing retry or reassignment approaches. A companion benchmark, HeraBench, injects strategy- and device-level failures into cross-device workflows to evaluate recovery capability. Experiments show H-RePlan outperforms single-strategy and coarse-grained baselines on completion, instruction adherence, and token efficiency.

Evaluation and Benchmarking Agent and Tool Ecosystem HeraBench H-RePlan Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

6arXiv · cs.CL·6d ago·source ↗

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem DeliveryBench AgentSpec MiniGrid +2 more