6arXiv cs.CL (Computation and Language)·25d ago

ProAct: Proactive Agent Architecture Using Idle-Time Compute to Anticipate User Needs

ProAct is a proactive agent architecture that uses idle time between user interactions to predict upcoming needs, pre-fetch information, and resolve knowledge gaps before queries are issued. The system analyzes dialogue history and persistent memory to iteratively acquire relevant information in advance. Evaluated on the new ProActEval benchmark (200 scenarios, 40 domains), ProAct reduces required turns by 14.8%, user effort by 11.7%, and hallucination rates by 28.1% compared to reactive baselines. The work also achieves state-of-the-art reflective accuracy on MemBench.

Evaluation and Benchmarking Inference Economics Agent and Tool Ecosystem ProActEval idle-time compute ProAct proactive agent architecture MemBench

Related guides (3)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4arXiv · cs.CL·24d ago·source ↗

ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents

This paper introduces ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval in memory-augmented language agents deployed for emotional support applications. The benchmark includes over 1,800 memory-augmented dialogues grounded in Maslow's hierarchy of needs, with structured mappings between emotional needs and supportive memory types. Experiments show that both embedding-based and LLM-driven retrieval paradigms fall significantly short of golden memory conditions on empathy scores, and while chain-of-thought prompting helps, a substantial performance gap remains. The work highlights a systematic gap in current agent memory systems when applied to affective rather than purely factual retrieval tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem ENPMR-Bench chain-of-thought prompting Maslow's Hierarchy of Needs +1 more

5arXiv · cs.CL·5d ago·source ↗

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem Alignment and RLHF ALFWorld Sokoban RePro +2 more

6arXiv · cs.AI·25d ago·source ↗

Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access

Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.

Long Context Evolution Evaluation and Benchmarking multi-round event injection Claw-Anything large language model agents +3 more

5arXiv · cs.CL·11d ago·source ↗

RedAct framework protects procedural skills in agent execution traces via selective redaction and watermarking

Researchers introduce RedAct, a framework for releasing agent execution traces without exposing proprietary procedural skills (tool invocations, decision logic, error-recovery strategies). The system localizes sensitive information, rewrites traces while preserving audit-critical evidence, and embeds behavioral watermarks for provenance tracking. To evaluate the approach, the authors construct CapTraceBench, a benchmark of 75 long-horizon tasks and 154 skills across seven domains. RedAct reduces normalized skill transfer from 44.7–67.1% on raw traces to below the no-skill baseline, while watermark detection achieves 93.6–100% true positive rate with under 2% false alarms.

Evaluation and Benchmarking AI Safety Research RedAct CapTraceBench Xu Shuwen +1 more

5OpenAI Blog·1mo ago·source ↗

Moving from intent-based bots to proactive AI agents

OpenAI published a post describing a shift from traditional intent-based chatbot architectures toward proactive AI agents, in the context of a partnership or deployment with Zendesk. The piece signals OpenAI's positioning of its agent capabilities within enterprise customer service workflows. The announcement reflects a broader industry trend of replacing rule-based bots with autonomous, goal-directed AI systems.

Enterprise Deployment Patterns Agent and Tool Ecosystem intent-based bots Zendesk OpenAI +1 more

7arXiv · cs.CL·23d ago·source ↗

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.

Frontier Model Releases Agent and Tool Ecosystem AXPO GRPO Thinking-Acting Gap +4 more

4arXiv · cs.AI·4d ago·source ↗

PACT: Hybrid SLM deliberation architecture improves reactive RL policies in unfamiliar environments

Researchers propose PACT (Plan, Align, Commit, Think), a hybrid architecture pairing a fast reactive RL policy with an asynchronous small language model planner for deliberation. The SLM generates and validates candidate action plans via simulation before committing to execution, bypassing the RL policy without retraining. Evaluated on FrozenLake configurations of increasing difficulty, PACT outperforms baselines using only a 2B-parameter SLM, suggesting complementary strengths between deliberative planning and reactive execution.

Agent and Tool Ecosystem PACT

7arXiv · cs.AI·1mo ago·source ↗

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

This paper introduces agent just-in-time (JIT) compilation as an alternative to the sequential fetch-screenshot-execute loop used by current computer-use agents. The approach compiles natural language task descriptions directly into executable code that can include LLM calls, tool calls, and parallelization, using three components: JIT-Planner, JIT-Scheduler, and an invariant-enforcing tool protocol. Across five web applications, JIT-Planner achieves 10.4× speedup and +28% accuracy over Browser-Use, while JIT-Scheduler achieves 2.4× speedup and +9% accuracy over OpenAI CUA.

Frontier Model Releases Evaluation and Benchmarking JIT-Scheduler OpenAI CUA Browser-Use +6 more