6 · arXiv cs.CL (Computation and Language) · 18d ago

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

This paper identifies a privacy vulnerability in tool-augmented language agents that speculatively issue future tool calls to reduce latency: these 'ghost tool calls' leak inferred user intent to external services before the agent commits to a branch, and they cannot be unsent after the fact. The authors argue that timing, not authorization, is the core issue, and propose Speculative Tool Privacy Contracts, a runtime abstraction that treats pre-commitment observation as a distinct first-class effect. They implement a prototype runtime and evaluate twelve policies across three corpora, finding that only policies that suppress or modify arguments or destinations at issue time actually reduce inference leakage.
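
The issue-time idea can be illustrated with a minimal sketch. Everything here is hypothetical: the `ToolCall` type, the `issue_time_filter` function, and the `SENSITIVE_ARGS` set are illustrative assumptions, not the paper's Speculative Tool Privacy Contracts interface. The sketch shows only that a speculative (pre-commitment) call has its sensitive arguments suppressed before it leaves the runtime, while a committed call passes through unchanged.

```python
from dataclasses import dataclass, replace

# Hypothetical names throughout; not the paper's actual abstraction.
SENSITIVE_ARGS = {"diagnosis", "account_id"}  # assumed policy input


@dataclass(frozen=True)
class ToolCall:
    destination: str
    args: dict
    speculative: bool  # True if issued before the agent commits to a branch


def issue_time_filter(call: ToolCall) -> ToolCall:
    """Suppress sensitive arguments of speculative calls before they are sent."""
    if not call.speculative:
        return call  # committed calls are sent as-is
    safe_args = {k: v for k, v in call.args.items() if k not in SENSITIVE_ARGS}
    return replace(call, args=safe_args)


ghost = ToolCall("pharmacy-search", {"query": "open now", "diagnosis": "asthma"}, True)
filtered = issue_time_filter(ghost)  # the external service never sees "diagnosis"
```

The filter runs before the request is issued because, as the summary notes, a call that has already reached the external service cannot be unsent.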

Inference Economics · AI Safety Research · Agent and Tool Ecosystem · tool-augmented language agents · Speculative Tool Privacy Contracts · Ghost Tool Calls · speculative execution (AI agents)

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

4 · arXiv · cs.CL · 47h ago

Tool-intent stabilization analysis quantifies when streaming RAG latency hiding is possible

A new arXiv paper introduces 'tool-intent stabilization' (the point in a streaming input at which a speculative retrieval query converges to the correct result) and measures its distribution on the CRAG benchmark (1,371 questions). The authors derive a model-agnostic bound on how much tool latency can be hidden behind the remaining user input, finding that at realistic operating parameters 73.9% of queries admit substantial latency hiding. The study requires no model training and validates the bound against a working streaming pipeline. It also identifies query properties that predict early versus late stabilization.
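
To make the definition concrete, here is a small sketch under stated assumptions. It takes the top result a retriever returns for each successive prefix of the streaming input and reports the earliest prefix from which the result is correct and stays correct through the end. The function name and the use of a single top-1 result per prefix are illustrative choices, not the paper's exact metric.

```python
from typing import Optional, Sequence


def stabilization_index(results: Sequence[str], correct: str) -> Optional[int]:
    """Return the earliest prefix index from which every result equals `correct`.

    results[i] is the (hypothetical) top retrieval result for the input prefix
    ending at position i. Returns None if the final result is not correct.
    """
    if not results or results[-1] != correct:
        return None
    idx = len(results) - 1
    # Walk backwards while the preceding prefix already yields the correct result.
    while idx > 0 and results[idx - 1] == correct:
        idx -= 1
    return idx


# The result flips to "B" at index 1 but only stays there from index 3 onward.
per_prefix = ["A", "B", "C", "B", "B", "B"]
stable_at = stabilization_index(per_prefix, "B")
```

In this toy trace the query stabilizes at index 3 of 5, so a speculative retrieval issued at that point could overlap with the two remaining input steps.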

Inference Economics · Agent and Tool Ecosystem · CRAG · BM25 · When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

5 · arXiv · cs.CL · 47h ago

LedgerAgent: Structured state ledger improves policy-adherent tool-calling agents

LedgerAgent is an inference-time method that maintains explicit task state in a separate ledger rather than leaving state reconstruction implicit in the prompt, addressing two failure modes: stale/incorrect grounding and policy-violating tool calls. The ledger is used both to render current state into the prompt and to gate environment-changing tool calls against state-dependent policy constraints. Evaluated across four customer-service domains with a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over standard prompt-based tool-calling, with the largest gains under stricter multi-trial consistency metrics.
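
The two roles of the ledger described above (rendering state into the prompt and gating state-changing calls) can be sketched as follows. This is an illustration under assumptions, not LedgerAgent's implementation: the `Ledger` class, its fields, and the refund policy in `gate_refund` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Explicit task state kept outside the prompt (fields are hypothetical)."""
    state: dict = field(default_factory=dict)

    def render(self) -> str:
        # Render the current state as text to include in the prompt.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.state.items()))


def gate_refund(ledger: Ledger, amount: float) -> bool:
    """Assumed example policy: refund only verified users, up to the order total."""
    verified = bool(ledger.state.get("identity_verified"))
    return verified and amount <= ledger.state.get("order_total", 0.0)


ledger = Ledger({"identity_verified": False, "order_total": 40.0})
blocked = gate_refund(ledger, 25.0)       # identity not yet verified
ledger.state["identity_verified"] = True  # state updated after a verification step
allowed = gate_refund(ledger, 25.0)
```

Because the gate reads the ledger rather than the conversation history, the policy check does not depend on the model reconstructing state correctly from the prompt.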

Enterprise Deployment Patterns · Agent and Tool Ecosystem · LedgerAgent · LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

6 · arXiv · cs.CL · 5d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeliveryBench · AgentSpec · MiniGrid

6 · arXiv · cs.AI · 4d ago

Causal auditing framework detects privacy disclosures in synthetic data without model access

A new arXiv preprint introduces a model-agnostic empirical framework for auditing synthetic data generated by LLMs and generative AI systems for privacy leakage. The framework distinguishes 'true disclosures' (direct reproduction of user data) from 'phantom disclosures' (incidental generation), using held-out control sets and statistical hypothesis testing without requiring model access, canary insertion, or shadow model training. It functions as a membership inference attack and provides empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. The approach is computationally lightweight and applicable to any synthetic data generation mechanism.

Evaluation and Benchmarking · AI Safety Research · Differential Privacy · Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

5 · arXiv · cs.CL · 11d ago

RedAct framework protects procedural skills in agent execution traces via selective redaction and watermarking

Researchers introduce RedAct, a framework for releasing agent execution traces without exposing proprietary procedural skills (tool invocations, decision logic, error-recovery strategies). The system localizes sensitive information, rewrites traces while preserving audit-critical evidence, and embeds behavioral watermarks for provenance tracking. To evaluate the approach, the authors construct CapTraceBench, a benchmark of 75 long-horizon tasks and 154 skills across seven domains. RedAct reduces normalized skill transfer from 44.7–67.1% on raw traces to below the no-skill baseline, while watermark detection achieves 93.6–100% true positive rate with under 2% false alarms.

Evaluation and Benchmarking · AI Safety Research · RedAct · CapTraceBench · Xu Shuwen

5 · arXiv · cs.AI · 47h ago

Distributionally robust optimization framework for probabilistic runtime verification of AI agents

A new arXiv preprint introduces a sound and efficient framework for verifying probabilistic security policies for AI agents operating in complex digital environments, addressing limitations of prior Datalog-based approaches that assumed deterministic policies or predicate independence. The method uses distributionally robust optimization to compute sound upper bounds on policy violation probability without requiring independence assumptions between predicates. Evaluated on benchmarks for terminal and tool-calling agents, the approach outperforms prior art on the security-utility trade-off.

AI Safety Research · Agent and Tool Ecosystem · Datalog · Efficient and Sound Probabilistic Verification for AI Agents · distributionally robust optimization

6 · arXiv · cs.CL · 23d ago

MaskClaw: Edge-Side Privacy Arbitration System for GUI Agents with Behavior-Driven Skill Evolution

MaskClaw is an edge-side privacy arbitration framework for GUI agents that intercepts screenshots before they leave a trusted environment, applying Allow/Mask/Ask decisions based on local visual evidence and user-specific policy memory. The system addresses the gap where static PII detectors miss context-dependent privacy boundaries and cloud-side VLMs may upload raw screens before deciding what to protect. The authors introduce P-GUI-Evo, a new benchmark built from real UI patterns and sanitized labels, and demonstrate that pattern matching, cloud reasoning, and routing alone each exhibit systematic failure modes. The artifact is open-sourced on GitHub.
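
The Allow/Mask/Ask arbitration can be pictured with a minimal sketch. It assumes that screenshot regions have already been labeled locally and that the user's policy memory is a simple lookup table. The `Decision` enum, the `arbitrate` function, and the region labels are hypothetical and do not come from the MaskClaw artifact.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    MASK = "mask"
    ASK = "ask"


def arbitrate(region_label: str, policy_memory: dict) -> Decision:
    """Decide locally what happens to one screenshot region before upload.

    policy_memory maps a region label to a remembered Decision for this user.
    Unknown labels fall back to ASK so that the user makes the call.
    """
    return policy_memory.get(region_label, Decision.ASK)


memory = {"weather_widget": Decision.ALLOW, "bank_balance": Decision.MASK}
decisions = [arbitrate(r, memory) for r in ("weather_widget", "bank_balance", "chat_preview")]
```

Defaulting to ASK for unseen regions is a design assumption of this sketch: it keeps the decision on the trusted side rather than uploading the raw screen first.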

Evaluation and Benchmarking · AI Safety Research · visual language model · GUI Agents · MaskClaw

7 · arXiv · cs.CL · 23d ago

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.

Frontier Model Releases · Agent and Tool Ecosystem · AXPO · GRPO · Thinking-Acting Gap