5 · arXiv cs.AI (Artificial Intelligence) · 47h ago

Sovereign Execution Brokers: Certificate-Bound Authority Enforcement for Agentic Control Planes

This arXiv paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary that separates proposal, admission, and execution phases for autonomous agents operating in cloud and infrastructure environments. SEB consumes certificates from a companion Sovereign Assurance Boundary (SAB), verifies mutations against certified execution contracts, mints scoped short-lived identities, and produces signed audit records. The architecture addresses a gap in existing access-control and assurance systems by providing a mandatory enforcement point at the moment of infrastructure mutation. A prototype is evaluated on AWS and Kubernetes, measuring latency, revocation propagation, drift detection, and fault-injection security.
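The separation of proposal, admission, and execution described above can be illustrated with a minimal Python sketch. This is not the paper's prototype. The names `Certificate`, `Proposal`, `Broker`, `admit`, and `execute`, the HMAC-based audit signature, and the 60-second identity lifetime are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
import secrets
import time
from dataclasses import dataclass

# Hypothetical shared key; a real broker would more likely use asymmetric signatures.
AUDIT_KEY = b"demo-audit-key"


@dataclass(frozen=True)
class Certificate:
    """Stand-in for a SAB-issued certificate (field names are assumptions)."""
    cert_id: str
    allowed_actions: frozenset
    allowed_resources: frozenset
    expires_at: float


@dataclass(frozen=True)
class Proposal:
    """A mutation an agent wants to perform; it carries no authority itself."""
    action: str
    resource: str


def sign(record: dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the audit record."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()


class Broker:
    def __init__(self):
        self.revoked = set()
        self.audit_log = []

    def admit(self, proposal, cert, now):
        """Admission phase: check the proposal against the certified contract."""
        if cert.cert_id in self.revoked:
            return False, "certificate revoked"
        if now >= cert.expires_at:
            return False, "certificate expired"
        if proposal.action not in cert.allowed_actions:
            return False, "action not certified"
        if proposal.resource not in cert.allowed_resources:
            return False, "resource not certified"
        return True, "ok"

    def execute(self, proposal, cert, mutate, now=None):
        """Execution phase: the mutation runs only after admission succeeds."""
        now = time.time() if now is None else now
        admitted, reason = self.admit(proposal, cert, now)
        record = {"cert": cert.cert_id, "action": proposal.action,
                  "resource": proposal.resource, "admitted": admitted,
                  "reason": reason, "time": now}
        if admitted:
            # Short-lived identity scoped to this single mutation.
            identity = {"token": secrets.token_hex(8),
                        "scope": proposal.action,
                        "expires_at": now + 60}
            mutate(proposal, identity)
        # Denied attempts are recorded and signed as well.
        record["signature"] = sign(record)
        self.audit_log.append(record)
        return admitted, reason
```

In this sketch the agent only ever holds a `Proposal`. The credential is minted inside `execute` and handed straight to the `mutate` callback, so there is no path to the mutation that bypasses the admission check.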

AI Safety Research · Enterprise Deployment Patterns · Agent and Tool Ecosystem · AWS · Kubernetes · Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes · Sovereign Assurance Boundary · Sovereign Execution Broker · Amazon Web Services

Related guides (4)

Amazon Web Services

Amazon Web Services: The Cloud Backbone of the AI Era

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Related events (8)

6 · arXiv · cs.CL · 5d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeliveryBench · AgentSpec · MiniGrid

6 · arXiv · cs.AI · 1mo ago

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

This paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agent runtimes, defining it as a four-part contract (proposer, verifier, commit step, reject signal) governing how LLM outputs become system actions. The authors organize agent runtime design around Coordination, State, and Control concerns, presenting a catalog of six runtime patterns applicable to conversational, autonomous, and long-horizon agents. The paper also contributes a five-step pattern-selection methodology and a diagnostic procedure that maps production failures to pattern weaknesses, and it names a new failure mode, replay divergence, in which LLM consumers of deterministic event logs produce inconsistent outputs across model versions or prompt changes. The paper argues that as model variance decreases, architectural pattern choice and SDB strength become the dominant reliability levers.
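The four-part contract (proposer, verifier, commit step, reject signal) can be sketched in a few lines of Python. This is an illustrative reading of the summary above and not code from the paper. The function name `run_boundary`, the `Outcome` type, and the retry loop that feeds the reject signal back to the proposer are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Outcome:
    committed: bool
    action: Any = None
    reject_reason: Optional[str] = None


def run_boundary(
    propose: Callable[[Optional[str]], Any],   # stochastic side, e.g. an LLM call
    verify: Callable[[Any], Optional[str]],    # deterministic check; None means accept
    commit: Callable[[Any], None],             # the only place state is changed
    max_attempts: int = 3,
) -> Outcome:
    """Let a proposal cross the boundary only if the verifier accepts it."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        action = propose(feedback)
        reason = verify(action)
        if reason is None:
            commit(action)
            return Outcome(committed=True, action=action)
        # Reject signal: returned to the proposer as feedback for the next attempt.
        feedback = reason
    return Outcome(committed=False, reject_reason=feedback)
```

A small usage example with a deterministic stand-in for the proposer:

```python
state = {"balance": 100}
attempts = iter([150, 40])

outcome = run_boundary(
    propose=lambda feedback: next(attempts),
    verify=lambda amount: None if amount <= state["balance"] else "insufficient balance",
    commit=lambda amount: state.update(balance=state["balance"] - amount),
)
```

The first proposal (150) is rejected and the second (40) is committed, so state changes only through `commit`.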

Evaluation and Benchmarking · Enterprise Deployment Patterns · replay divergence · human-in-the-loop pattern · hierarchical delegation pattern

6 · arXiv · cs.CL · 11d ago

CHAP: Collaborative Human-Agent Protocol for structured human-AI accountability in multi-agent deployments

Researchers from BrightbeamAI introduce CHAP (Collaborative Human-Agent Protocol), a protocol specification for formalizing human-agent collaboration in production multi-agent systems. CHAP defines shared workspaces, structured override events with diffs and rationales, non-repudiable signed approvals, and an append-only evidence log, filling a gap left by MCP (tool access) and A2A (agent-to-agent interoperability). The protocol ships with a reference implementation, conformance suite, and worked examples. It targets high-stakes deployments in domains like clinical decisions, contracts, and code where human judgment must be auditable and replayable.
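One way to picture an append-only evidence log with signed entries is a hash chain, sketched below in Python. The summary does not say how CHAP implements its log, so the hash chaining, the `EvidenceLog` class, and its methods are assumptions. HMAC is used only to keep the example within the standard library. Because it is a shared-key scheme, it does not provide real non-repudiation, which would need asymmetric signatures.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _payload(event: dict, prev_hash: str) -> bytes:
    """Canonical encoding of one entry's content."""
    return json.dumps({"event": event, "prev_hash": prev_hash}, sort_keys=True).encode()


class EvidenceLog:
    """Hypothetical hash-chained log: each entry commits to the one before it."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict, signer_key: bytes) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        payload = _payload(event, prev_hash)
        entry = {
            "event": event,
            "prev_hash": prev_hash,
            "hash": hashlib.sha256(payload).hexdigest(),
            "signature": hmac.new(signer_key, payload, hashlib.sha256).hexdigest(),
        }
        self._entries.append(entry)
        return entry

    def verify(self, signer_key: bytes) -> bool:
        """Recompute every hash and signature; any edit or reordering fails."""
        prev_hash = GENESIS
        for entry in self._entries:
            payload = _payload(entry["event"], entry["prev_hash"])
            expected_sig = hmac.new(signer_key, payload, hashlib.sha256).hexdigest()
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != hashlib.sha256(payload).hexdigest():
                return False
            if not hmac.compare_digest(entry["signature"], expected_sig):
                return False
            prev_hash = entry["hash"]
        return True
```

An override event with a diff and a rationale would be one `event` dictionary in this sketch. Editing it after the fact changes its hash and breaks the chain for that entry and every later one.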

AI Safety Research · Agent and Tool Ecosystem · BrightbeamAI · Collaborative Human-Agent Protocol · MCP

5 · arXiv · cs.AI · 24d ago

Governed Evolution of Agent Runtimes through Executable Operational Cognition

This paper proposes a framework for governed runtime evolution in multi-agent systems, formalizing agent-generated code artifacts as persistent runtime capabilities rather than transient outputs. It introduces HarnessMutation, a lifecycle-aware mechanism for runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. The framework models agent self-modification as a bounded, observable, and auditable process over persistent operational memory, building on prior 'Code as Agent Harness' work.

AI Safety Research · Agent and Tool Ecosystem · Executable Operational Cognition · Code as Agent Harness · multi-agent systems

6 · arXiv · cs.CL · 18d ago

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

This paper identifies a privacy vulnerability in tool-augmented language agents that speculatively issue future tool calls to reduce latency: these 'ghost tool calls' leak inferred user intent to external services before the agent commits to a branch, and cannot be unsent after the fact. The authors argue that timing, not authorization, is the core issue, and propose Speculative Tool Privacy Contracts, a runtime abstraction treating pre-commitment observation as a distinct first-class effect. A prototype runtime is implemented and twelve policies are evaluated across three corpora, finding that only policies that suppress or modify arguments or destinations at issue time actually reduce inference leakage.
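The idea of acting at issue time can be sketched as a filter that runs before a speculative call leaves the agent. The Python below is an illustration only, not one of the paper's twelve policies. `ToolCall`, `IssueTimePolicy`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    destination: str      # external service that would observe the call
    args: dict
    speculative: bool     # True if issued before the agent commits to a branch


@dataclass(frozen=True)
class IssueTimePolicy:
    """Hypothetical policy applied before a call is sent, not after."""
    sensitive_keys: frozenset
    blocked_destinations: frozenset = frozenset()

    def apply(self, call: ToolCall) -> Optional[ToolCall]:
        # Committed calls pass through unchanged.
        if not call.speculative:
            return call
        # Destination suppression: the speculative call is never sent.
        if call.destination in self.blocked_destinations:
            return None
        # Argument suppression: strip fields that reveal inferred intent.
        redacted = {k: v for k, v in call.args.items() if k not in self.sensitive_keys}
        return ToolCall(call.destination, redacted, speculative=True)
```

Filtering before the send matters because, as the summary notes, a call that has already reached an external service cannot be unsent.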

Inference Economics · AI Safety Research · tool-augmented language agents · Speculative Tool Privacy Contracts · Ghost Tool Calls

5 · GitHub Trending · 29d ago

Microsoft Agent Governance Toolkit: Policy Enforcement and Zero-Trust Security for Autonomous AI Agents

Microsoft has published an open-source Agent Governance Toolkit on GitHub covering policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. The toolkit claims full coverage of the OWASP Agentic Top 10 security risks. It has accumulated 1,828 stars with 113 added today, indicating active community interest. This positions Microsoft as a contributor to emerging standards for safe agentic AI deployment.

AI Safety Research · Enterprise Deployment Patterns · execution sandboxing · zero-trust identity · Microsoft

6 · arXiv · cs.AI · 15d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

6 · arXiv · cs.AI · 8d ago

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols (A2A for task management and MCP for tool access) rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility · AgentBeats · MCP