Almanac
← Events
5arXiv cs.AI (Artificial Intelligence)·47h ago

Sovereign Execution Brokers: Certificate-Bound Authority Enforcement for Agentic Control Planes

This arXiv paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary that separates proposal, admission, and execution phases for autonomous agents operating in cloud and infrastructure environments. SEB consumes certificates from a companion Sovereign Assurance Boundary (SAB), verifies mutations against certified execution contracts, mints scoped short-lived identities, and produces signed audit records. The architecture addresses a gap in existing access-control and assurance systems by providing a mandatory enforcement point at the moment of infrastructure mutation. A prototype is evaluated on AWS and Kubernetes, measuring latency, revocation propagation, drift detection, and fault-injection security.

Related guides (4)

Related events (8)

6arXiv · cs.CL·5d ago·source ↗

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

6arXiv · cs.AI·1mo ago·source ↗

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

This paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agent runtimes, defining it as a four-part contract (proposer, verifier, commit step, reject signal) governing how LLM outputs become system actions. The authors organize agent runtime design around Coordination, State, and Control concerns, presenting a catalog of six runtime patterns applicable to conversational, autonomous, and long-horizon agents. A five-step pattern-selection methodology and diagnostic procedure mapping production failures to pattern weaknesses are contributed, along with a newly named failure mode—replay divergence—where LLM consumers of deterministic event logs produce inconsistent outputs across model versions or prompt changes. The paper argues that as model variance decreases, architectural pattern choice and SDB strength become the dominant reliability levers.

6arXiv · cs.CL·11d ago·source ↗

CHAP: Collaborative Human-Agent Protocol for structured human-AI accountability in multi-agent deployments

Researchers from BrightbeamAI introduce CHAP (Collaborative Human-Agent Protocol), a protocol specification for formalizing human-agent collaboration in production multi-agent systems. CHAP defines shared workspaces, structured override events with diffs and rationales, non-repudiable signed approvals, and an append-only evidence log, filling a gap left by MCP (tool access) and A2A (agent-to-agent interoperability). The protocol ships with a reference implementation, conformance suite, and worked examples. It targets high-stakes deployments in domains like clinical decisions, contracts, and code where human judgment must be auditable and replayable.

5arXiv · cs.AI·24d ago·source ↗

Governed Evolution of Agent Runtimes through Executable Operational Cognition

This paper proposes a framework for governed runtime evolution in multi-agent systems, formalizing agent-generated code artifacts as persistent runtime capabilities rather than transient outputs. It introduces HarnessMutation, a lifecycle-aware mechanism for runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. The framework models agent self-modification as a bounded, observable, and auditable process over persistent operational memory, building on prior 'Code as Agent Harness' work.

6arXiv · cs.CL·18d ago·source ↗

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

This paper identifies a privacy vulnerability in tool-augmented language agents that speculatively issue future tool calls to reduce latency: these 'ghost tool calls' leak inferred user intent to external services before the agent commits to a branch, and cannot be unsent after the fact. The authors argue that timing—not authorization—is the core issue, and propose Speculative Tool Privacy Contracts, a runtime abstraction treating pre-commitment observation as a distinct first-class effect. A prototype runtime is implemented and twelve policies are evaluated across three corpora, finding that only issue-time argument or destination suppression/modification actually reduces inference leakage.

5Github Trending·29d ago·source ↗

Microsoft Agent Governance Toolkit: Policy Enforcement and Zero-Trust Security for Autonomous AI Agents

Microsoft has published an open-source Agent Governance Toolkit on GitHub covering policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. The toolkit claims full coverage of the OWASP Agentic Top 10 security risks. It has accumulated 1,828 stars with 113 added today, indicating active community interest. This positions Microsoft as a contributor to emerging standards for safe agentic AI deployment.

6arXiv · cs.AI·15d ago·source ↗

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

6arXiv · cs.AI·8d ago·source ↗

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.