
The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

paper · active · provisional · the-unfireable-safety-kernel-execution-time-ai-alignment-for-ai-agents-and-other-escapable-ai-systems-7e160207 · 1 event · first seen 11h ago


Co-occurring entities

Constitutional AI · Kani · Z3 · Unfireable Safety Kernel

More like this (12)

Unfireable Safety Kernel · Concrete Problems in AI Safety · Efficient and Sound Probabilistic Verification for AI Agents · speculative execution (AI agents) · Explosion AI · AI Safety via Debate · UK Artificial Intelligence Safety Institute · Towards a Science of AI Agent Reliability · AI alignment · third-party AI evaluations · AI Existential Risk · Fission-AI

Recent events (1)

7 · arXiv · cs.AI · 11h ago

Unfireable Safety Kernel: Formal execution-time alignment layer for escapable AI agents

A new arXiv preprint introduces the concept of 'escapable AI systems' (agents with enough reach into their own runtime to subvert in-process safety controls) and proposes a four-property architectural framework for enforcing safety externally. The authors present the Unfireable Safety Kernel, a Rust reference implementation whose fail-closed invariants are machine-checked with SMT solving (Z3) and bounded model checking (Kani). They evaluate it against a self-improving world-model adversary over 7,240 authorization attempts and report zero successful bypasses. The work positions this 'execution-time alignment' layer as a complement to training-time approaches such as RLHF and Constitutional AI, arguing that any control inside the agent's address space is fundamentally reachable by adversarial inputs.
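To make the term "fail-closed" concrete, the following is a minimal illustrative sketch in Rust. It is not taken from the preprint: the names `Kernel`, `Decision`, `with_policy`, `without_policy`, and `authorize` are hypothetical, and the allowlist policy is an assumed stand-in for whatever policy model the authors actually use. The sketch shows only the general idea that every path which is not an explicit allow, including a missing policy, resolves to a denial.

```rust
use std::collections::HashSet;

/// Outcome of an authorization check (hypothetical type).
/// There is no "unknown" variant: anything that is not an
/// explicit `Allow` must be expressed as a `Deny`.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny(&'static str),
}

/// Hypothetical kernel that would run outside the agent's process.
/// `policy` is `None` when the policy could not be loaded.
pub struct Kernel {
    policy: Option<HashSet<String>>,
}

impl Kernel {
    /// Build a kernel with an allowlist of permitted action names.
    pub fn with_policy(allowed: &[&str]) -> Self {
        Kernel {
            policy: Some(allowed.iter().map(|s| s.to_string()).collect()),
        }
    }

    /// Build a kernel whose policy failed to load (assumed failure mode).
    pub fn without_policy() -> Self {
        Kernel { policy: None }
    }

    /// Fail-closed check: only an action present in a loaded policy
    /// is allowed; every other case is denied.
    pub fn authorize(&self, action: &str) -> Decision {
        match &self.policy {
            Some(allowed) if allowed.contains(action) => Decision::Allow,
            Some(_) => Decision::Deny("action not in allowlist"),
            None => Decision::Deny("policy unavailable"),
        }
    }
}
```

In this sketch, a kernel built with `Kernel::with_policy(&["read_file"])` allows `read_file` and denies anything else, while `Kernel::without_policy()` denies every request. Because the `match` is exhaustive and has a single `Allow` arm, the default-deny property is straightforward to state as an invariant, which is the kind of property that tools such as Kani or Z3 could be used to check.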

AI Safety Research · Agent and Tool Ecosystem · Constitutional AI · Kani · Z3