DeltaBox: Millisecond-Level Sandbox Checkpoint/Rollback for Stateful AI Agents
DeltaBox introduces a new OS-level abstraction called DeltaState that enables change-based (delta) checkpoint and rollback for AI agent sandboxes: each operation records only what changed instead of duplicating the full state. Two co-designed OS mechanisms, DeltaFS for filesystem state and DeltaCR for process state, reduce checkpoint latency to about 14 ms and rollback latency to about 5 ms, orders of magnitude faster than existing approaches. Evaluations on SWE-bench and RL micro-benchmarks demonstrate that agents can explore substantially more nodes under fixed time budgets, directly enabling deeper test-time tree search and large-scale RL fan-outs.
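To make the change-based idea concrete, here is a minimal sketch of delta checkpointing over an in-memory key-value store. It is not DeltaFS or DeltaCR, whose internals the summary does not describe. It swaps in a plain undo log, and the names `DeltaStore`, `write`, `checkpoint`, and `rollback` are hypothetical. The point it illustrates is that a checkpoint can cost O(1) and a rollback can cost time proportional to the number of changes, not to the size of the state.

```python
from typing import Dict, List, Tuple

_MISSING = object()  # sentinel: the key did not exist before the write


class DeltaStore:
    """Toy stand-in for sandbox state with undo-log checkpoints (hypothetical)."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self._undo: List[Tuple[str, object]] = []  # (key, previous value)
        self._marks: List[int] = []                # undo-log length per checkpoint

    def write(self, key: str, value: str) -> None:
        # Record only the value being overwritten, never the whole state.
        self._undo.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def checkpoint(self) -> int:
        # O(1): remember where the undo log currently ends.
        self._marks.append(len(self._undo))
        return len(self._marks) - 1

    def rollback(self, checkpoint_id: int) -> None:
        # Undo changes in reverse order until the log is back at the mark.
        mark = self._marks[checkpoint_id]
        while len(self._undo) > mark:
            key, old = self._undo.pop()
            if old is _MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old
        # Checkpoints taken after this one are no longer reachable.
        del self._marks[checkpoint_id + 1:]


box = DeltaStore()
box.write("main.py", "v1")
cp = box.checkpoint()
box.write("main.py", "v2")   # explore a branch
box.write("test.py", "t1")
box.rollback(cp)             # abandon the branch
print(box.state)
```

A tree-search agent would call `checkpoint()` before expanding a node and `rollback()` when backtracking, so the cost of each step tracks the size of the edit and not the size of the sandbox.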
Related events (8)
Execution-State Capsules: Graph-bound checkpoint/restore for low-latency on-device LLM serving
Researchers introduce execution-state capsules, a checkpoint-and-restore mechanism that snapshots the complete execution state (KV cache, recurrent state, convolution state, MTP state, and metadata) at graph boundaries rather than managing only KV fragments. The FlashRT runtime implements this on NVIDIA CUDA with sub-millisecond GPU-resident snapshot/restore, achieving TTFT speedups of 3.9x at 2k tokens and 27x at 16k tokens over cold prefill on an RTX 5090. The work targets low-latency, small-batch, on-device physical-AI scenarios—interactive agents, speech systems, robot policies—where branching, rollback, and re-entry are common. This is positioned as complementary to, not a replacement for, high-throughput KV-cache serving.
Introducing the Stateful Runtime Environment for Agents in Amazon Bedrock
OpenAI and Amazon Web Services are launching a Stateful Runtime Environment for Agents in Amazon Bedrock, enabling persistent orchestration, memory, and secure execution for multi-step AI agent workflows. The integration brings OpenAI's models into AWS's managed agent infrastructure with stateful capabilities. This represents a significant enterprise deployment partnership between two major AI ecosystem players.
Distributionally robust optimization framework for probabilistic runtime verification of AI agents
A new arXiv preprint introduces a sound and efficient framework for verifying probabilistic security policies for AI agents operating in complex digital environments, addressing limitations of prior Datalog-based approaches that assumed deterministic policies or predicate independence. The method uses distributionally robust optimization to compute sound upper bounds on policy violation probability without requiring independence assumptions between predicates. Evaluated on benchmarks for terminal and tool-calling agents, the approach outperforms prior art on the security-utility trade-off.
Giving Agents Computers — Ivan Burazin, Daytona
Latent Space interviews Daytona CEO Ivan Burazin about the company's infrastructure for giving AI agents secure compute environments. The discussion covers Daytona's bare metal sandbox architecture, 850K daily runs, 74% month-over-month growth, and their approach to RL-based evaluations for agent workloads. The piece positions Daytona as part of an emerging 'agent cloud' category providing isolated execution environments for autonomous AI systems.
MaskClaw: Edge-Side Privacy Arbitration System for GUI Agents with Behavior-Driven Skill Evolution
MaskClaw is an edge-side privacy arbitration framework for GUI agents that intercepts screenshots before they leave a trusted environment, applying Allow/Mask/Ask decisions based on local visual evidence and user-specific policy memory. The system addresses the gap where static PII detectors miss context-dependent privacy boundaries and cloud-side VLMs may upload raw screens before deciding what to protect. The authors introduce P-GUI-Evo, a new benchmark built from real UI patterns and sanitized labels, and demonstrate that pattern matching, cloud reasoning, and routing alone each exhibit systematic failure modes. The artifact is open-sourced on GitHub.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.
DABStep: Data Agent Benchmark for Multi-step Reasoning
Hugging Face introduces DABStep, a benchmark designed to evaluate data agents on multi-step reasoning tasks. The benchmark targets agentic systems that must perform complex, sequential data operations rather than single-step queries. It aims to fill a gap in evaluation tooling for realistic data analysis workflows involving tool use and chained reasoning.
claude-mem: Persistent Cross-Session Memory Layer for AI Coding Agents
claude-mem is an open-source TypeScript library that provides persistent context across sessions for AI coding agents. It captures agent activity during sessions, compresses it using AI, and injects relevant context into future sessions. The project claims compatibility with Claude Code, OpenAI Codex, Gemini, GitHub Copilot, and other coding agents. At the time of listing, the repository had accumulated 78,579 stars, 319 of them added that day, indicating strong community traction.


