7 · arXiv cs.AI (Artificial Intelligence) · 29d ago

DeltaBox: Millisecond-Level Sandbox Checkpoint/Rollback for Stateful AI Agents

DeltaBox introduces a new OS-level abstraction called DeltaState that enables change-based (delta) checkpoint and rollback for AI agent sandboxes, rather than duplicating full state on each operation. Two co-designed OS mechanisms—DeltaFS for filesystem state and DeltaCR for process state—reduce checkpoint latency to ~14ms and rollback to ~5ms, orders of magnitude faster than existing approaches. Evaluations on SWE-bench and RL micro-benchmarks demonstrate that agents can explore substantially more nodes under fixed time budgets, directly enabling deeper test-time tree search and large-scale RL fan-outs.
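The difference between change-based and full-copy checkpointing can be illustrated with a toy in-memory store. This is only a sketch of the general idea, not DeltaFS or DeltaCR: the `DeltaStore` class, its methods, and the file names are hypothetical, and the real system works at the OS level on filesystem and process state. Here a checkpoint is just a position in an undo log, so taking one copies nothing, and rollback cost scales with the number of changes made since the checkpoint rather than with total state size.

```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel: the key did not exist before the change


class DeltaStore:
    """Toy key-value state with change-based checkpoint and rollback (hypothetical)."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        # Undo log: one (key, previous value) entry per change.
        self._undo: List[Tuple[str, Any]] = []

    def write(self, key: str, value: Any) -> None:
        # Record the old value (or its absence) before overwriting.
        self._undo.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def delete(self, key: str) -> None:
        # Only log a change if the key actually exists.
        if key in self.state:
            self._undo.append((key, self.state.pop(key)))

    def checkpoint(self) -> int:
        # A checkpoint is a position in the undo log; no state is copied.
        return len(self._undo)

    def rollback(self, checkpoint: int) -> None:
        # Undo changes in reverse order until the log is back at the checkpoint.
        while len(self._undo) > checkpoint:
            key, old = self._undo.pop()
            if old is _MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old


store = DeltaStore()
store.write("main.py", "v1")
cp = store.checkpoint()
store.write("main.py", "v2")   # agent explores a branch
store.write("test.py", "new")
store.rollback(cp)             # abandon the branch
print(store.state)             # {'main.py': 'v1'}
```

In a tree search, an agent would take one checkpoint per node and roll back before expanding a sibling, which is why cheap rollback translates into more nodes explored under a fixed time budget.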

Training Infrastructure · Inference Economics · Agent and Tool Ecosystem · DeltaFS · DeltaState · SWE-bench · DeltaCR · DeltaBox · copy-on-write

Related guides (3)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

5 · arXiv · cs.LG · 46h ago

Execution-State Capsules: Graph-bound checkpoint/restore for low-latency on-device LLM serving

Researchers introduce execution-state capsules, a checkpoint-and-restore mechanism that snapshots the complete execution state (KV cache, recurrent state, convolution state, MTP state, and metadata) at graph boundaries rather than managing only KV fragments. The FlashRT runtime implements this on NVIDIA CUDA with sub-millisecond GPU-resident snapshot/restore, achieving TTFT speedups of 3.9x at 2k tokens and 27x at 16k tokens over cold prefill on an RTX 5090. The work targets low-latency, small-batch, on-device physical-AI scenarios—interactive agents, speech systems, robot policies—where branching, rollback, and re-entry are common. This is positioned as complementary to, not a replacement for, high-throughput KV-cache serving.

Training Infrastructure · Inference Economics · Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving · DGX Spark · Jetson AGX Thor

7 · OpenAI Blog · 1mo ago

Introducing the Stateful Runtime Environment for Agents in Amazon Bedrock

OpenAI and Amazon Web Services are launching a Stateful Runtime Environment for Agents in Amazon Bedrock, enabling persistent orchestration, memory, and secure execution for multi-step AI agent workflows. The integration brings OpenAI's models into AWS's managed agent infrastructure with stateful capabilities. This represents a significant enterprise deployment partnership between two major AI ecosystem players.

Inference Economics · Enterprise Deployment Patterns · Amazon Bedrock · Stateful Runtime Environment · OpenAI

5 · arXiv · cs.AI · 46h ago

Distributionally robust optimization framework for probabilistic runtime verification of AI agents

A new arXiv preprint introduces a sound and efficient framework for verifying probabilistic security policies for AI agents operating in complex digital environments, addressing limitations of prior Datalog-based approaches that assumed deterministic policies or predicate independence. The method uses distributionally robust optimization to compute sound upper bounds on policy violation probability without requiring independence assumptions between predicates. Evaluated on benchmarks for terminal and tool-calling agents, the approach outperforms prior art on the security-utility trade-off.

AI Safety Research · Agent and Tool Ecosystem · Datalog · Efficient and Sound Probabilistic Verification for AI Agents · distributionally robust optimization

5 · Latent Space · 1mo ago

Giving Agents Computers — Ivan Burazin, Daytona

Latent Space interviews Daytona CEO Ivan Burazin about the company's infrastructure for giving AI agents secure compute environments. The discussion covers Daytona's bare metal sandbox architecture, 850K daily runs, 74% month-over-month growth, and their approach to RL-based evaluations for agent workloads. The piece positions Daytona as part of an emerging 'agent cloud' category providing isolated execution environments for autonomous AI systems.

Training Infrastructure · Inference Economics · Daytona · Ivan Burazin · bare metal sandboxes

6 · arXiv · cs.CL · 23d ago

MaskClaw: Edge-Side Privacy Arbitration System for GUI Agents with Behavior-Driven Skill Evolution

MaskClaw is an edge-side privacy arbitration framework for GUI agents that intercepts screenshots before they leave a trusted environment, applying Allow/Mask/Ask decisions based on local visual evidence and user-specific policy memory. The system addresses the gap where static PII detectors miss context-dependent privacy boundaries and cloud-side VLMs may upload raw screens before deciding what to protect. The authors introduce P-GUI-Evo, a new benchmark built from real UI patterns and sanitized labels, and demonstrate that pattern matching, cloud reasoning, and routing alone each exhibit systematic failure modes. The artifact is open-sourced on GitHub.

Evaluation and Benchmarking · AI Safety Research · visual language model · GUI Agents · MaskClaw

6 · arXiv · cs.AI · 15d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

5 · Hugging Face Blog · 1mo ago

DABStep: Data Agent Benchmark for Multi-step Reasoning

Hugging Face introduces DABStep, a benchmark designed to evaluate data agents on multi-step reasoning tasks. The benchmark targets agentic systems that must perform complex, sequential data operations rather than single-step queries. It aims to fill a gap in evaluation tooling for realistic data analysis workflows involving tool use and chained reasoning.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DABStep · Hugging Face

5 · GitHub Trending · 25d ago

claude-mem: Persistent Cross-Session Memory Layer for AI Coding Agents

claude-mem is an open-source TypeScript library that provides persistent context across sessions for AI coding agents. It captures agent activity during sessions, compresses it using AI, and injects relevant context into future sessions. The tool claims compatibility with Claude Code, OpenAI Codex, Gemini, GitHub Copilot, and other coding agents. The repository has accumulated 78,579 stars with 319 added today, indicating strong community traction.

Long Context Evolution · Agent and Tool Ecosystem · Claude Code · claude-mem · thedotmack