5arXiv cs.AI (Artificial Intelligence)·47h ago

Distributionally robust optimization framework for probabilistic runtime verification of AI agents

A new arXiv preprint introduces a sound and efficient framework for verifying probabilistic security policies for AI agents operating in complex digital environments, addressing limitations of prior Datalog-based approaches that assumed deterministic policies or predicate independence. The method uses distributionally robust optimization to compute sound upper bounds on policy violation probability without requiring independence assumptions between predicates. Evaluated on benchmarks for terminal and tool-calling agents, the approach outperforms prior art on the security-utility trade-off.

AI Safety Research Agent and Tool Ecosystem Datalog Efficient and Sound Probabilistic Verification for AI Agents distributionally robust optimization

Related guides (2)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Related events (8)

4arXiv · cs.AI·11d ago·source ↗

IA-VQC-DPC: Intervention-aware quantum predictive control with safety attribution for learned policies

A new arXiv preprint introduces Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), a framework that trains variational quantum circuit policies under a primal-dual intervention budget to penalize over-reliance on downstream safety filters (Control-Barrier-Function projections). The work also proposes a safety-attribution protocol that decomposes trajectory corrections into policy-level versus filter-level contributions, enabling measurement of whether a policy has genuinely learned safe behavior or is merely being silently repaired by its safety layer. Experiments on BOPTEST building-control emulators show the quantum policy achieves significantly lower pre-filter violations than a matched classical policy at equal parameter budget, with a notable negative result: a learned energy head is only safe when paired with a distribution-aware runtime guard.

Evaluation and Benchmarking AI Safety Research BOPTEST Intervention-Aware Variational Quantum Differentiable Predictive Control Control-Barrier-Function

6arXiv · cs.AI·3d ago·source ↗

VERITAS: Visual verification enables inference-time steering and autonomous improvement for robot policies

Researchers introduce VERITAS, a generator-verifier framework pairing a pre-trained generalist robot policy with a gradient-free visual verifier to steer actions at inference time without additional training. Verified rollouts are also used for offline self-improvement via fine-tuning, achieving performance gains comparable to expert demonstrations but without human intervention. The work demonstrates that inference-time verification is a scalable mechanism for autonomous policy improvement during deployment.

Inference Economics Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement VERITAS

5Ai Snake Oil·1mo ago·source ↗

New Paper: Towards a Science of AI Agent Reliability

A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.

Evaluation and Benchmarking AI Safety Research Towards a Science of AI Agent Reliability normaltech.ai AI Snake Oil +2 more

6arXiv · cs.AI·4d ago·source ↗

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

Evaluation and Benchmarking AI Safety Research Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations GAIA Open LLM Leaderboard +3 more

6arXiv · cs.AI·1mo ago·source ↗

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

This paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agent runtimes, defining it as a four-part contract (proposer, verifier, commit step, reject signal) governing how LLM outputs become system actions. The authors organize agent runtime design around Coordination, State, and Control concerns, presenting a catalog of six runtime patterns applicable to conversational, autonomous, and long-horizon agents. A five-step pattern-selection methodology and diagnostic procedure mapping production failures to pattern weaknesses are contributed, along with a newly named failure mode—replay divergence—where LLM consumers of deterministic event logs produce inconsistent outputs across model versions or prompt changes. The paper argues that as model variance decreases, architectural pattern choice and SDB strength become the dominant reliability levers.

Evaluation and Benchmarking Enterprise Deployment Patterns replay divergence human-in-the-loop pattern hierarchical delegation pattern +4 more

6arXiv · cs.CL·2d ago·source ↗

GraphPO: Graph-based Policy Optimization reduces redundancy in LLM reasoning RL

GraphPO is a new reinforcement learning framework that represents reasoning rollouts as directed acyclic graphs rather than independent chains or trees, merging semantically equivalent reasoning paths into equivalence classes to share suffixes and reduce redundant exploration. The approach assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, deriving process supervision from outcome rewards. Experiments on three LLMs across reasoning and agentic search benchmarks show consistent improvements over chain- and tree-based baselines under equal token or response budgets. The method also provides theoretical guarantees on reduced advantage-estimation variance.

Frontier Model Releases Alignment and RLHF GraphPO GraphPO: Graph-based Policy Optimization for Reasoning Models

5arXiv · cs.LG·11d ago·source ↗

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

Alignment and RLHF Divergence Regularized Policy Optimization GRPO PPO +1 more

7arXiv · cs.AI·23d ago·source ↗

Calibrated Collective Oversight (CCO): Scalable Oversight with Finite-Time Statistical Guarantees

This paper introduces Calibrated Collective Oversight (CCO), a framework for maintaining human oversight of agentic AI systems that may exceed human capabilities. CCO aggregates diverse scoring functions into a conservatism penalty inspired by Attainable Utility Preservation, then calibrates this penalty online via Conformal Decision Theory to ensure undesirable outcomes stay below a user-specified threshold with finite-time bounds and no distributional assumptions. Evaluated on a modified SWE-bench (adversarially misaligned agent) and MACHIAVELLI (ethical violations), CCO allows weaker overseers to constrain stronger agents while preserving reward, with empirical violation rates closely matching specified targets.

Evaluation and Benchmarking AI Safety Research Calibrated Collective Oversight (CCO)Attainable Utility Preservation Conformal Decision Theory +4 more