Verifiable Belief-Space Neural Safety Filters for Interactive Robotics via Conformal Prediction
This paper proposes an algorithmic framework to certify high-probability safety of belief-space safety filters (BeliefSF) in interactive robotics, addressing the challenge that neural approximations and runtime inference errors make formal guarantees difficult. The approach uses conformal prediction focused on regions where inference is reliable, preserving standard sample complexity while certifying a less conservative filter. Evaluation on a simulated human-vehicle interaction benchmark demonstrates the method produces significantly more permissive safety guarantees than a standard conformal prediction baseline.
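For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of standard split conformal calibration, not the paper's region-focused variant. The function name `conformal_quantile` and the notion of a scalar "nonconformity score" per calibration sample (for example, an error in a learned safety value) are assumptions made for illustration. Under exchangeability of calibration and test scores, a new score falls at or below the returned threshold with probability at least 1 - alpha.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha.

    scores: nonconformity scores from a held-out calibration set
            (hypothetical; any real-valued score works).
    alpha:  target miscoverage rate, 0 < alpha < 1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(scores)
    if n == 0:
        raise ValueError("at least one calibration score is required")
    # Finite-sample correction: use the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples to certify this level; the bound is vacuous.
        return math.inf
    return sorted(scores)[k - 1]


if __name__ == "__main__":
    calibration_scores = list(range(1, 20))  # 19 illustrative scores
    print(conformal_quantile(calibration_scores, alpha=0.1))
```

The threshold grows as alpha shrinks or as the calibration scores get larger, which is one way to see why calibrating over all states, including those where inference is unreliable, can yield a more conservative certificate.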
Related events (8)
SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLMs via RL-Driven Prompt Optimisation
SafeCtrl-RL is a framework for controlling LLM safety at inference time without retraining or modifying model parameters. It formulates dialogue generation as a sequential decision process where an RL agent dynamically selects prompt adjustment strategies based on contextual feedback, iteratively suppressing unsafe outputs. The authors frame this as 'inference-time behavioural unlearning' and report improvements in safety and response quality across multiple LLMs and unsafe dialogue scenarios, outperforming existing prompt-based optimisation baselines.
IA-VQC-DPC: Intervention-aware quantum predictive control with safety attribution for learned policies
A new arXiv preprint introduces Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), a framework that trains variational quantum circuit policies under a primal-dual intervention budget to penalize over-reliance on downstream safety filters (Control-Barrier-Function projections). The work also proposes a safety-attribution protocol that decomposes trajectory corrections into policy-level versus filter-level contributions, making it possible to measure whether a policy has genuinely learned safe behavior or is merely being silently repaired by its safety layer. Experiments on BOPTEST building-control emulators show the quantum policy achieves significantly lower pre-filter violations than a matched classical policy at equal parameter budget. The paper also reports a notable negative result: a learned energy head is safe only when paired with a distribution-aware runtime guard.
VLESA: Vision-Language Embodied Safety Agent for Real-Time Human Activity Monitoring
Researchers introduce VLESA, a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. The system addresses intent-dependent safety — where identical actions can be safe or dangerous depending on context — using a goal-conditioned safety Q-filter trained via GRPO and an intent-action prediction agent. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy than baselines, with the Q-filter improving action safety by over 41 percentage points through goal-conditioned constrained decoding.
Improving Model Safety Behavior with Rule-Based Rewards
OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This represents a practical contribution to alignment methodology from a Tier 1 lab.
Deliberative Alignment: Reasoning Enables Safer Language Models
OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.
Safety Gym: OpenAI Releases RL Safety Constraint Benchmark Suite
OpenAI released Safety Gym, a suite of environments and tools designed to measure progress in training reinforcement learning agents that respect safety constraints during training. The toolkit targets the challenge of constrained RL, where agents must optimize objectives without violating specified safety boundaries. This represents an early formal effort by OpenAI to provide standardized benchmarking infrastructure for safe RL research.
AI Safety via Debate
OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than themselves. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.
VERITAS: Visual verification enables inference-time steering and autonomous improvement for robot policies
Researchers introduce VERITAS, a generator-verifier framework pairing a pre-trained generalist robot policy with a gradient-free visual verifier to steer actions at inference time without additional training. Verified rollouts are also used for offline self-improvement via fine-tuning, achieving performance gains comparable to expert demonstrations but without human intervention. The work demonstrates that inference-time verification is a scalable mechanism for autonomous policy improvement during deployment.

