6 · arXiv cs.AI (Artificial Intelligence) · 11d ago

WorldKernel: Formalizing world models as coupling kernels over counterfactual worlds

A new arXiv preprint identifies a structural failure mode in prediction-based world models: strong predictors can recover the diagonal of a counterfactual coupling kernel (ordinary posteriors) but systematically fail on off-diagonal cross-world couplings, collapsing to point estimates that are sometimes provably inadmissible. The authors formalize a world model as a positive semidefinite kernel K(T,T') over admissible possible worlds, showing the off-diagonal encodes counterfactual structure that more data cannot resolve. They demonstrate that PSD constraints provide partial identification bounds computable in polynomial time, that ontological axioms tighten these bounds, and that targeted constraint learning ('scars') closes the gap faster than untargeted approaches. The work has implications for causal reasoning in AI systems and the theoretical limits of learned world models.
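To make the partial-identification claim concrete, here is a minimal toy sketch; it is not the preprint's construction. It assumes a kernel over three hypothetical worlds, normalized to unit diagonal, with two off-diagonal couplings known (`a`, `b`) and one unknown (`x`). Positive semidefiniteness alone confines `x` to a closed-form interval, so any point estimate outside that interval is inadmissible. The function names and the numeric values are illustrative assumptions.

```python
import numpy as np

def psd_offdiag_bounds(a, b):
    """Interval for the unknown coupling x = K[0, 2] in a 3x3 unit-diagonal
    kernel with K[0, 1] = a and K[1, 2] = b (toy setup, assumed here).
    Follows from det(K) >= 0: 1 + 2abx - a^2 - b^2 - x^2 >= 0."""
    slack = np.sqrt((1.0 - a**2) * (1.0 - b**2))
    return a * b - slack, a * b + slack

def kernel(a, b, x):
    """Assemble the symmetric 3x3 kernel with unit diagonal."""
    return np.array([[1.0, a, x],
                     [a, 1.0, b],
                     [x, b, 1.0]])

def is_psd(K, tol=1e-9):
    """Check positive semidefiniteness via the smallest eigenvalue."""
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

a, b = 0.8, 0.6                      # hypothetical known couplings
lo, hi = psd_offdiag_bounds(a, b)    # admissible range for the unknown x

print(f"admissible x in [{lo:.3f}, {hi:.3f}]")
print("x = 0.50 admissible:", is_psd(kernel(a, b, 0.5)))
print("x = -0.20 admissible:", is_psd(kernel(a, b, -0.2)))
```

With these values the interval is approximately [0, 0.96], so a predictor that reports x = -0.2 has produced a coupling no valid kernel can realize. Larger kernels with several unknown entries have no such closed form; bounding them would typically be posed as a semidefinite program, which is an assumption about how one might proceed rather than a detail taken from the preprint.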

Evaluation and Benchmarking · AI Safety Research · WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 3d ago

Looped World Models introduce iterative latent depth as a new scaling axis for world simulation

A new arXiv preprint introduces Looped World Models (LoopWM), a parameter-shared transformer architecture that iteratively refines latent environment states to achieve up to 100x parameter efficiency over conventional world models. The approach uses adaptive computation to scale depth dynamically per prediction step, addressing the tension between long-horizon simulation fidelity and deployment cost. The authors position iterative latent depth as a new scaling axis orthogonal to model size and training data.

Training Infrastructure · Frontier Model Releases · Looped World Models · LoopWM · +2 more

5 · arXiv · cs.AI · 46h ago

DeepSWIP: Counterfactual reasoning for neural probabilistic logic programs via quotient-WMC

DeepSWIP introduces a single-world counterfactual semantics for DeepProbLog, enabling causal inference over neurosymbolic programs that combine neural perception with probabilistic logic. The approach uses neural materialization to reduce neural predicates to standard ProbLog choices, then applies Single World Intervention Programs (SWIPs) and weighted model counting to compute exact counterfactuals from a single transformed program. Experiments on MPI3D validate the method against a DeepTwin construction across 12,000 queries and show a 2.14× inference speedup, while a SUMO HOV experiment demonstrates that neural calibration degradation biases plug-in causal estimates and that a correctly scoped AIPW estimator removes most first-order bias.

Evaluation and Benchmarking · AI Safety Research · DeepSWIP · MPI3D · DeepProbLog · +1 more

5 · arXiv · cs.LG · 19d ago

KAFFEE: Addressing the Dynamic-Probabilistic Consistency Gap in Chaotic Surrogate Modeling

This paper identifies a 'dynamic-probabilistic consistency (DPC) gap' in dynamical systems reconstruction (DSR), where optimizing finite-horizon probabilistic objectives can degrade learned dynamics or decouple predictive uncertainty from local tangent dynamics. Three failure mechanisms are isolated: core collapse, noise masking, and blind uncertainty. The authors propose KAFFEE, a differentiable extended Kalman filter-based training framework that evaluates likelihood on local predictive residuals while transporting covariance through learned Jacobians, reducing these failure modes on stochastic hyperchaotic Lorenz-96 and across 13 chaotic systems when adapting a DSR foundation model.

Evaluation and Benchmarking · AI Safety Research · Dynamic-Probabilistic Consistency Gap · Extended Kalman Filter · Lorenz-96 · +3 more

4 · arXiv · cs.LG · 9d ago

Latent World Recovery: multimodal learning framework for missing modalities in bioscience

A new arXiv preprint introduces Latent World Recovery (LWR), a framework for multimodal learning when some modalities are unavailable at training or inference time. LWR aligns modality-specific embeddings in a shared latent space and fuses only available modalities, avoiding explicit reconstruction of missing ones. The approach is evaluated on incomplete multi-omics benchmarks for cancer phenotype classification and survival prediction, demonstrating robustness under partial observation.

Multimodal Progress · Latent World Recovery for Multimodal Learning with Missing Modalities · Latent World Recovery

4 · arXiv · cs.AI · 4d ago

Causal DAG model for when AI systems should engage Theory of Mind in conflict scenarios

A new arXiv preprint proposes a structural causal model (formalized as a directed acyclic graph) that treats Theory of Mind as a conditionally activated mechanism rather than an always-on capacity in AI systems. The model specifies exogenous situational and agent-level conditions, five endogenous mediators, and three causal pathways (tractability, reasoning-depth, enabling-cause) leading to an epistemic accuracy outcome. The work targets human-machine teaming in conflict contexts, offering a resource-rational decision procedure for when AI should engage social reasoning. Simulation validation and ethical considerations for conflict-optimized mentalizing are discussed.

AI Safety Research · Agent and Tool Ecosystem · A Causal Model of Theory of Mind in Conflict for Artificial Intelligence

7 · arXiv · cs.AI · 29d ago

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

Evaluation and Benchmarking · AI Safety Research · Invariant Risk Minimization · Matching Principle · Qwen2.5-7B · +5 more

7 · arXiv · cs.CL · 11d ago

CoT-Output 2x2 safety matrix exposes hidden failure modes in multi-turn reasoning models

Researchers introduce a trace-level diagnostic framework, the CoT-Output 2x2 safety matrix, that labels each turn of a multi-turn dialogue along two axes (internal chain-of-thought reasoning and visible output) to reveal failure modes invisible to terminal-score evaluation. The framework identifies four failure cells, including 'alignment faking' and a novel 'context-injection failure' in which safe internal reasoning coexists with harmful visible output. Evaluating three distilled reasoning models across five oversight conditions on 6,750 turn-level observations, the study finds an 'oversight paradox': explicit monitoring cues increase alignment-faking rates. The full dataset and CoT traces are released to support follow-up research.

Evaluation and Benchmarking · AI Safety Research · When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models · alignment faking · CoT-Output 2x2 safety matrix · +1 more

5 · arXiv · cs.AI · 1mo ago

WorldString: Actionable World Representation via Neural Architecture for Object State Modeling

This paper proposes WorldString, a neural architecture designed to model the state manifold of real-world objects by learning from point clouds or RGB-D video streams. Unlike prior approaches that rely on video generation or dynamic scene reconstruction, WorldString explicitly models object action states in a unified, principled framework. It is positioned as a foundational building block for physical world models, functioning as a versatile digital twin. Its fully differentiable structure is intended to enable integration with policy learning and neural dynamics.

Agent and Tool Ecosystem · Multimodal Progress · WorldString · point cloud learning · physical world model · +2 more