5 · arXiv cs.AI (Artificial Intelligence) · 47h ago

DeepSWIP: Counterfactual reasoning for neural probabilistic logic programs via quotient-WMC

DeepSWIP introduces a single-world counterfactual semantics for DeepProbLog, enabling causal inference over neurosymbolic programs that combine neural perception with probabilistic logic. The approach uses neural materialization to reduce neural predicates to standard ProbLog choices, then applies Single World Intervention Programs (SWIPs) and weighted model counting to compute exact counterfactuals from a single transformed program. Experiments on MPI3D validate the method against a DeepTwin construction across 12,000 queries and show a 2.14× inference speedup. A SUMO HOV experiment shows that degraded neural calibration biases plug-in causal estimates, and that a correctly scoped AIPW estimator removes most of the first-order bias.
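For readers unfamiliar with the plug-in versus AIPW distinction mentioned above, here is a minimal sketch of the textbook augmented inverse-probability-weighted (AIPW) estimator of an average treatment effect. It is not DeepSWIP's implementation and does not model the paper's scoping conditions. The function names `plug_in_ate` and `aipw_ate`, and the toy inputs, are assumptions made for illustration only.

```python
def plug_in_ate(mu1, mu0):
    """Plug-in estimate: average difference of the outcome-model predictions."""
    return sum(m1 - m0 for m1, m0 in zip(mu1, mu0)) / len(mu1)


def aipw_ate(y, t, e, mu1, mu0):
    """Textbook AIPW estimate of the average treatment effect.

    y   : observed outcomes
    t   : binary treatment indicators (0 or 1)
    e   : estimated propensities P(t = 1 | x), each strictly between 0 and 1
    mu1 : outcome-model predictions under treatment
    mu0 : outcome-model predictions under control
    """
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e, mu1, mu0):
        plug_in = m1 - m0
        # Residual correction: reweights the outcome model's errors by the
        # inverse propensity of the arm that was actually observed.
        correction = ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
        total += plug_in + correction
    return total / len(y)


# Toy data (hypothetical): a deliberately useless outcome model (all zeros)
# paired with correct propensities of 0.5.
y = [1.0, 0.0, 1.0, 1.0]
t = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
zeros = [0.0, 0.0, 0.0, 0.0]

print(plug_in_ate(zeros, zeros))        # the plug-in estimate inherits the model's error
print(aipw_ate(y, t, e, zeros, zeros))  # the correction term falls back to IPW weighting
```

When the outcome model is exact, the correction term vanishes and AIPW coincides with the plug-in estimate. When the outcome model is off but the propensities are right, the correction term compensates, which is the general mechanism behind the bias reduction the summary describes.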

Evaluation and Benchmarking AI Safety Research DeepSWIP MPI3D DeepProbLog Single World Intervention Programs

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 11d ago

WorldKernel: Formalizing world models as coupling kernels over counterfactual worlds

A new arXiv preprint identifies a structural failure mode in prediction-based world models: strong predictors can recover the diagonal of a counterfactual coupling kernel (ordinary posteriors) but systematically fail on off-diagonal cross-world couplings, collapsing to point estimates that are sometimes provably inadmissible. The authors formalize a world model as a positive semidefinite kernel K(T,T') over admissible possible worlds, showing the off-diagonal encodes counterfactual structure that more data cannot resolve. They demonstrate that PSD constraints provide partial identification bounds computable in polynomial time, that ontological axioms tighten these bounds, and that targeted constraint learning ('scars') closes the gap faster than untargeted approaches. The work has implications for causal reasoning in AI systems and the theoretical limits of learned world models.
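As an illustration of how a positive semidefinite constraint can bound an off-diagonal entry from the diagonal alone, the following is the elementary 2×2 consequence of PSD-ness (the Cauchy-Schwarz bound). It assumes a real symmetric kernel and is not claimed to be the specific bound derived in the preprint, which may be tighter.

```latex
\begin{pmatrix}
  K(T,T)  & K(T,T') \\
  K(T,T') & K(T',T')
\end{pmatrix} \succeq 0
\quad\Longrightarrow\quad
K(T,T)\,K(T',T') - K(T,T')^{2} \ge 0
\quad\Longrightarrow\quad
\lvert K(T,T') \rvert \le \sqrt{K(T,T)\,K(T',T')}
```

The diagonal entries pin down an interval for the cross-world coupling without determining a single value, which is the sense in which such constraints give partial identification.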

Evaluation and Benchmarking AI Safety Research WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

6 · arXiv · cs.AI · 25d ago

VeriTrace: Cognitive-Graph Framework with Explicit Regulatory Loops for Deep Research Agents

VeriTrace introduces a cognitive-graph framework for deep research agents that replaces implicit LLM reasoning over intermediate representations with three explicit regulatory loops: interpretive update, deviation feedback, and schema revision. The system addresses contamination and error propagation in evolving mental models during complex multi-step research tasks. Using Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench Insight and 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DeepResearch Bench.

Frontier Model Releases Evaluation and Benchmarking DeepSeek V4 cognitive-graph DeepResearch Bench +4 more

6 · Berkeley AI Research (BAIR) Blog · 1mo ago

Adaptive Parallel Reasoning: The Next Paradigm in Efficient Inference Scaling

A BAIR blog post surveys recent progress in parallel reasoning for LLMs, covering methods from simple self-consistency and Best-of-N sampling through structured search (Tree of Thoughts, MCTS) to newer adaptive approaches including ParaThinker, GroupThink, and Hogwild! Inference. The core motivation is that sequential reasoning scales linearly with exploration depth, causing latency, context-rot, and compute inefficiency. Adaptive parallel reasoning aims to let models themselves decide when and how to decompose tasks into concurrent threads, rather than imposing fixed parallel structure externally. The post frames this as an emerging inference-time scaling paradigm with implications for agentic and complex reasoning workloads.

Long Context Evolution Frontier Model Releases ParaThinker Berkeley AI Research (BAIR) DeepSeek V4 +11 more

6 · arXiv · cs.AI · 11d ago

AHA-WAM: Asynchronous world-action modeling with temporal decoupling for robot manipulation

AHA-WAM introduces a dual Diffusion Transformer architecture that decouples world prediction (low-frequency) from action execution (high-frequency) in robot manipulation policies, addressing the inefficiency of existing world-action models that force both branches to operate at the same temporal resolution. The system uses a rolling key-value memory video DiT as a long-horizon scene planner and a fast action DiT that queries layerwise latent context via joint attention, with Observation-Guided Video-Context Routing enabling asynchronous execution. On RoboTwin benchmarks, AHA-WAM achieves 92.80% average success and 78.3% on real-world tasks at 24.17 Hz, a 4.59x speedup over Fast-WAM, without robot-data pretraining.

Inference Economics RoboTwin Linear Diffusion Transformer Observation-Guided Video-Context Routing +2 more

4 · arXiv · cs.AI · 11d ago

IA-VQC-DPC: Intervention-aware quantum predictive control with safety attribution for learned policies

A new arXiv preprint introduces Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), a framework that trains variational quantum circuit policies under a primal-dual intervention budget to penalize over-reliance on downstream safety filters (Control-Barrier-Function projections). The work also proposes a safety-attribution protocol that decomposes trajectory corrections into policy-level versus filter-level contributions, enabling measurement of whether a policy has genuinely learned safe behavior or is merely being silently repaired by its safety layer. Experiments on BOPTEST building-control emulators show the quantum policy achieves significantly lower pre-filter violations than a matched classical policy at equal parameter budget, with a notable negative result: a learned energy head is only safe when paired with a distribution-aware runtime guard.

Evaluation and Benchmarking AI Safety Research BOPTEST Intervention-Aware Variational Quantum Differentiable Predictive Control Control-Barrier-Function

7 · Qwen Research · 1mo ago

QwQ-32B-Preview: Alibaba's Qwen Reasoning Model with Deep Reflection Capabilities

Alibaba's Qwen team has released QwQ-32B-Preview, a 32-billion parameter model designed for deep reasoning across mathematics, code, and general knowledge. The model is positioned as a reasoning-focused system that emphasizes uncertainty and iterative questioning as core design principles. It is available on GitHub, Hugging Face, ModelScope, and via a demo interface.

Frontier Model Releases Evaluation and Benchmarking Alibaba QwQ-32B-Preview Qwen +3 more

5 · arXiv · cs.AI · 1mo ago

Neurosymbolic Learning for Inference-Time Argumentation in Claim Verification

This paper introduces Inference-Time Argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification (true/false/uncertain) that integrates formal argumentation semantics with LLM training. The framework uses argumentation semantics both to guide LLM training for argument generation and scoring, and to compute final predictions deterministically from explicit argumentative structures. Unlike conventional reasoning models that rely on potentially unfaithful post-hoc explanations, ITA produces verdicts that are faithful by construction to the underlying arguments. Experiments on two ternary claim verification datasets show ITA outperforms argumentative baselines and competes with non-argumentative direct-prediction approaches.

Evaluation and Benchmarking AI Safety Research large language models Inference-Time Argumentation (ITA) ternary claim verification +3 more

6 · arXiv · cs.CL · 24d ago

Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference

PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.

Long Context Evolution Inference Economics On-Policy Distillation (OPD) Multi-Token Prediction (MTP) speculative decoding +7 more