6 · arXiv cs.AI (Artificial Intelligence) · 3d ago

VERITAS: Visual verification enables inference-time steering and autonomous improvement for robot policies

Researchers introduce VERITAS, a generator-verifier framework that pairs a pre-trained generalist robot policy with a gradient-free visual verifier to steer actions at inference time without additional training. Verified rollouts are also used for offline self-improvement via fine-tuning, achieving performance gains comparable to those obtained with expert demonstrations, but without human intervention. The work demonstrates that inference-time verification is a scalable mechanism for autonomous policy improvement during deployment.
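The generator-verifier pattern described above can be illustrated with a generic best-of-N selection loop. This is a minimal sketch under stated assumptions, not the VERITAS implementation: the summary does not specify how candidates are sampled or scored, so `propose`, `score`, `steer_with_verifier`, `keep_verified`, and the threshold are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")
Rollout = TypeVar("Rollout")


def steer_with_verifier(
    propose: Callable[[Obs], Act],       # hypothetical: samples one action from the frozen policy
    score: Callable[[Obs, Act], float],  # hypothetical: verifier score, higher is better
    observation: Obs,
    num_candidates: int = 8,
) -> Tuple[Act, float]:
    """Sample several candidate actions and return the one the verifier prefers.

    No gradients are taken and neither model is updated: steering happens
    purely by selection among the generator's own samples.
    """
    candidates = [propose(observation) for _ in range(num_candidates)]
    scores = [score(observation, action) for action in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def keep_verified(
    scored_rollouts: List[Tuple[Rollout, float]],
    threshold: float,
) -> List[Tuple[Rollout, float]]:
    """Keep only rollouts whose verifier score meets the (assumed) threshold.

    The surviving rollouts could then serve as an offline fine-tuning set.
    """
    return [item for item in scored_rollouts if item[1] >= threshold]
```

In this sketch the policy and the verifier stay fixed at inference time; only the choice among sampled candidates changes, and the filtered rollouts are what a later fine-tuning stage would consume.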

Inference Economics Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement VERITAS

Related guides (1)

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

7 · arXiv · cs.CL · 22d ago

Self-Trained Verification (STV) Unlocks Training- and Test-Time Self-Improvement for Reasoning Models

This paper introduces Self-Trained Verification (STV), a method that trains a verifier to imitate a more informed version of itself by using reference solutions as a supervision signal, addressing the core bottleneck in both test-time verification-refinement loops and self-training pipelines. At test time, STV roughly doubles accuracy on hard math and achieves a 14x lift on scientific reasoning tasks. At training time, the authors combine STV with RL in a procedure called Verifier-in-the-Loop (ViL) training, yielding a 33% further gain in pass@1 over an already RL-converged generator, with standalone pass@1 climbing 30% relative past standard RL convergence. The work argues that verification quality, not generation, is the primary bottleneck for scaling reasoning on hard problems.

Frontier Model Releases Evaluation and Benchmarking self-training Verifier-in-the-Loop Training (ViL) Self-Trained Verification (STV) +4 more

6 · OpenAI Blog · 1mo ago

Prover-Verifier Games improve legibility of language model outputs

OpenAI presents research on prover-verifier games as a mechanism to improve the legibility and verifiability of language model outputs. The approach frames output generation as a game between a prover (the model producing solutions) and a verifier (checking correctness), incentivizing clearer, more human-auditable reasoning. The work targets a core alignment challenge: ensuring AI-generated solutions are interpretable and trustworthy to both humans and automated systems.

Evaluation and Benchmarking AI Safety Research Prover-Verifier Games OpenAI scalable oversight +1 more

6 · arXiv · cs.AI · 23d ago

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

OmniVerifier-M1 is a generalist visual verifier trained using symbolic meta-verification rationales (e.g., bounding boxes) and decoupled reinforcement learning objectives for binary judgment versus meta-verification. The paper finds that symbolic verifier outputs outperform textual explanations as rationales, enabling rule-based RL rewards without auxiliary judge models, and that decoupling RL objectives substantially improves performance over joint optimization. The system further enables M1-TTS, a verifier-driven agentic generation pipeline supporting dynamic region-level self-correction in multimodal outputs.

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Large Language Models multimodal meta-verification decoupled reinforcement learning +8 more

6 · arXiv · cs.AI · 9d ago

CHORUS: Single VLA policy enables decentralized multi-robot collaboration without inter-robot communication

CHORUS is a framework that adapts a single vision-language-action (VLA) backbone to control diverse multi-robot teams in a fully decentralized manner, with each robot running an independent copy conditioned only on its own observations and a robot-identifying prompt. Real-world experiments across tasks such as tape measurement, book handovers, and laundry basket lifting show a 64-percentage-point improvement over decentralized from-scratch models and a 40-point improvement in reactivity to teammate behavior, while outperforming centralized baselines. The key insight is that pretrained VLA visuomotor priors are sufficient to enable reactive coordination without explicit inter-robot communication or alignment procedures at inference time.

Agent and Tool Ecosystem Multimodal Progress CHORUS CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

6 · arXiv · cs.CL · 8d ago

LabVLA: Vision-Language-Action model and RoboGenesis data engine for scientific laboratory robotics

Researchers introduce LabVLA, a Vision-Language-Action model designed to bridge written scientific protocols and physical robot execution in laboratory settings. To address the data scarcity problem, they build RoboGenesis, a simulation-based data engine that composes lab workflows from atomic skills and generates structured demonstrations across robot embodiments. LabVLA uses a two-stage training recipe combining FAST action token pretraining on a Qwen3-VL-4B-Instruct backbone with flow matching posttraining via a DiT action expert. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among evaluated baselines in both in-distribution and out-of-distribution settings.

Agent and Tool Ecosystem Multimodal Progress LabVLA LabUtopia Qwen3-4B-Instruct +3 more

5 · arXiv · cs.AI · 47h ago

Distributionally robust optimization framework for probabilistic runtime verification of AI agents

A new arXiv preprint introduces a sound and efficient framework for verifying probabilistic security policies for AI agents operating in complex digital environments, addressing limitations of prior Datalog-based approaches that assumed deterministic policies or predicate independence. The method uses distributionally robust optimization to compute sound upper bounds on policy violation probability without requiring independence assumptions between predicates. Evaluated on benchmarks for terminal and tool-calling agents, the approach outperforms prior art on the security-utility trade-off.

AI Safety Research Agent and Tool Ecosystem Datalog Efficient and Sound Probabilistic Verification for AI Agents distributionally robust optimization

6 · arXiv · cs.AI · 5d ago

Self-improving VLMs can silently regress when verifier quality is task-mismatched

A new arXiv paper demonstrates that verifier-driven self-DPO, a common recipe for self-improving vision-language models, can silently degrade student model performance when the verifier's task-rubric accuracy is insufficient for the target task. Experiments on Qwen-3-VL-2B and Qwen-2.5-VL-3B across MathVista, MMMU, and BLINK show regressions of 3.4–10.9 percentage points below frozen baselines, with the counterintuitive finding that verifiers that are more accurate but still wrong cause larger regressions than near-random ones. The authors provide a mechanistic explanation via a variance theorem for progress-gated replay and offer operational guidance: measure target-task rubric accuracy before running any verifier-driven loop, and rank verifiers by task-specific quality rather than parameter count.

Evaluation and Benchmarking Alignment and RLHF MathVista When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks BLINK +5 more

6 · arXiv · cs.AI · 25d ago

VeriTrace: Cognitive-Graph Framework with Explicit Regulatory Loops for Deep Research Agents

VeriTrace introduces a cognitive-graph framework for deep research agents that replaces implicit LLM reasoning over intermediate representations with three explicit regulatory loops: interpretive update, deviation feedback, and schema revision. The system addresses contamination and error propagation in evolving mental models during complex multi-step research tasks. Using Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench Insight and by 5.9 pp in Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DeepResearch Bench.

Frontier Model Releases Evaluation and Benchmarking DeepSeek V4 cognitive-graph DeepResearch Bench +4 more