Almanac
← Events
6arXiv cs.AI (Artificial Intelligence)·1mo ago

Mind the Sim-to-Real Gap & Think Like a Scientist: Fisher-SEP for Simulation-Aided Experimental Policy

This paper studies when and how a planner should supplement a pre-trained simulator with real-world experiments in sequential decision problems. The authors decompose simulator value error into a calibration-deployment shift (identifiable via randomization) and an irreducible parametric residual, and show that purely passive learning cannot close the reachability component of the value gap. They propose Fisher-SEP, a simulation-aided experimental policy that minimizes posterior predictive variance of a target policy's value, with case studies in supply chain and HIV mobile-testing domains demonstrating regimes where designed exploration is necessary.

Related guides (3)

Related events (8)

6arXiv · cs.CL·5d ago·source ↗

SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs

Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans within a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks that only catch immediate execution failures, SIMMER detects silent hazards and irreversible consequences using a state machine executor. Experiments across six LLMs find that even frontier models produce error-free plans at most 17% of the time, with up to 56% of plans containing latent failures—most leading to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, pointing toward a mitigation direction.

7arXiv · cs.CL·23d ago·source ↗

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.

5Openai Blog·1mo ago·source ↗

Sim-to-real transfer of robotic control with dynamics randomization

OpenAI published research on transferring robotic control policies trained in simulation to real-world robots using dynamics randomization. The technique involves varying physical parameters during simulation training so that the real world appears as just another variation, enabling zero-shot sim-to-real transfer. This was an early foundational contribution to the sim-to-real robotics research thread.

5arXiv · cs.CL·11d ago·source ↗

AGENTSERVESIM: Hardware-aware simulator for multi-turn LLM agent serving policies

Researchers introduce AGENTSERVESIM, a simulation framework designed to evaluate serving policies for multi-turn LLM agents without requiring dedicated accelerator hardware. The simulator models program-level execution including turn dependencies, tool-induced gaps, and KV-cache residency across HBM, host DRAM, and CXL memory hierarchies. It reproduces real-system behavior within 6% error on key performance metrics while running on commodity CPUs, enabling cost-effective exploration of scheduling, routing, and cache management policies for agentic workloads.

5arXiv · cs.LG·3d ago·source ↗

Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control

Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows 17% improvement in episode reward and 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.

4Openai Blog·1mo ago·source ↗

Generalizing from Simulation: OpenAI Sim-to-Real Robotics Transfer

OpenAI published results on sim-to-real transfer for robot controllers, demonstrating that policies trained entirely in simulation can be deployed on physical robots and respond to unplanned environmental changes. The work represents a shift from open-loop to closed-loop control systems in robotics. This is a 2017 research milestone predating current frontier model work but relevant to the historical trajectory of OpenAI's robotics program.

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

GRASP: Gradient-based Planning for World Models at Longer Horizons

Researchers from Berkeley, Meta, and collaborators introduce GRASP, a gradient-based planner designed to make long-horizon planning with learned world models more robust. The method addresses three core failure modes: ill-conditioned computation graphs from backpropagation through time, non-greedy loss landscapes with many local minima, and brittle gradients through high-dimensional vision models. GRASP lifts trajectory optimization into virtual states for parallel optimization across time, injects stochasticity into state iterates for exploration, and reshapes gradients to avoid problematic state-input gradient paths. The work is positioned in the context of scaling world models toward general-purpose simulators usable for control and planning.

7arXiv · cs.CL·1mo ago·source ↗

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory is a fully automated framework for training tool-use LLM agents via Agentic Reinforcement Learning, addressing two key bottlenecks: scalable execution environments and realistic multi-turn training data. It autonomously constructs stateful, executable tool environments from authentic resources and synthesizes natural trajectories with implicit human intents via topology-aware sampling. Using only 85 verified environments across 7 domains, it generates 2,575 SFT and RL trajectories and improves Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks, outperforming prior approaches that use 5x more environments.