SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model
SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs
Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans. It is built on a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks, which catch only immediate execution failures, SIMMER uses a state machine executor to detect silent hazards and irreversible consequences. In experiments across six LLMs, even frontier models produce error-free plans at most 17% of the time, and up to 56% of plans contain latent failures, most of which lead to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, suggesting a direction for mitigation.
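To make the distinction between immediate and latent failures concrete, here is a minimal sketch of a toy state machine executor. It is not SIMMER's implementation: the state fields, action names, burn threshold, and the `run_plan` helper are all hypothetical and chosen only for illustration. An immediate failure stops execution at the offending step, while a latent failure is only visible by inspecting the state after a plan has run to completion.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class KitchenState:
    # Hypothetical toy state, not the benchmark's actual world model.
    stove_on: bool = False
    pan_on_stove: bool = False
    food_in_pan: bool = False
    heat_ticks: int = 0       # steps the food has spent on a lit stove
    food_burnt: bool = False  # irreversible once set


def apply_action(state: KitchenState, action: str) -> Optional[str]:
    """Apply one action in place; return an error string on immediate failure."""
    if action == "place_pan_on_stove":
        state.pan_on_stove = True
    elif action == "add_food_to_pan":
        if not state.pan_on_stove:
            return "add_food_to_pan: no pan on the stove"
        state.food_in_pan = True
    elif action == "turn_on_stove":
        state.stove_on = True
    elif action == "turn_off_stove":
        state.stove_on = False
    elif action != "wait":
        return f"unknown action: {action}"
    # Time passes after every successful action; heat accumulates silently.
    if state.stove_on and state.food_in_pan:
        state.heat_ticks += 1
        if state.heat_ticks > 2:  # assumed burn threshold
            state.food_burnt = True
    return None


def run_plan(plan: List[str]) -> Dict[str, object]:
    """Execute a plan from a fresh state and report both failure kinds."""
    state = KitchenState()
    for i, action in enumerate(plan):
        error = apply_action(state, action)
        if error is not None:
            return {"immediate_failure": f"step {i}: {error}",
                    "latent_failures": []}
    # The plan executed without errors; now look for silent hazards.
    latent = []
    if state.stove_on:
        latent.append("stove left on")
    if state.food_burnt:
        latent.append("food burnt (irreversible)")
    return {"immediate_failure": None, "latent_failures": latent}


risky = ["place_pan_on_stove", "add_food_to_pan", "turn_on_stove",
         "wait", "wait", "wait"]
safe = ["place_pan_on_stove", "add_food_to_pan", "turn_on_stove",
        "wait", "turn_off_stove"]
print(run_plan(risky))
print(run_plan(safe))
```

In this sketch the `risky` plan raises no execution error at any step, yet it ends with the stove on and the food burnt, which is the kind of outcome an execution-only check would miss. Running a candidate plan through such an executor before committing to it is one loose reading of simulating ahead; the paper's counterfactual foresight method may work quite differently.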