
SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model




Recent events (1)

arXiv · cs.CL · 2d ago

SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs

Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans. It is built on a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks, which catch only immediate execution failures, SIMMER uses a state machine executor to detect silent hazards and irreversible consequences. In experiments across six LLMs, even frontier models produce error-free plans at most 17% of the time, and up to 56% of plans contain latent failures, most of which lead to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, suggesting a direction for mitigation.
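To make the distinction between immediate and latent failures concrete, here is a minimal toy sketch of a state machine executor. It is not SIMMER's implementation: the action names, the `KitchenState` fields, and the single end-of-plan hazard check ("stove left on") are all hypothetical, chosen only to illustrate how a plan can execute without any step failing and still end in a hazardous state.

```python
from dataclasses import dataclass, field


@dataclass
class KitchenState:
    # Hypothetical toy state; the real world model is far larger.
    pan_on_stove: bool = False
    stove_on: bool = False
    egg_in_pan: bool = False
    served: bool = False


@dataclass
class Report:
    immediate: list = field(default_factory=list)  # (step index, action, reason)
    latent: list = field(default_factory=list)     # hazards found in the final state


def apply_action(state: KitchenState, action: str):
    """Apply one action in place; return a reason string if a precondition fails."""
    if action == "place_pan":
        state.pan_on_stove = True
    elif action == "turn_on_stove":
        state.stove_on = True
    elif action == "turn_off_stove":
        state.stove_on = False
    elif action == "add_egg":
        if not state.pan_on_stove:
            return "no pan on stove"
        state.egg_in_pan = True
    elif action == "serve":
        if not state.egg_in_pan:
            return "nothing to serve"
        state.egg_in_pan = False
        state.served = True
    else:
        return "unknown action"
    return None


def execute(plan):
    """Run a plan from a fresh state and collect both kinds of failure."""
    state, report = KitchenState(), Report()
    for i, action in enumerate(plan):
        reason = apply_action(state, action)
        if reason is not None:
            # Immediate failure: the step itself cannot run.
            report.immediate.append((i, action, reason))
    # Latent failure: every step ran, but the final state is hazardous.
    if state.stove_on:
        report.latent.append("stove left on")
    return report


safe = ["place_pan", "turn_on_stove", "add_egg", "turn_off_stove", "serve"]
risky = ["place_pan", "turn_on_stove", "add_egg", "serve"]
print(execute(safe).latent)   # []
print(execute(risky).latent)  # ['stove left on']
```

The `risky` plan raises no immediate failure, so a checker that only watches for step-level errors would accept it. Because `execute` always starts from a fresh state, the same function could in principle be run on a candidate plan before committing to it, which is loosely the idea behind simulating a plan's consequences ahead of time.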