SIMMER
benchmark · active · provisional
simmer-c3723c88 · 1 event · first seen 2d ago
Aliases: SIMMER
Recent events (1)
SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs
Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans within a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks, which catch only immediate execution failures, SIMMER uses a state machine executor to detect silent hazards and irreversible consequences. Experiments across six LLMs find that even frontier models produce error-free plans at most 17% of the time, and that up to 56% of plans contain latent failures, most of which lead to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, which points toward a mitigation direction.
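To illustrate the difference between an immediate execution failure and a latent failure, here is a minimal sketch of a state machine executor. It is not SIMMER's implementation: the state fields, the action names, and the `execute_plan` helper are all hypothetical and stand in for the much larger 77-action world model. The idea shown is only that a plan can run every step without error and still leave the world in a hazardous or irreversible state.

```python
from dataclasses import dataclass, field


@dataclass
class KitchenState:
    # Hypothetical toy state, far smaller than the benchmark's world model.
    stove_on: bool = False
    pan_on_stove: bool = False
    pan_has_food: bool = False
    food_burnt: bool = False  # irreversible once set


@dataclass
class PlanReport:
    immediate_failures: list = field(default_factory=list)
    latent_failures: list = field(default_factory=list)


def step(state, action):
    """Apply one action in place. Return an error string if a precondition fails."""
    if action == "place_pan":
        state.pan_on_stove = True
    elif action == "turn_on_stove":
        state.stove_on = True
    elif action == "turn_off_stove":
        state.stove_on = False
    elif action == "add_food":
        if not state.pan_on_stove:
            return "add_food: no pan on stove"
        state.pan_has_food = True
    elif action == "wait":
        # Waiting succeeds as an action, but silently burns unattended food.
        if state.stove_on and state.pan_has_food:
            state.food_burnt = True
    elif action == "serve":
        if not state.pan_has_food:
            return "serve: nothing to serve"
        state.pan_has_food = False
    else:
        return f"unknown action: {action}"
    return None


def execute_plan(plan):
    """Run a plan, recording immediate failures and end-state latent failures."""
    state = KitchenState()
    report = PlanReport()
    for action in plan:
        error = step(state, action)
        if error is not None:
            report.immediate_failures.append(error)  # failed step is skipped
    # Latent checks: every step may have succeeded, yet the end state is unsafe.
    if state.stove_on:
        report.latent_failures.append("stove left on")
    if state.food_burnt:
        report.latent_failures.append("food burnt (irreversible)")
    return report
```

In this toy setup, the plan `["place_pan", "turn_on_stove", "add_food", "serve"]` executes without any step-level error, so a checker that only looks at immediate failures would accept it, while the end-state check flags the stove that was never turned off. Appending `"turn_off_stove"` clears that hazard.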