Entity · benchmark

SIMMER

benchmark · active · simmer-c3723c88 · 1 event · first seen Jun 15, 2026

Aliases: SIMMER

Co-occurring entities

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

More like this (12)

FigSIM · SIMPLER · SIMA · MODSetter · SimPO · SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model · SiT · MIRAGE · ROMS-IMLE · HAMON · QUIET · ASAM

Recent events (1)

6 · arXiv · cs.CL · Jun 15, 2026

SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs

Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans within a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks that only catch immediate execution failures, SIMMER detects silent hazards and irreversible consequences using a state machine executor. Experiments across six LLMs find that even frontier models produce error-free plans at most 17% of the time, with up to 56% of plans containing latent failures—most leading to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, pointing toward a mitigation direction.
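The distinction the summary draws between immediate execution failures and latent failures can be illustrated with a toy state-machine executor. This is only a minimal sketch under stated assumptions: the `KitchenState` fields, the action names, the three-step burn rule, and the `evaluate` function are all hypothetical and are not taken from SIMMER's actual world model or code.

```python
from dataclasses import dataclass


@dataclass
class KitchenState:
    # Hypothetical toy state; not SIMMER's actual world model.
    stove_on: bool = False
    pan_on_stove: bool = False
    pan_has_food: bool = False
    food_burnt: bool = False      # irreversible once set
    heated_steps: int = 0         # steps the food has spent on a lit stove


class ExecutionError(Exception):
    """Immediate failure: an action's precondition does not hold."""


def step(state: KitchenState, action: str) -> None:
    """Apply one action, then advance the silent world dynamics."""
    if action == "place_pan":
        state.pan_on_stove = True
    elif action == "add_food":
        if not state.pan_on_stove:
            raise ExecutionError("no pan to add food to")
        state.pan_has_food = True
    elif action == "turn_on_stove":
        state.stove_on = True
    elif action == "turn_off_stove":
        state.stove_on = False
    elif action == "wait":
        pass
    else:
        raise ExecutionError(f"unknown action: {action}")

    # Silent dynamics (assumed rule): food on a lit stove burns after 3 steps.
    # No action fails when this happens, which is what makes it latent.
    if state.stove_on and state.pan_has_food:
        state.heated_steps += 1
        if state.heated_steps >= 3:
            state.food_burnt = True


def evaluate(plan: list[str]) -> str:
    """Run a plan and classify it as ok, immediate failure, or latent failure."""
    state = KitchenState()
    for action in plan:
        try:
            step(state, action)
        except ExecutionError as exc:
            return f"immediate failure: {exc}"

    hazards = []
    if state.food_burnt:
        hazards.append("food burnt (irreversible)")
    if state.stove_on:
        hazards.append("stove left on")
    if hazards:
        return "latent failure: " + "; ".join(hazards)
    return "ok"


if __name__ == "__main__":
    print(evaluate(["add_food"]))
    print(evaluate(["place_pan", "add_food", "turn_on_stove", "wait", "wait", "wait"]))
    print(evaluate(["place_pan", "add_food", "turn_on_stove", "wait", "turn_off_stove"]))
```

In this sketch the second plan executes every action without raising an error, yet ends with burnt food and a lit stove; a checker that only watches for execution errors would pass it. Running `evaluate` on a candidate plan before committing to it is a crude stand-in for the idea of simulating consequences ahead of time. The summary does not describe how the paper's counterfactual foresight simulation is implemented.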

Evaluation and Benchmarking · AI Safety Research · SIMMER · SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model