Entity · paper

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

paper · active · simmer-benchmarking-latent-failures-in-llm-executable-planning-with-a-world-model-60fae0d8 · 1 event · first seen Jun 15, 2026

Co-occurring entities

SIMMER

More like this (12)

Co-LMLM: Continuous-Query Limited Memory Language Models
Do Language Models Dream of Binding Molecules? Benchmarking LLMs under Spatial Constraints
Towards Root Memories: Benchmarking and Enhancing Implicit Logical Memory Retrieval for Personalized LLMs
OpenSCAD Architectural 3D LLM Benchmark
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
Win by Silence: Deletion Non-Monotonicity, Autonomous Exploitation, and Typed-State Gating in LLM Plan Evaluation
LMs as Task-Specific Knowledge Bases: An Interpretability Analysis
AdaJEPA: An Adaptive Latent World Model
Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
Courteous Anticipation: Improving Long-Lived Task Planning in Persistent Shared Environments
Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability
Always-OnAgents: A Survey of Persistent Memory, State, and Governance in LLM Agents

Recent events (1)

6 · arXiv · cs.CL · Jun 15, 2026

SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs

Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans within a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks, which catch only immediate execution failures, SIMMER uses a state machine executor to detect silent hazards and irreversible consequences. Experiments across six LLMs find that even frontier models produce error-free plans at most 17% of the time, and that up to 56% of plans contain latent failures, most of which lead to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, which points to a possible mitigation direction.
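
To illustrate the distinction the summary draws between immediate execution failures and latent failures, here is a minimal toy sketch of a state machine executor. It is not SIMMER's code or API: the `KitchenState` fields, the five action names, the `BURN_AFTER` threshold, and the `choose_plan` helper are all hypothetical and chosen only for illustration. The last function is a loose stand-in for the idea of simulating a plan before committing to it, not the paper's counterfactual foresight method.

```python
from dataclasses import dataclass

BURN_AFTER = 2  # hypothetical: food burns after more than 2 heated "wait" steps


@dataclass
class KitchenState:
    """Toy world state (all fields are illustrative assumptions)."""
    stove_on: bool = False
    pan_on_stove: bool = False
    pan_has_food: bool = False
    cook_ticks: int = 0
    food_burnt: bool = False  # irreversible once set to True


def step(state, action):
    """Apply one action in place; return an error string if it cannot execute."""
    if action == "place_pan":
        state.pan_on_stove = True
    elif action == "add_food":
        if not state.pan_on_stove:
            return "no pan on the stove to hold the food"
        state.pan_has_food = True
    elif action == "turn_on_stove":
        state.stove_on = True
    elif action == "turn_off_stove":
        state.stove_on = False
    elif action == "wait":
        # Time passing never raises an error, which is why its
        # consequences are silent to an executor that only checks errors.
        if state.stove_on and state.pan_has_food:
            state.cook_ticks += 1
            if state.cook_ticks > BURN_AFTER:
                state.food_burnt = True
    else:
        return f"unknown action: {action}"
    return None


def evaluate(plan):
    """Run a plan and classify it as ok, immediate_failure, or latent_failure."""
    state = KitchenState()
    for i, action in enumerate(plan):
        error = step(state, action)
        if error is not None:
            return {"status": "immediate_failure", "step": i, "detail": error}
    # Every step executed; now inspect the final state for silent problems.
    latent = []
    if state.food_burnt:
        latent.append("food burnt (irreversible)")
    if state.stove_on:
        latent.append("stove left on (hazard)")
    if latent:
        return {"status": "latent_failure", "detail": latent}
    return {"status": "ok"}


def choose_plan(candidates):
    """Simulate each candidate first; return the first one with no failure."""
    for plan in candidates:
        if evaluate(plan)["status"] == "ok":
            return plan
    return None


cook = ["place_pan", "add_food", "turn_on_stove", "wait", "wait"]
print(evaluate(["add_food"]))                # fails at the first step
print(evaluate(cook))                        # executes fully, stove left on
print(evaluate(cook + ["turn_off_stove"]))   # no failure detected
```

In this sketch the second plan would pass a check that only looks for execution errors, since every action succeeds; the problem shows up only when the final state is inspected.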

Evaluation and Benchmarking · AI Safety Research · SIMMER · SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model