arXiv cs.CL (Computation and Language) · 6d ago

SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs

Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans. It is built on a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks, which catch only immediate execution failures, SIMMER uses a state machine executor to detect silent hazards and irreversible consequences. Experiments across six LLMs find that even frontier models produce error-free plans at most 17% of the time, and up to 56% of plans contain latent failures, most of which lead to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, which points to a possible mitigation.
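To make the distinction between immediate and latent failures concrete, here is a minimal sketch of a state machine executor over a toy kitchen. It is not SIMMER's executor: the state fields, the five action names, and the `BURN_AFTER` threshold are all hypothetical, chosen only to illustrate how a plan can run without any step failing and still end in a hazard.

```python
from dataclasses import dataclass
from typing import Optional

BURN_AFTER = 3  # hypothetical threshold: heated steps before food burns


@dataclass
class KitchenState:
    stove_on: bool = False
    pan_on_stove: bool = False
    food_in_pan: bool = False
    heated_steps: int = 0
    food_burned: bool = False  # irreversible once set to True


def step(state: KitchenState, action: str) -> Optional[str]:
    """Apply one action in place; return an error message on immediate failure."""
    if action == "turn_on_stove":
        state.stove_on = True
    elif action == "turn_off_stove":
        state.stove_on = False
    elif action == "place_pan":
        state.pan_on_stove = True
    elif action == "add_food":
        if not state.pan_on_stove:
            return "no pan to add food to"
        state.food_in_pan = True
    elif action != "wait":
        return f"unknown action: {action}"

    # Background dynamics: food keeps heating while the stove stays on.
    if state.stove_on and state.pan_on_stove and state.food_in_pan:
        state.heated_steps += 1
        if state.heated_steps >= BURN_AFTER:
            state.food_burned = True
    return None


def execute(plan: list) -> dict:
    """Run a plan, then inspect the final state for latent failures."""
    state = KitchenState()
    for i, action in enumerate(plan):
        error = step(state, action)
        if error is not None:
            return {"status": "immediate_failure", "step": i, "detail": error}

    latent = []
    if state.food_burned:
        latent.append("food burned (irreversible)")
    if state.stove_on:
        latent.append("stove left on (silent hazard)")
    return {"status": "latent_failure" if latent else "ok", "latent": latent}
```

In this toy world, the plan `["place_pan", "add_food", "turn_on_stove", "wait"]` executes every step without error, yet `execute` reports `latent_failure` because the stove is never turned off. A checker that only watches for per-step errors would pass it.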

Evaluation and Benchmarking · AI Safety Research · Agent and Tool Ecosystem · SIMMER · SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

arXiv · cs.CL · 1mo ago

GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs

The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations—constraint satisfaction, state tracking, epistemic vigilance, audience calibration—over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model over 200k+ prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.
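For readers unfamiliar with item response theory, the standard 2-parameter logistic (2PL) model gives the probability of a correct response as 1 / (1 + exp(-a(θ - b))), where θ is the test-taker's ability, b is the item's difficulty, and a is its discrimination. The sketch below implements only that textbook formula. How the GIM authors map models and test configurations onto θ, and how they fit the parameters, is not described here, so the function and argument names are illustrative assumptions.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Textbook 2PL item response function.

    theta: ability of the test-taker (here, hypothetically, a model configuration)
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which success probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When ability equals difficulty (θ = b) the probability is exactly 0.5 regardless of a. A larger a makes the curve steeper around that point, so the item does more to separate test-takers whose ability is close to b.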

Frontier Model Releases · Evaluation and Benchmarking · 2-Parameter Logistic IRT Model · GIM (Grounded Integration Measure) · test-time compute

arXiv · cs.CL · 12d ago

AGENTSERVESIM: Hardware-aware simulator for multi-turn LLM agent serving policies

Researchers introduce AGENTSERVESIM, a simulation framework designed to evaluate serving policies for multi-turn LLM agents without requiring dedicated accelerator hardware. The simulator models program-level execution including turn dependencies, tool-induced gaps, and KV-cache residency across HBM, host DRAM, and CXL memory hierarchies. It reproduces real-system behavior within 6% error on key performance metrics while running on commodity CPUs, enabling cost-effective exploration of scheduling, routing, and cache management policies for agentic workloads.

Training Infrastructure · Inference Economics · AGENTSERVESIM

arXiv · cs.AI · 13d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

arXiv · cs.AI · 11d ago

Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks

A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.

Frontier Model Releases · Evaluation and Benchmarking · Flaws in the LLM Automation Narrative

arXiv · cs.CL · 1mo ago

STT-Arena: Benchmark for Adaptive Replanning Under Spatio-Temporal Dynamics in Tool-Using LLMs

STT-Arena is a new benchmark of 227 interactive tasks designed to evaluate LLMs' ability to detect mid-task disruptions and replan under spatio-temporal dynamics, covering nine conflict types and four solvability levels. Evaluation of frontier models, including Claude Opus 4.6, shows less than 40% overall accuracy, revealing fundamental limitations in dynamic reasoning. The authors identify three recurring failure modes (Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification) and propose an iterative trajectory refinement technique combined with online RL to train STT-Agent-4B, a 4B-parameter model that outperforms frontier LLMs on the benchmark.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · iterative trajectory refinement · spatio-temporal dynamic reasoning

arXiv · cs.CL · 11d ago

JANUS benchmark measures goal-conditioned pragmatic distortion in LLMs

Researchers introduce JANUS, a 160-scenario benchmark designed to measure a subtle but dangerous form of LLM deception: selective treatment of true facts to create misleading impressions, rather than outright fabrication. Each scenario provides a fixed fact pool and compares neutral versus goal-directed prompts (e.g., increasing adoption or enrollment), isolating pragmatic distortion from hallucination. Experiments across 12 LLMs reveal consistent goal-conditioned distortions, suggesting current models lack robust safeguards against selectively misleading communication. The benchmark and code are publicly released.

Evaluation and Benchmarking · AI Safety Research · JANUS · Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

arXiv · cs.CL · 11d ago

PhysTool-Bench reveals severe gaps in MLLM physical tool use and embodied planning

Researchers introduce PhysTool-Bench, the first benchmark evaluating multimodal LLMs on physical tool use across 2,510 queries and 2,678 real-world tools spanning manufacturing, electrical work, agriculture, and healthcare. Evaluation of 13 leading MLLMs shows even the best model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes just 21.0% of queries end-to-end. The results expose a two-level deficit: poor tool perception in realistic scenes and a much larger drop at the planning stage, indicating a lack of functional commonsense for mapping tools to task semantics. This pinpoints a critical bottleneck for embodied AI development.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Google · PhysTool-Bench · Gemini-3.1-Pro

arXiv · cs.CL · 22d ago

BeliefTrack: Benchmarking and Improving Contextual Belief Management in LLMs

This paper introduces Contextual Belief Management (CBM) as a framework for studying how LLMs should update, preserve, or ignore information across long-horizon interactions. The authors release BeliefTrack, a closed-world benchmark with symbolic verifiers enabling exact turn-level evaluation across Rule Discovery and Circuit Diagnosis tasks. Vanilla LLMs show severe CBM failures; reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average, while representation-level steering achieves a 46.1% reduction. Probing experiments reveal latent belief-state dynamics underlying these failures.

Evaluation and Benchmarking · Agent and Tool Ecosystem · reinforcement learning with belief-state rewards · Contextual Belief Management (CBM) · BeliefTrack