SIMMER: Benchmarking Latent Failures in LLM Planning

A kitchen-domain world model with a state-machine executor exposes “latent failures” in LLM plans, the silent and often irreversible kind that execution-success benchmarks miss, and finds that frontier models produce at most 17% error-free plans.

When an LLM-generated plan fails, we usually notice. A precondition is violated, an action cannot execute, the system halts, and the error is visible enough to correct. But there is a more insidious category of failure that a new arXiv benchmark is built to expose: the plan that keeps running while quietly destroying the very goal it was supposed to achieve. “SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model” (2606.14574), by Xiaoxin Lu, Ranran Haoran Zhang, and Rui Zhang, argues that these latent failures are the blind spot of current planning evaluation — and in the worst cases, they are irreversible.

The distinction the paper draws is the whole point. An immediate failure triggers instant feedback at execution time; the agent hits a wall, gets a signal, and can replan. A latent failure does not halt execution at all. It silently compromises goal achievement, and because nothing stops, nothing alerts. The plan proceeds, step after valid-looking step, toward an outcome that fails to satisfy the goal — and in severe cases causes irreversible harm that no amount of later replanning can undo. Existing benchmarks, the authors observe, largely evaluate whether a plan executes successfully, which by construction overlooks failures that execute just fine and still defeat the objective.

“Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures.” (arXiv:2606.14574)

A kitchen as a formal world model

To catch failures that don't announce themselves, you need a model of the world precise enough to know what each action actually does to the state of things. SIMMER supplies one, grounded in the kitchen domain — a setting rich in exactly the irreversible, order-sensitive actions that make latent failures vivid. You cannot un-crack an egg or un-burn a sauce, and the consequences of acting on the wrong object compound. The benchmark is built on a human-curated symbolic world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions. Those interactions are described as semantically realistic, derived from real-world cooking scripts rather than invented at random, which is what gives the benchmark its grounding in plausible physical cause and effect.

On top of that world model sits a state-machine executor, and this is the technical heart of the contribution. Rather than merely checking whether each action is individually well-formed, the executor validates a plan step by step against the world model and detects three distinct hazard types. It catches immediate precondition violations — the conventional, visible failures. It catches latent hazards — the silent compromises that motivate the whole paper. And it catches irreversible failures — the subset of latent hazards whose damage cannot be undone. By tracking the evolving state of the simulated world, the executor can see that a plan which never errored out nonetheless arrived somewhere it should not have, which a surface-level success check would never reveal.
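The step-by-step validation described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's executor: the `Action` fields, the `hazardous_facts` set, and the kitchen facts are all hypothetical names chosen for the example, and treating a latent hazard as "an action that adds a goal-defeating fact" is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from enum import Enum


class Hazard(Enum):
    IMMEDIATE = "immediate precondition violation"
    LATENT = "latent hazard"
    IRREVERSIBLE = "irreversible failure"


@dataclass(frozen=True)
class Action:
    # Hypothetical schema: facts required, facts added, facts removed.
    name: str
    preconditions: frozenset = frozenset()
    adds: frozenset = frozenset()
    removes: frozenset = frozenset()
    irreversible: bool = False  # assumed flag: effects cannot be undone later


def execute(plan, initial_state, hazardous_facts):
    """Run a plan step by step and return (final_state, findings)."""
    state = set(initial_state)
    findings = []
    for step, action in enumerate(plan):
        # Immediate failure: a precondition is missing, so execution halts.
        if not action.preconditions <= state:
            findings.append((step, action.name, Hazard.IMMEDIATE))
            break
        # Latent failure: the action runs, but introduces a hazardous fact.
        newly_hazardous = (action.adds & hazardous_facts) - state
        state = (state - action.removes) | action.adds
        if newly_hazardous:
            kind = Hazard.IRREVERSIBLE if action.irreversible else Hazard.LATENT
            findings.append((step, action.name, kind))
    return state, findings


PLAN = [
    Action("leave_milk_on_counter",
           preconditions=frozenset({"milk_in_fridge"}),
           adds=frozenset({"milk_warming"}),
           removes=frozenset({"milk_in_fridge"})),
    Action("simmer_sauce_on_high",
           preconditions=frozenset({"sauce_in_pan"}),
           adds=frozenset({"sauce_burnt"}),
           irreversible=True),
    Action("whisk_eggs",
           preconditions=frozenset({"eggs_in_bowl"}),
           adds=frozenset({"eggs_whisked"})),
]

final_state, findings = execute(
    PLAN,
    initial_state={"milk_in_fridge", "sauce_in_pan"},
    hazardous_facts={"milk_warming", "sauce_burnt"},
)
for step, name, hazard in findings:
    print(step, name, hazard.value)
```

In this sketch the first two steps execute without any error and are flagged only because the executor tracks the resulting state; a check that looked solely at whether each step ran would report a problem at the third step alone.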

Frontier models do badly, and the failures are severe

The experimental results are stark. Across six LLMs, even frontier models achieve at most 17% error-free plans. That number deserves to sink in: in a structured, well-specified kitchen domain, the best models produce a fully clean plan fewer than one time in five. More troubling than the low success rate is the nature of the failures. Up to 56% of plans contain latent failures — the silent kind that would slip past an execution-success check — and the majority of those lead to irreversible consequences. The dominant failure mode is therefore precisely the one existing benchmarks are blind to and the one that matters most for safety: not the plan that visibly breaks, but the plan that quietly and permanently does the wrong thing.

This reframes a comfortable assumption baked into a lot of agent evaluation. A plan that “executes successfully” is not the same as a plan that achieves its goal without latent harm. SIMMER's results suggest the gap between those two standards is enormous, and that current planners are far less reliable than execution-success metrics imply once you start checking for silent, irreversible compromise.

A fix, and an honest boundary

The paper does not stop at diagnosis. It shows that explicit state reasoning via counterfactual foresight simulation — having the model reason forward about what each action would do to the world state before committing to it — can reduce latent failures by up to 72% and irreversible cases by up to 75%. That is a large effect, and it points in a clear direction: the cure for silent failure is making the planner simulate consequences rather than emit actions and hope. A model that asks “what would the world look like after this step?” before acting catches a substantial share of the hazards it would otherwise walk straight into.
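The idea of simulating before committing can be illustrated with a minimal Python sketch. In the paper the forward reasoning is done by the language model itself; here a symbolic transition function stands in for it, and every name (`Action`, `simulate`, `plan_with_foresight`, the candidate lists, the kitchen facts) is a hypothetical choice for illustration rather than anything taken from SIMMER.

```python
from typing import NamedTuple


class Action(NamedTuple):
    # Hypothetical schema: an action adds and removes symbolic facts.
    name: str
    adds: frozenset
    removes: frozenset = frozenset()


def simulate(state, action):
    """Predict the next state without changing the current one."""
    return (frozenset(state) - action.removes) | action.adds


def plan_with_foresight(candidates_per_step, initial_state, hazardous_facts):
    """Commit, at each step, to the first candidate whose predicted state is safe."""
    state = frozenset(initial_state)
    committed = []
    for candidates in candidates_per_step:
        for action in candidates:
            predicted = simulate(state, action)
            if predicted & hazardous_facts:
                continue  # counterfactual outcome is hazardous, so skip it
            state = predicted
            committed.append(action.name)
            break
    return committed, state


CANDIDATES = [
    [Action("simmer_sauce_on_high", frozenset({"sauce_burnt"})),
     Action("simmer_sauce_on_low", frozenset({"sauce_reduced"}))],
    [Action("plate_sauce", frozenset({"sauce_plated"}))],
]

committed, final_state = plan_with_foresight(
    CANDIDATES,
    initial_state={"sauce_in_pan"},
    hazardous_facts=frozenset({"sauce_burnt"}),
)
print(committed)
```

The design point is that the hazardous action is rejected on a predicted state, before anything irreversible has happened to the real one. If no candidate at a step is predicted to be safe, this sketch simply commits to nothing for that step; a real planner would need a replanning strategy there.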

The appropriate caveats follow from the design. SIMMER is grounded in a single domain, the kitchen, and a symbolic world model is necessarily an abstraction of physical reality: its roughly 46,800 interactions are curated, not exhaustive of the messy real world. The reliability of the latent-failure and irreversibility judgments rests on the fidelity of that hand-built model and executor, so the precise figures travel best as evidence about LLM planning behavior in a controlled setting rather than as universal constants. But the conceptual contribution generalizes well beyond the kitchen. SIMMER operationalizes a category of failure (silent, goal-defeating, sometimes irreversible) that the field has largely been unable to measure, gives it a concrete executor that can detect it, and demonstrates both how common it is and that forcing models to simulate state can substantially reduce it. For anyone deploying LLMs as planners in environments where actions have lasting consequences, that pairing of a measurement tool and a mitigation deserves to be taken seriously.
