Most software fails loudly. A process crashes, a request returns a 500, a test goes red, and an alert fires. The premise of a sobering new arXiv paper is that LLM agent systems have invented a quieter and more dangerous way to fail — one where the error never reaches a human in any actionable form, and in the worst case is actively disguised as a correct answer. “When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime” (2606.14589), by Wei Wu, is a field report from inside a real, continuously running system, and it reads less like a benchmark paper than like an aviation incident review.

The system under study is a personal-assistant agent runtime that has been in continuous production since March 2026. The paper describes its scale with unusual specificity: roughly 40 scheduled jobs, 8 LLM providers, a tool-governance proxy, and a knowledge-base memory plane, all defended by 4,286 unit tests and 827 governance checks. That defensive surface matters, because the central finding is about what got through it anyway. Over eight weeks, the author documented 22 incidents with full root-cause postmortems, and within them identified one recurring meta-pattern — a failure whose error signal never reaches a human in actionable form — that manifested at least 28 times. The artifacts and postmortems are public, which is what elevates this from anecdote to evidence.

Five ways an agent fails in silence

From those incidents the paper derives a five-class, mechanism-oriented taxonomy. Class A covers environment and platform quirks — the messy realities of the infrastructure the agent runs on. Class B covers design-assumption mismatches, where the system behaves as built but the build rested on a wrong premise. Class C is error swallowing and dilution, where a real error is caught and then quietly discarded or weakened until it no longer signals anything. Class E covers operational omission and forensic blind spots — the gaps where no logging or test was watching, so a failure leaves no trace to investigate.
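Class C is the easiest of these to picture in code. The sketch below is a hypothetical illustration, not taken from the paper: the function names and the calendar scenario are assumptions made up for this example. It shows how a caught-and-discarded exception makes a failed fetch indistinguishable from a legitimately empty result, and how keeping the error in the return value preserves the signal.

```python
# Hypothetical illustration of Class C (error swallowing).
# None of these names come from the paper.

def fetch_events_swallowing(fetch):
    """Swallows the error: a failed fetch looks identical to an empty day."""
    try:
        return fetch()
    except Exception:
        return []  # the error signal is discarded here


def fetch_events_loud(fetch):
    """Keeps the error: failure and 'no events' stay distinguishable."""
    try:
        return {"ok": True, "events": fetch(), "error": None}
    except Exception as exc:
        return {"ok": False, "events": [], "error": f"{type(exc).__name__}: {exc}"}


def broken_calendar():
    # Assumed failure mode for the example.
    raise TimeoutError("calendar API timed out")


def empty_calendar():
    return []


# The swallowing version cannot tell these two cases apart.
swallowed_failure = fetch_events_swallowing(broken_calendar)
swallowed_empty = fetch_events_swallowing(empty_calendar)

# The loud version can.
loud_failure = fetch_events_loud(broken_calendar)
loud_empty = fetch_events_loud(empty_calendar)
```

Anything downstream of the first function, including a model asked to summarize the day, receives an empty list in both cases and has no way to know that one of them was an outage.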

It is Class D that the paper singles out as both unique to LLM systems and the most dangerous: chained hallucination and fabrication. Here the system does not merely fail to report an error. The LLM transforms the error into fluent, plausible narrative and delivers it to the user as if it were a real result. The author gives this its own name — fail-plausible — and offers a precise definition by analogy to the literature on gray failure. Gray failure is characterized by differential observability, where different observers see the system's health differently. Fail-plausible escalates that idea: the observer is not merely blind to the failure, the author writes, but is convincingly lied to by the failure itself. In a conventional system a swallowed error leaves you uninformed; in an LLM agent it can leave you confidently misinformed, which is strictly worse.
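One plausible guard against fail-plausible behavior is to keep failed tool results out of the model's hands entirely. The following is a minimal sketch under stated assumptions: `ToolResult`, `answer_user`, and `fake_narrate` are hypothetical names, the stub stands in for a real LLM call, and nothing here is claimed to be the paper's own mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToolResult:
    tool: str
    ok: bool
    payload: Optional[str] = None   # set only when ok is True
    error: Optional[str] = None     # set only when ok is False


def answer_user(result: ToolResult, narrate: Callable[[str], str]) -> str:
    """Only successful payloads reach the model; failures bypass it."""
    if not result.ok:
        # Fixed template, no model call: nothing fluent can be built on the error.
        return f"[FAILED] tool={result.tool} error={result.error}"
    return narrate(result.payload or "")


calls: List[str] = []


def fake_narrate(text: str) -> str:
    # Hypothetical stand-in for an LLM call; records every invocation.
    calls.append(text)
    return f"Summary: {text}"


good = answer_user(ToolResult("weather", ok=True, payload="12C, rain"), fake_narrate)
bad = answer_user(ToolResult("weather", ok=False, error="HTTP 503"), fake_narrate)
```

The design choice is that the failure path never invokes the narrator, so the user sees the raw error in a fixed shape instead of prose generated around it.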

Three findings that should unsettle anyone shipping agents

The paper's three headline findings are pointed. First, about 70% of the silent failures were caught by human user-view observation — a person looking at the output and noticing it was wrong — rather than by tests or audits. For a system carrying thousands of automated checks, that is a humbling result: the last line of defense was a human reading the answer.

Second, a retrospective audit of 15 incidents found 0% ex-ante prevention but 87% regression blocking. In other words, the audits prevented exactly none of the incidents before they happened, but once an incident was understood, audits were highly effective at stopping it from recurring. The author's framing is memorable: audits are regression engines, not prediction engines. They are good at remembering past failures and bad at anticipating novel ones — a distinction that should reshape how teams reason about what their test suites actually buy them.

Third, incident latency ranged enormously, from 13 hours to 60 days, and that latency tracked the failure mechanism rather than code complexity. The longest-lived failures sat in what the author calls the seams between components: the integration boundaries that no single test exercises, because each component, checked on its own, is correct. This is the structural lesson of the whole paper: the hazard is not in the modules, which are well tested, but in the spaces between them, which are not.
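A seam failure can be sketched with two components that each behave as specified yet disagree about the shape of the data passed between them. Everything in this example is an assumption for illustration: the component names, the key names, and the `check_seam` helper are hypothetical and do not describe the runtime in the paper.

```python
# Hypothetical components; each would pass its own unit tests in isolation.

def memory_lookup(query):
    """Producer: returns matches under the key 'hits'."""
    return {"hits": [f"note about {query}"]}


def build_digest(lookup_result):
    """Consumer: reads 'results' and treats a missing key as 'nothing found'."""
    items = lookup_result.get("results", [])
    return "No notes found." if not items else "; ".join(items)


def check_seam(producer_output, required_keys):
    """Seam check: hold the real producer output to the consumer's contract."""
    missing = sorted(required_keys - set(producer_output))
    if missing:
        raise ValueError(f"seam contract violated, missing keys: {missing}")


produced = memory_lookup("taxes")
digest = build_digest(produced)   # silently wrong: the real hit is dropped

try:
    check_seam(produced, {"results"})
    seam_error = None
except ValueError as exc:
    seam_error = str(exc)
```

Here the digest reads "No notes found." even though a note exists, and only a check that runs the real output of one component against the expectations of the other exposes the mismatch.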

Loud, attributable, and boring

The constructive half of the paper distills design principles for building agent systems whose failures are, in the author's phrase, loud, attributable, and boring, the opposite of silent, anonymous, and dramatic. The aspiration is that an agent should fail in a way that surfaces a clear, human-actionable signal, points unambiguously at its cause, and does so as a routine, mundane event rather than as a narrative the user has to disbelieve.
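As a rough sketch of what "loud, attributable, and boring" could look like in practice, a failure might be reported as a fixed-shape record rather than free text. The field names, the job and component labels, and the `render` helper below are all hypothetical; the paper's actual principles may be implemented quite differently.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FailureReport:
    job: str          # which scheduled job failed (attributable)
    component: str    # where in the pipeline it failed (attributable)
    cause: str        # raw error text, never paraphrased
    action: str       # what a human should do next (actionable)


def render(report: FailureReport) -> str:
    """Boring on purpose: the same fixed shape every time, no generated prose."""
    return "AGENT FAILURE " + json.dumps(asdict(report), sort_keys=True)


# Example values are invented for illustration.
report = FailureReport(
    job="morning-briefing",
    component="calendar-tool",
    cause="TimeoutError: upstream timed out after 30s",
    action="check calendar provider credentials",
)
message = render(report)
```

The constant prefix makes the message easy to alert on, and the required fields mean a report cannot be created without naming where the failure happened.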

The honest caveats are ones the paper itself foregrounds. This is a single-system, eight-week longitudinal study — an n-of-one in the formal sense — and the specific counts (22 incidents, 28 manifestations, the 70% and 87% figures) describe that one runtime, not a population of agent systems. The value is not statistical generalization but the mechanism-oriented vocabulary it provides: fail-plausible, error swallowing and dilution, the audit-as-regression-engine distinction, the seams-between-components hazard. As more organizations move LLM agents from demo to durable production — scheduling jobs, calling tools, maintaining memory, pushing results to people who act on them — that vocabulary is the kind of thing that lets a field learn from incidents instead of repeating them. The paper's most quotable claim deserves to travel: the most dangerous failure in an agent system is not the one that stops it, but the one it narrates around.