Every digital product sits on a mountain of interaction logs: time-stamped records of clicks, page loads, keystrokes, and tool invocations. These logs are objective — they record what people actually did, not what they say they did — which makes them the most honest signal a product team has about real usage. The problem is that they are also nearly unreadable. At the granularity logs are captured, the meaningful story of someone's work is buried under noise, and the leap from a sequence of low-level events to “this person was researching a purchase” or “this student is disengaging” is exactly the leap that has been hard to make reliably. A new arXiv paper, “Abstracting Cross-Domain Action Sequences into Interpretable Workflows” (2606.14654), by Gaurav Verma and Scott Counts, proposes using large language models to make that leap.
The authors frame the gap precisely. Interaction logs record usage objectively, yet their granularity and noise obscure insight into people's work, and that insight is what lets a team improve a product based on real behavior rather than guesswork. Prior research tackled this with deep learning models that cluster raw user actions into higher-level activities. But the paper notes two persistent weaknesses in that approach: it is highly sensitive to noise, and it struggles to generalize across different applications. A clustering model tuned for one app's log format tends not to transfer to another's.
Abstraction instead of clustering
WorkflowView, the framework the paper introduces, takes a different tack. Instead of training a bespoke clustering model per application, it uses an LLM to abstract low-level action sequences into high-level activities directly. The bet is that a language model's general capacity to interpret sequences of described actions — to read a run of events and say what they collectively amount to — will be more robust to noise and far more portable across domains than a purpose-built clusterer. The authors' core contribution is the demonstration that this bet pays off across tasks that look quite different from one another.
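To make the idea of abstraction concrete, here is a minimal sketch of how a run of low-level events might be serialized into a zero-shot prompt for a language model. It is an illustration only, not the paper's implementation: the `LogEvent` fields, the helper names, and the prompt wording are all assumptions, and the call to an actual model is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEvent:
    """Hypothetical log record; real schemas differ per application."""
    timestamp: datetime
    action: str   # e.g. "click", "page_load"
    target: str   # e.g. a URL or UI element name


def render_events(events: list[LogEvent]) -> str:
    """Render events as one readable line each, sorted by timestamp."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return "\n".join(
        f"{e.timestamp.isoformat()} {e.action} {e.target}" for e in ordered
    )


def build_abstraction_prompt(events: list[LogEvent]) -> str:
    """Assemble a zero-shot prompt asking for a high-level activity.

    The instruction wording is an assumption made for illustration.
    """
    return (
        "Below is a time-ordered log of low-level user actions.\n"
        "Describe in one sentence the high-level activity they amount to.\n\n"
        f"{render_events(events)}\n\nActivity:"
    )
```

The point of the sketch is that nothing in it is specific to one application: any log that can be rendered as described actions can be passed through the same prompt, which is where the portability argument comes from.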
They establish that breadth deliberately, evaluating across three distinct and challenging sequential tasks in diverse domains. The first is zero-shot task description reconstruction from browser logs: given raw browsing activity and no task-specific training, can the system recover what the user was actually trying to do? WorkflowView achieves high semantic similarity here, reported as a mean similarity of 0.91 — meaning its reconstructed descriptions land very close to the ground-truth tasks. The zero-shot framing is important: there is no per-domain fitting propping up the result.
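This write-up does not say how the similarity score is computed, so the following is only one common way such a number could be produced: embed each reconstructed description and its ground-truth task as a vector, take the cosine similarity of each pair, and average. The embedding vectors are assumed to come from some sentence-embedding model that is not shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_similarity(pairs: list[tuple[list[float], list[float]]]) -> float:
    """Average cosine similarity over (predicted, ground_truth) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    scores = [cosine_similarity(pred, truth) for pred, truth in pairs]
    return sum(scores) / len(scores)
```

Under this reading, a mean near 1.0 says the reconstructed descriptions point in almost the same direction in embedding space as the true task descriptions.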
The second task is few-shot student dropout prediction using interaction logs from massive open online courses (MOOCs). Predicting which students will drop out is a classic, practically valuable problem, and the striking detail is the data efficiency: WorkflowView reaches a weighted F1 of 0.90 with only five few-shot examples. That an LLM-abstraction approach can reach that level of predictive performance from a handful of examples challenges the assumption that you need large labeled datasets and a custom model to extract behavioral signal from logs.
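For readers unfamiliar with the metric, weighted F1 is the per-class F1 score averaged with each class weighted by how many true examples it has, which matters when dropouts and non-dropouts are unevenly represented. The sketch below computes it from scratch; the label names in the usage note are hypothetical and not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n  # weight each class by its support
    return total / len(y_true)
```

For example, with true labels `["stay", "stay", "stay", "drop"]` and predictions `["stay", "stay", "drop", "drop"]`, the "stay" class scores 0.8 and the "drop" class scores about 0.67, and the larger class dominates the weighted average.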
The third task moves into a privacy-sensitive enterprise setting: anonymized, privacy-preserving analysis of how AI tools are being integrated into document workflows in Microsoft Word. This is the most applied of the three, and the most telling about deployment intent — it is about understanding, in aggregate and without exposing individual users, how people actually fold AI assistance into real document work. That the same abstraction framework spans browser logs, educational platforms, and enterprise productivity software is the generalization claim made concrete.
Why an LLM is the right tool here
The deeper argument is about the nature of the abstraction problem. Turning low-level actions into high-level activities is fundamentally a semantic task — it requires understanding what a sequence of operations means, not just clustering them by statistical similarity. That is squarely in an LLM's wheelhouse and squarely outside the comfort zone of a noise-sensitive clustering model, which has to learn the meaning implicitly from data it may not have enough of. By reframing log interpretation as an abstraction task an LLM performs rather than a clustering task a model is trained for, WorkflowView trades brittleness and per-domain retraining for generality and few-shot or zero-shot operation. The authors conclude that LLM-based abstraction is a robust and efficient path toward transforming low-level behavioral data into high-level, interpretable, and actionable insight.
They are also commendably attentive to the realities of putting this in front of real logging infrastructure. The paper explicitly discusses practical deployment considerations, including computational efficiency and user privacy — the two concerns that most often stand between a promising research result and a shippable feature. Running an LLM over high-volume interaction logs is not free, and logs are among the most sensitive data a company holds; the Word case study's emphasis on anonymized, privacy-preserving analysis signals that these are treated as first-order constraints rather than afterthoughts.
The honest framing is that this is a method paper reporting strong results on three tasks, not a claim that LLM abstraction solves log interpretation universally; the metrics — 0.91 reconstruction similarity, 0.90 dropout F1 — characterize these specific evaluations, and the cost and privacy questions the authors raise are exactly the ones that determine whether the approach scales in production. But the contribution is well chosen and the evidence is cross-domain by design. For anyone sitting on a mountain of interaction logs and frustrated that the signal in them stays locked behind noise and brittle per-app models, WorkflowView offers a genuinely different and more portable key: ask a language model what the actions mean.