An agent's reasoning is the most ephemeral thing in modern software. The chain of thought that produced an answer evaporates with the context window; the search branches it considered and pruned leave no record; the memory buffer it consulted cannot be diffed, merged, or audited after the fact. Every other complex software artifact (source code, infrastructure, datasets, experiments) is version-controlled. Reasoning is not. A new arXiv paper, “GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge” (2606.14470) by Pavan C Shekar, Abhishek H S, and Aswanth Krishnan, proposes to close that gap with a deceptively simple idea, and then has the discipline to ask whether the idea actually helps.
Reasoning as a git repository
The mechanism is elegant. GitOfThoughts stores an agent's reasoning tree as a git repository, mapping the machinery of reasoning onto the machinery of version control. Every scored thought becomes a commit. The scores attached to those thoughts are stored as git notes. Outcomes are recorded as tags. And retrieval over the agent's own history becomes, literally, a git log. The payoff is that reasoning inherits everything git already gives software: it becomes replayable, auditable, and mergeable across agents — and, the authors stress, at near-zero engineering cost, because none of this requires inventing new infrastructure. The plumbing already exists; the contribution is recognizing that an agent's thought process fits it.
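To make the mapping concrete, here is a minimal sketch of how each reasoning event could translate into an ordinary git command. It is not the authors' implementation: the function names, the `scores` notes namespace, the `score=` note format, and the keyword-based retrieval are all assumptions chosen for illustration. Only the git subcommands themselves (`commit`, `notes`, `tag`, `log`) are standard.

```python
import subprocess
from typing import List

# Hypothetical notes namespace; the paper's actual layout is not given in this article.
NOTES_REF = "scores"


def commit_thought_cmd(thought: str) -> List[str]:
    """A thought becomes a commit. --allow-empty lets the message carry the content."""
    return ["git", "commit", "--allow-empty", "-m", thought]


def score_note_cmd(score: float, commit: str = "HEAD") -> List[str]:
    """The thought's score is attached to its commit as a git note."""
    return ["git", "notes", f"--ref={NOTES_REF}", "add",
            "-m", f"score={score:.2f}", commit]


def outcome_tag_cmd(tag_name: str, commit: str = "HEAD") -> List[str]:
    """An outcome is recorded as a tag pointing at the final thought."""
    return ["git", "tag", tag_name, commit]


def retrieve_cmd(keyword: str, limit: int = 5) -> List[str]:
    """Retrieval over past reasoning is a filtered git log (%N prints the note)."""
    return ["git", "log", f"--grep={keyword}", f"--max-count={limit}",
            f"--notes={NOTES_REF}", "--format=%H %s %N"]


def run(cmd: List[str], repo: str) -> str:
    """Execute one command inside an existing repository (needs git on PATH)."""
    result = subprocess.run(cmd, cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Under these assumptions, recording one scored thought is two calls, `run(commit_thought_cmd("try induction on n"), repo)` followed by `run(score_note_cmd(0.8), repo)`, and everything else (history, blame, branching) comes from git unchanged.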
That alone would be a tidy systems paper. What makes this one notable is that the authors refuse to stop there. Having built a substrate for storing reasoning and memory, they ask the harder and more uncomfortable question: does memory, in any substrate at all, actually improve accuracy?
The experiment, and the answer nobody wants
To answer it, the paper runs a deliberately rigorous comparison across five memory substrates — none, markdown, vector, graph, and git — spanning two benchmarks, two model scales, and pre-registered replications. Pre-registration is the key methodological commitment: by fixing the analysis plan before seeing the results, the authors guard against the all-too-common pattern of a memory technique that looks great until someone tries to reproduce it. And that guard earns its keep, because the headline finding is a negative one. For novel problems, the answer is no. No memory format reliably helps. The authors are candid that a promising early result collapsed under its own pre-registered replication — exactly the kind of outcome that usually never makes it into a paper.
The negative result is not the whole story, though; the paper locates precisely where memory does pay off, and the boundary is sharp. Memory helps only above what the authors call the copyability threshold: when the retrieved case is a near-duplicate of the current problem, with similarity above roughly 0.8, accuracy jumps sharply. Below that threshold, nothing. And the interpretation of even this gain is deflationary in the most useful way. The benefit is answer retrieval, not method transfer. The system is not learning a reusable technique from a worked example; it is, in effect, looking up the answer to a problem it has essentially already seen. The authors sharpen the point with a striking observation: a model 4.5 times larger doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. Scale buys you better matching, not better generalization from memory.
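The copyability threshold amounts to a simple gate on retrieval, sketched below. The article does not say how similarity is measured, so cosine similarity over embedding vectors is an assumption here, as are the function names and the exact cutoff of 0.8 (the source says only "roughly 0.8").

```python
import math
from typing import Optional, Sequence

# "Roughly 0.8" per the article; the exact value and metric are assumptions.
COPYABILITY_THRESHOLD = 0.8


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_from_memory(
    query_vec: Sequence[float],
    case_vec: Sequence[float],
    case_answer: str,
    threshold: float = COPYABILITY_THRESHOLD,
) -> Optional[str]:
    """Reuse a stored answer only when the stored case is a near-duplicate.

    Returns None below the threshold, signalling that the caller should
    solve the problem from scratch rather than lean on memory.
    """
    if cosine_similarity(query_vec, case_vec) > threshold:
        return case_answer
    return None
```

The gate makes the paper's deflationary reading explicit: what crosses the threshold is the stored answer itself, not a method extracted from it.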
The only general-purpose lever the paper finds for improving accuracy is not memory at all — it is test-time sampling, drawing more candidate solutions at inference. That is a quietly important conclusion for anyone building agent systems on the assumption that bolting on a vector database of past reasoning will make the agent smarter on new problems. According to these experiments, it will not, unless the new problem is nearly a copy of an old one. The implicit hope behind agent memory — that an agent could distill a method from one solved problem and apply it to a structurally different one — is precisely the thing the data refuses to support. What the agent can do is recognize when it has seen this problem before; what it cannot do, in these tests, is learn how to solve a new one by studying an old one.
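For readers unfamiliar with test-time sampling, the sketch below shows one common form: draw several candidate answers and keep the most frequent one. The article does not specify how the paper aggregates its samples, so majority voting is an assumption, and `solve` is a hypothetical stand-in for a stochastic model call.

```python
from collections import Counter
from typing import Callable


def sample_and_vote(solve: Callable[[str], str], problem: str,
                    n_samples: int = 5) -> str:
    """Draw n_samples candidate answers and return the most common one.

    `solve` is assumed to be non-deterministic (for example, a model sampled
    at non-zero temperature), so repeated calls can return different answers.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    candidates = [solve(problem) for _ in range(n_samples)]
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```

Note that nothing here consults memory: the only input is the problem itself, and the only cost is more inference.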
Why version control still wins
Given a negative result on accuracy, why build GitOfThoughts at all? The authors answer directly: the case for git-as-substrate is auditability, provenance, and mergeability, all at accuracy parity. In other words, if no memory format reliably improves accuracy on novel problems, then the right basis for choosing a substrate is no longer the accuracy it buys, because they all buy roughly the same; it is the operational properties the substrate provides. Git wins on those terms: you can replay exactly how an agent reasoned, attribute each step to a scored commit, diff one reasoning run against another, and merge the histories of multiple agents. For debugging, governance, and trust, the things that matter most when an agent runs in production, those properties are valuable independent of any accuracy bump.
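The replay, diff, and merge operations named above correspond to stock git commands, as the following sketch suggests. It assumes, purely for illustration, that each reasoning run lives on its own branch with one commit per thought; the function names and branch layout are hypothetical and not taken from the paper.

```python
from typing import List


def replay_cmd(run_branch: str) -> List[str]:
    """Replay a run: list its thoughts oldest-first."""
    return ["git", "log", "--reverse", "--format=%h %s", run_branch]


def diff_runs_cmd(run_a: str, run_b: str) -> List[str]:
    """Diff two runs: show thoughts unique to each side.

    %m prints '<' for commits only in run_a and '>' for commits only in run_b.
    """
    return ["git", "log", "--left-right", "--format=%m %h %s",
            f"{run_a}...{run_b}"]


def merge_runs_cmd(other_branch: str) -> List[str]:
    """Merge another agent's history into the current branch, keeping a merge commit."""
    return ["git", "merge", "--no-ff",
            "-m", f"Merge reasoning from {other_branch}", other_branch]
```

Each function only builds an argument list, so the commands can be inspected or logged before being passed to something like `subprocess.run` inside a real repository.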
What gives this paper unusual credibility is its evaluation hygiene, which the authors foreground as the point. They explicitly document a retracted result and a refuted hypothesis, presenting that record as a model of the evaluation standard they hold themselves to. In a subfield where agent-memory techniques are announced almost weekly, often on a single benchmark with no replication, a paper that pre-registers its tests, watches its own promising result collapse, and reports that collapse anyway is doing something the field badly needs. The honest reading of GitOfThoughts is therefore two findings braided together: a clean, low-cost mechanism for making agent reasoning version-controlled, and a hard-won, well-controlled piece of negative evidence about what memory can and cannot do. Builders should take both: adopt the auditability, and drop the assumption that memory alone makes an agent generalize.