A preprint posted to arXiv on June 18, 2026, takes aim at a quiet weakness in how interpretability methods get evaluated: the synthetic datasets used to test them. Authors Aryeh Brill and Tom Ingebretsen Carlson argue that the toy datasets researchers reach for when they want a controlled experiment do not look like real data in one structurally important way, and they propose a replacement drawn from statistical physics.
The motivation is stated plainly in the abstract. Natural data (images, text, audio) carries structure at many scales at once, and neural networks learn features that mirror that hierarchy. A synthetic dataset that is flat, or that has structure only at a single scale, may not exercise an interpretability method the way real data would. The practical risk is that a probing or feature-attribution method gets validated on a toy problem, is declared to work, and then behaves differently when pointed at a model trained on messy hierarchical data. The paper's premise is that closing the gap between the toy and the real requires building the hierarchy into the toy from the start.
"Synthetic datasets used to evaluate interpretability methods typically lack this structure, limiting their value as realistic toy models." (Brill and Carlson, arXiv:2606.20347)
The mechanism the authors choose is critical mean-field percolation. Forget the name for a second — here is what it gives you. Percolation is a model of how clusters form when you randomly connect points; at the “critical” point, the resulting clusters are sparse, low-dimensional, and fractal, and the distribution of cluster sizes follows a power law. Those are precisely the statistical fingerprints — sparsity, self-similarity, heavy-tailed size distributions — that the paper associates with natural data. The authors embed these clusters in a high-dimensional space, then attach a target value to each data point using latent variables that model a taxonomic hierarchy. The result is a dataset where the “answer” for any point is generated by a known, layered set of hidden factors.
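To make the cluster picture concrete, here is a minimal sketch of critical mean-field percolation, assuming the common textbook realization as an Erdős-Rényi random graph with mean degree 1. It is not the paper's generator: the function name `critical_cluster_sizes` is hypothetical, and drawing n/2 random vertex pairs is used as a stand-in for connecting each pair independently with probability 1/n. In this setting the number of clusters of size s is expected to fall off roughly as s^(-5/2), which is the heavy tail described above.

```python
import random
from collections import Counter


def critical_cluster_sizes(n, seed=0):
    """Cluster sizes of a random graph on n vertices near the critical point.

    Assumption: n // 2 uniformly random vertex pairs approximate an
    Erdos-Renyi graph with edge probability 1/n (mean degree 1).
    Self-loops and repeated pairs are harmless: they never merge clusters.
    """
    rng = random.Random(seed)
    parent = list(range(n))  # union-find forest over vertices
    size = [1] * n           # cluster size, valid at union-find roots

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _ in range(n // 2):
        a, b = rng.randrange(n), rng.randrange(n)
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already in the same cluster
        if size[ra] < size[rb]:
            ra, rb = rb, ra  # attach the smaller cluster under the larger
        parent[rb] = ra
        size[ra] += size[rb]

    return sorted((size[r] for r in range(n) if parent[r] == r), reverse=True)


sizes = critical_cluster_sizes(100_000)
size_counts = Counter(sizes)  # maps cluster size -> number of clusters
```

Plotting `size_counts` on log-log axes is one way to eyeball the power law; most clusters are tiny, while a few are large enough to dominate the tail.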
Why analytical tractability matters here
The selling point the authors lean on is that the data model is not just structured but mathematically pinned down. Critical percolation comes with known critical exponents — numbers that characterize its scaling behavior — and the paper states these exponents fix the model's properties without requiring hyperparameter tuning. That distinction is worth dwelling on. Many synthetic benchmarks have knobs the experimenter has to set, and the choice of those knobs can quietly determine the outcome of whatever method is being tested. A data model whose properties are fixed by theory removes that degree of freedom: the ground truth is what the mathematics says it is, not what the researcher dialed in.
The authors also address a practical concern: can you actually generate this data at scale? They report leveraging a mapping between percolation clusters, random trees, and a process called additive coalescence to design what they describe as an almost linear-time algorithm. As they put it, the algorithm jointly samples a random tree and its hierarchical latent decomposition, “enabling data generation at arbitrary scale.” In other words, the same procedure that produces a data point also produces the labeled hierarchy of latent factors behind it — which is exactly the ground truth an interpretability experiment needs to check its work against.
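The link between random trees and additive coalescence can be sketched with a classical construction due to Pitman, offered here as an illustration and not as the authors' algorithm. Start with n single-vertex trees; at each step pick a vertex uniformly at random and hang the root of a uniformly chosen other tree beneath it. Two trees of sizes a and b then merge with probability proportional to a + b, which is the additive kernel, and the recorded merge order gives one simple hierarchical decomposition of the final tree. The function name and the format of `merges` are assumptions made for this example; the paper's latent decomposition may be defined differently.

```python
import random


def sample_tree_with_merges(n, seed=0):
    """Grow a random rooted tree on n vertices by additive-style merging.

    Returns (parent, merges). parent[v] is the parent of v, or None for
    the final root. merges lists (absorbing_root, absorbed_root) pairs in
    the order the trees were joined.
    """
    rng = random.Random(seed)
    parent = [None] * n
    uf = list(range(n))       # union-find over vertices
    root_of = list(range(n))  # union-find representative -> tree root
    roots = list(range(n))    # roots of the trees in the current forest
    pos = list(range(n))      # pos[r] = index of root r inside `roots`

    def find(x):
        while uf[x] != x:
            uf[x] = uf[uf[x]]  # path halving
            x = uf[x]
        return x

    merges = []
    while len(roots) > 1:
        v = rng.randrange(n)  # uniformly random vertex
        rep_v = find(v)
        own = root_of[rep_v]  # root of the tree containing v
        # Choose uniformly among the other roots by skipping v's own root.
        j = rng.randrange(len(roots) - 1)
        if j >= pos[own]:
            j += 1
        r = roots[j]
        parent[r] = v  # hang the other tree's root beneath v
        merges.append((own, r))
        # Remove r from the list of roots in O(1) (swap with last, pop).
        last = roots[-1]
        roots[pos[r]] = last
        pos[last] = pos[r]
        roots.pop()
        # Merge the union-find sets; rep_v stays the representative,
        # so root_of[rep_v] is still `own`.
        uf[find(r)] = rep_v
    return parent, merges


parent, merges = sample_tree_with_merges(1000, seed=1)
```

Each of the n - 1 merges does a constant number of list operations plus two union-find lookups, which is the sense in which a construction like this runs in close to linear time. The tree and its merge history come out of the same loop, mirroring the joint sampling the authors describe.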
What the probing experiments showed
The paper's empirical claim is specific and limited, which is appropriate for a testbed proposal. Using probing experiments — the standard technique of training a simple classifier to read a target variable out of a network's internal activations — the authors report that the model's ground-truth latent variables can be linearly decoded from neural network activations. Stated carefully, that means the hidden factors the data was built from leave a linearly accessible trace inside the trained network. That is the kind of result a testbed is meant to enable: because the latent variables are known by construction, a researcher can ask whether a given interpretability method actually recovers them, rather than inferring success from a proxy.
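A linear probe of the kind described here can be sketched in a few lines. The example below fits a ridge-regularized linear readout from activations to a known latent variable and reports R² on held-out points. Everything in it is an assumption made for illustration: the function name `linear_probe_r2`, the ridge penalty, and above all the activations, which are fabricated by embedding a latent along a random direction with added noise. They are not taken from the paper or from any trained network.

```python
import numpy as np


def linear_probe_r2(activations, latent, train_frac=0.8, ridge=1e-3, seed=0):
    """Fit a linear readout of `latent` from `activations`; return test R^2."""
    rng = np.random.default_rng(seed)
    n = activations.shape[0]
    order = rng.permutation(n)
    cut = int(train_frac * n)
    train, test = order[:cut], order[cut:]

    # Append a constant column so the probe can learn an offset.
    X = np.hstack([activations, np.ones((n, 1))])
    X_train, y_train = X[train], latent[train]

    # Ridge solution (for simplicity the penalty also covers the offset).
    gram = X_train.T @ X_train + ridge * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, X_train.T @ y_train)

    pred = X[test] @ weights
    ss_res = np.sum((latent[test] - pred) ** 2)
    ss_tot = np.sum((latent[test] - latent[test].mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy stand-in for network activations: one latent, linearly embedded.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)                 # ground-truth latent variable
direction = rng.normal(size=64)           # hypothetical feature direction
acts = np.outer(z, direction) + 0.1 * rng.normal(size=(2000, 64))

score = linear_probe_r2(acts, z)                     # latent is decodable
control = linear_probe_r2(acts, rng.permutation(z))  # shuffled-label baseline
```

The shuffled-label control matters as much as the probe itself: a high `score` is only informative if the same probe fails when the pairing between activations and latents is destroyed.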
It is worth separating what the paper demonstrates from what it does not. The contribution is a data model and a sampling algorithm, plus an existence-style probing result showing the latent structure is decodable. The authors do not claim to have resolved any open interpretability question; they position critical percolation as an instrument for studying such questions. The closing line of the abstract frames the case in terms of four properties working together — sparsity, self-similarity, power-law statistics, and analytical tractability — which the authors say make critical percolation a principled testbed for interpretability research.
For readers tracking the broader interpretability literature, the move here is methodological rather than about a particular model or capability. The field has leaned heavily on toy setups — sparse linear features, small grammars, synthetic shapes — to validate tools like probes, sparse autoencoders, and feature attributions. The argument in this preprint is that borrowing a well-understood critical phenomenon from physics supplies hierarchy and heavy-tailed statistics for free, with the bookkeeping of known exponents attached. Whether the percolation testbed gets adopted will depend on independent groups reproducing the decoding result and finding that methods which pass on percolation data also behave better on real models. The preprint, dated June 18, 2026 and listed under the cs.LG category, sets out the construction and the first probing evidence; the rest is a question for the methods that will be run against it.