Most discussion of offline reinforcement learning assumes the logged data is complete: every action taken in the historical record has a reward attached to it. A preprint posted to arXiv on June 18, 2026 by Ziheng Wei, Annie Qu, and Rui Miao starts from the opposite, more realistic premise: in domains like health care and marketing, the rewards in batch data are frequently unobserved, and not for innocent reasons. That premise leads to the paper's central problem: when rewards go missing in a way that depends on what the reward would have been, any policy value estimated from that data is biased in a way no amount of conditioning on the recorded states and actions can fix.
The technical name for this is “missing not at random,” or MNAR. The authors are precise about why it is harder than ordinary missing data. Standard off-policy evaluation leans on an ignorability assumption — roughly, that once you know the state and action, whether the reward was recorded is unrelated to its value. MNAR breaks that assumption. Records get dropped because of sparse or irregular record-keeping, or because the reward is censored beyond certain values, and that selection mechanism is entangled with the quantity being estimated.
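The selection effect described above is easy to reproduce in a toy simulation. The sketch below is illustrative only and is not taken from the paper: it fixes a single state-action pair, draws rewards from a made-up normal distribution, and drops them with a hypothetical logistic propensity that depends on the reward itself. The function name `simulate_mnar` and all numeric constants are assumptions chosen for the example.

```python
import numpy as np

def simulate_mnar(n=200_000, seed=0):
    """Simulate rewards for one fixed (state, action) pair under MNAR missingness."""
    rng = np.random.default_rng(seed)
    # True rewards: mean 1.0, unit variance (hypothetical numbers for illustration).
    reward = rng.normal(loc=1.0, scale=1.0, size=n)
    # Reward-dependent propensity (assumed logistic form):
    # higher rewards are less likely to be recorded.
    p_observed = 1.0 / (1.0 + np.exp(-(0.5 - 1.5 * reward)))
    observed = rng.random(n) < p_observed
    return reward, observed

reward, observed = simulate_mnar()
true_mean = reward.mean()
naive_mean = reward[observed].mean()  # complete-case average over recorded rewards

print(f"true mean reward:      {true_mean:.3f}")
print(f"mean of observed only: {naive_mean:.3f}")
```

Because the state and action are held fixed here, the gap between the two numbers cannot be removed by conditioning on them, which is the sense in which ignorability fails.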
"We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions." (Wei, Qu, and Miao, arXiv:2606.20206)
Follow the construction and you can see how the authors get traction on a problem that, on its face, looks unidentifiable. They formalize a reward-dependent propensity model — an explicit acknowledgment that the probability a reward is observed depends on the reward itself. To recover what is unobservable, they borrow an idea from the missing-data and causal-inference literature: shadow variables. In their setup, future states play that role. The intuition is that the trajectory that unfolds after a step carries information about the reward at that step, so the downstream states can stand in as a proxy that lets the full-data conditional mean reward be identified.
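For readers who want the idea in symbols, the block below writes out the standard shadow-variable conditions from the missing-data literature, with the future state in the shadow role. The notation is an assumption made here for illustration ($R$ for the reward, $O$ for the indicator that it was recorded, $S'$ for a future state); the paper's exact assumptions and notation may differ.

```latex
\begin{aligned}
&\text{(reward-dependent propensity)} && \pi(s, a, r) = \Pr(O = 1 \mid S = s,\ A = a,\ R = r), \\
&\text{(relevance)} && S' \not\perp R \mid (S, A, O = 1), \\
&\text{(exclusion)} && S' \perp O \mid (S, A, R).
\end{aligned}
```

Read informally: the future state must carry information about the reward, but once the reward is known, the future state must say nothing more about whether that reward was recorded.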
The bridge function and a min-max trick
The second move is where the paper tries to sidestep a modeling burden. Rather than estimate the MNAR missingness mechanism directly — which would require getting the selection model right — the authors introduce what they call a bridge function. As they describe it, the bridge function recovers the conditional mean reward “without explicitly modeling the MNAR mechanism.” That is the kind of structural choice that matters in practice: every component you have to specify is a component you can misspecify, and avoiding a direct model of the missingness removes one source of error.
Estimating that bridge function is itself nontrivial, and the authors report doing it via a min-max procedure, which they say is chosen to avoid double sampling. Double sampling, the need for two independent draws of a future quantity from the same state to form an unbiased estimate, is a recurring obstacle in this corner of reinforcement learning theory, because real logged data rarely gives you two parallel futures from one point. Framing the estimation as a saddle-point problem is a standard way to dodge that requirement.
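To see why a saddle-point formulation helps, consider a generic conditional moment restriction E[Y - h(X) | Z] = 0. Minimizing the squared conditional mean directly would require two independent draws given the same Z, but the identity u²/2 = max over f of (u·f - f²/2) turns it into a min-max problem that needs only single samples. The sketch below solves that problem in closed form for linear function classes. It is a generic illustration, not the paper's estimator: the function `minimax_linear`, the feature maps, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def minimax_linear(psi_x, phi_z, y):
    """Closed-form solution of
        min_theta max_w  E[(phi w) * (y - psi theta)] - 0.5 * E[(phi w)^2]
    for linear h(x) = psi(x) @ theta and linear critic f(z) = phi(z) @ w.
    """
    n = len(y)
    A = phi_z.T @ psi_x / n   # empirical E[phi(Z) psi(X)^T]
    b = phi_z.T @ y / n       # empirical E[phi(Z) Y]
    C = phi_z.T @ phi_z / n   # empirical E[phi(Z) phi(Z)^T]
    # The inner max over w equals 0.5 * m^T C^{-1} m with m = b - A theta,
    # so the outer min is a weighted least-squares problem in theta.
    CinvA = np.linalg.solve(C, A)
    Cinvb = np.linalg.solve(C, b)
    return np.linalg.solve(A.T @ CinvA, A.T @ Cinvb)

# Toy data: X is confounded by U, so regressing Y on X is biased,
# while E[Y - 2X | Z] = 0 still holds.
rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u + 0.5 * rng.normal(size=n)
y = 2.0 * x - 1.5 * u + 0.5 * rng.normal(size=n)

psi = np.column_stack([np.ones(n), x])        # features for h
phi = np.column_stack([np.ones(n), z, z**2])  # features for the critic f

theta_minimax = minimax_linear(psi, phi, y)
theta_ols = np.linalg.lstsq(psi, y, rcond=None)[0]
print("min-max slope:", theta_minimax[1])
print("OLS slope:    ", theta_ols[1])
```

With richer function classes the closed form disappears and the two players are typically trained by alternating gradient steps, but the single-sample structure of the objective is the same.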
Fitted-Q-Evaluation and a clinical test
With identification settled, the authors build an estimator in the Fitted-Q-Evaluation style — the workhorse approach to off-policy evaluation, which iteratively fits the value of a policy backward through the horizon. Their version propagates the recovered rewards through that procedure. One detail distinguishes their target from a plain value estimate: the paper allows the target policies to depend on past missingness indicators. In other words, the policies being evaluated are explicitly “missingness-aware” — they can condition on whether earlier rewards were recorded, which is a candid concession that in the real systems this is meant for, the pattern of what is missing is itself information an agent might act on.
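The backward recursion behind Fitted-Q-Evaluation can be sketched in a few lines for the tabular case, where the regression step reduces to averaging targets per state-action cell. This is a minimal sketch with fully observed rewards, not the authors' estimator: in their setting the reward term would presumably be replaced by the recovered conditional mean, and the policy could also condition on missingness indicators. The function name, the tuple layout of `transitions`, and the toy MDP are assumptions made for illustration.

```python
import numpy as np

def fitted_q_evaluation(transitions, policy, horizon, n_states, n_actions):
    """Tabular FQE for a finite-horizon MDP.

    transitions: iterable of (t, s, a, r, s_next) tuples from logged data.
    policy: array [horizon, n_states, n_actions] of target-policy probabilities.
    Returns Q with shape [horizon + 1, n_states, n_actions]; Q[horizon] is zero.
    """
    q = np.zeros((horizon + 1, n_states, n_actions))
    for t in reversed(range(horizon)):
        target_sum = np.zeros((n_states, n_actions))
        count = np.zeros((n_states, n_actions))
        for (step, s, a, r, s_next) in transitions:
            if step != t:
                continue
            # Bootstrapped target: reward plus the target policy's value at t + 1.
            next_value = 0.0
            if t + 1 < horizon:
                next_value = policy[t + 1, s_next] @ q[t + 1, s_next]
            target_sum[s, a] += r + next_value
            count[s, a] += 1
        # Tabular regression: average the targets in each visited (s, a) cell.
        visited = count > 0
        q[t][visited] = target_sum[visited] / count[visited]
    return q

# Toy deterministic MDP (hypothetical): next state equals the action,
# and the reward is 1 when the action matches the current state.
H, S, A = 2, 2, 2
transitions = [(t, s, a, float(a == s), a)
               for t in range(H) for s in range(S) for a in range(A)]

# Target policy: match the current state with probability 0.75.
policy = np.zeros((H, S, A))
for s in range(S):
    policy[:, s, s] = 0.75
    policy[:, s, 1 - s] = 0.25

q = fitted_q_evaluation(transitions, policy, H, S, A)
value_from_state_0 = policy[0, 0] @ q[0, 0]
print("estimated value starting from state 0:", value_from_state_0)
```

In this toy case the estimate can be checked by hand: the last-step value is 0.75 in either state, so the value from state 0 is 0.75 × 1.75 + 0.25 × 0.75 = 1.5.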
On the theory side, the authors state that they establish consistency and finite-sample error bounds for the estimator. Consistency is the assurance that the estimate converges to the truth as data grows; finite-sample bounds quantify how far off it can be at a given sample size. Both are the load-bearing guarantees a method like this needs before anyone would trust it to evaluate a policy that cannot be tried live.
The empirical claim is anchored in two settings. The authors report experiments on simulated data and on MIMIC-III Sepsis data, the latter a widely used de-identified critical-care dataset that has become a common proving ground for offline reinforcement learning in medicine precisely because deploying an untested treatment policy on real patients is not an option. The paper states that the method shows strong performance compared with existing methods on both. As with any preprint, those comparisons await independent replication, and the result is a method-and-bounds contribution rather than a clinical recommendation; the authors do not claim a deployable treatment policy, only an evaluation procedure that holds up under a missingness pattern earlier methods assume away.
What ties the pieces into one story is the recurring theme that the hard part of offline reinforcement learning is often the data, not the algorithm. The preprint, listed under cs.LG and dated June 18, 2026, threads a reward-dependent propensity model, shadow variables, a bridge function, and a Fitted-Q-Evaluation estimator into a single pipeline aimed at one assumption — ignorability — that the authors argue does not hold in the settings where off-policy evaluation is most needed.