Reasoning models have settled into a comfortable default: read the whole question, think over a fixed context, then answer. That read-then-think paradigm is fine for a static math problem, but it breaks down the moment the input is a live stream. Audio and video do not arrive all at once; they unspool in time, and a model that insists on waiting for the final token before reasoning is structurally late. A new arXiv paper, “AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization” (2606.14694), takes aim at exactly this mismatch, proposing a framework that lets a model reason while the stream is still flowing and then deliberate once it ends.
The authors, a group led by Junlong Tong, frame the problem cleanly. Many real-world scenarios are inherently dynamic: information arrives continuously, and the model must reason, update its understanding, and respond under partial observation. Recent streaming-reasoning methods already let models “think while reading,” which is the right instinct, but the paper notes a limitation in how those methods are trained. They largely rely on supervised imitation of pre-constructed reasoning trajectories — the model is shown example traces of when to think during a stream and learns to copy them. That works, but it bakes in the trajectory designer's choices and limits the model's flexibility to discover its own policy for allocating thought.
Two phases, learned rather than scripted
AdaSR's central design move is to treat reasoning as a two-stage, hierarchical process. During the stream, the model performs incremental reasoning over the partial input it has seen so far. Once the stream completes, it shifts into a final deliberation phase — the deeper, more deliberate reasoning that benefits from having the whole picture. The framework is meant to learn three things at once: when to think during the stream, how much computation to spend at each stage, and how to balance early, partial reasoning against late, complete reasoning. Crucially, these are learned behaviors rather than scripted ones, which is what separates AdaSR from the imitation-based predecessors it positions itself against.
Making that work requires a training objective that can reward the two phases differently, and this is where the paper's technical contribution sits. The authors introduce Hierarchical Relative Policy Optimization, or HRPO. Standard policy-optimization methods used for reasoning models tend to compute a single advantage at the level of the whole output sequence and then spread it uniformly across every token. AdaSR argues that this is too blunt for a process with distinct phases: the tokens spent reasoning mid-stream and the tokens spent in final deliberation are doing different jobs, and they should not all share one undifferentiated credit signal. HRPO instead decomposes policy optimization into the streaming-reasoning phase and the deep-reasoning phase, providing what the paper calls more fine-grained advantage assignment. In plainer terms, the model gets more targeted feedback about which kind of thinking, at which stage, actually paid off.
Rewarding the right kind of effort
HRPO is steered by a composite reward with three explicit components. A format reward enforces valid reasoning protocols — the model has to structure its streaming and deliberation phases correctly rather than collapsing them. An accuracy reward preserves final task performance, the non-negotiable anchor that keeps the model from gaming latency at the expense of correctness. And an adaptive thinking reward encourages latency-aware computation allocation: the model is nudged to think when thinking helps and to avoid burning tokens when it does not. That third term is the heart of the “adaptive” in AdaSR. The framework is not trying to make the model think more or less in the abstract; it is trying to make the model think at the right moments and in the right amounts, given that the input is still arriving and that latency has a cost.
This is a genuinely different optimization target from the one most reasoning research pursues. The dominant trend has been to lengthen chains of thought in pursuit of accuracy, accepting whatever compute and latency that implies. AdaSR treats latency as a first-class objective sitting alongside accuracy, and treats the timing of computation — not just its quantity — as something the model should learn to control. For streaming settings, where a response that arrives too late may be useless regardless of correctness, that reframing is the right one.
What the experiments show, and what to watch
The paper reports that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than a supervised fine-tuning baseline. The careful word there is “balance.” This is not a claim that AdaSR is simply more accurate; it is a claim about the trade-off frontier — that for a given accuracy you can get lower latency and less wasted computation, or for a given latency budget you can hold accuracy steadier. That is the honest and appropriate way to evaluate a method whose entire premise is managing a three-way tension rather than maximizing one axis. The authors release their code, which is the right gesture for a reinforcement-learning method whose reward shaping and phase decomposition will need independent scrutiny before the community can judge how broadly the gains hold.
A few caveats are worth stating plainly. The comparison is against a supervised fine-tuning baseline, so the relevant question for practitioners is how AdaSR stacks up against the strongest streaming-reasoning systems, not just against imitation learning. Reward-shaped reinforcement learning is also notoriously sensitive: a three-term reward balancing format, accuracy, and adaptive thinking has knobs that can be tuned to flatter a benchmark, and the durability of the result depends on how robust that balance proves across domains beyond those tested. None of this undercuts the contribution; it simply frames it. The lasting idea in AdaSR is not a single benchmark number but a thesis — that a reasoning model operating on a live stream should learn, as part of its policy, when to think and how much, and that the training objective should reward the streaming and deliberation phases as the distinct activities they are. As models move out of static-prompt benchmarks and into real-time audio and video, that thesis is likely to outlast the specific implementation, and HRPO's phase-aware advantage assignment is a concrete, reusable mechanism for putting it into practice.