Across the portfolio, the inference-efficiency story has a clear center of gravity and some revealing thin spots. The granted patents cluster around two mechanisms. The first is sparse activation: Google's mixture-of-experts (MoE) family, anchored by the foundational US10719761B2 and extended through the 2026 sparse, differentiable variant US12518135B2, activates only part of a model per token. The second is adaptive compute: Microsoft's US12547872B2 routes inputs by perplexity, spending more computation only on the hard ones. These are two granted approaches to the same goal: cutting the cost of inference, the workload that never stops.
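For readers who want the two levers in concrete form, here is a minimal Python sketch. It is a generic textbook-style illustration, not the claimed method of any patent named above. The function names, the top-2 choice, and the perplexity threshold of 20 are all assumptions made for the example.

```python
import math


def top_k_gate(gate_logits, k=2):
    """Sparse activation (illustrative): pick the k highest-scoring experts.

    Returns (expert_index, weight) pairs; only these experts would run
    for the current token. k=2 is an assumed value, not from any patent.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def perplexity(token_logprobs):
    """Perplexity is exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def route_by_perplexity(token_logprobs, threshold=20.0):
    """Adaptive compute (illustrative): easy inputs go to a cheaper path.

    The threshold and the "small"/"large" labels are hypothetical.
    """
    return "small" if perplexity(token_logprobs) <= threshold else "large"
```

In this sketch the first function answers "which parts of the model run" and the last answers "how much compute this input gets", which is the distinction the two patent families are described as drawing.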
The granted cluster is dense at its core, so the whitespace sits at its margins, and the application record shows entrants being pushed there. Intel's published US20260162278A1, "Image Token Pruning for Multimodal Foundation Models" (published June 11, 2026; CPC G06V 10/764), targets a different efficiency lever, pruning visual tokens before they reach the model, rather than re-claiming routing. That is the shape of whitespace-seeking behavior: when the core routing mechanisms are spoken for by grants, a later filer attacks an adjacent, less-crowded part of the inference budget.
The honest framing here is that "whitespace" is a hypothesis about where rights are thin, not a guarantee that an area is open. Token pruning, on-device inference, and domain-specific routing look less crowded with granted claims than core MoE and perplexity-routing do — but much of that apparent openness is simply that the relevant filings are still published applications, not yet examined to grant. Thin-on-grants today can become dense-on-grants in two years as those applications mature.
Two analyst caveats apply with full force. First, the tallies behind this picture include published applications alongside grants, and applications are not granted rights: an A1 publication signals interest in an area, not ownership of it. Second, the whole picture is sensitive to filing dates: inference efficiency is among the hottest areas in AI IP, so the cluster and its edges are moving targets. Any whitespace called today should be rechecked against next quarter's publications.
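The first caveat can be applied mechanically, because the kind code at the end of a US publication number separates pre-grant publications from grants. The Python sketch below tallies the two separately. It assumes the post-2001 US convention, where A-series codes mark application publications and B-series codes mark granted patents. The helper names are hypothetical, and non-US or older numbers are deliberately left as "unknown".

```python
import re
from collections import Counter

# Country code, serial digits, then a kind code such as A1 or B2.
PUB_RE = re.compile(r"^(?P<country>[A-Z]{2})(?P<number>\d+)(?P<kind>[A-Z]\d?)$")


def status_from_kind(pub_number):
    """Classify a publication number as 'application', 'grant', or 'unknown'.

    Assumes post-2001 US kind codes; anything else is returned as 'unknown'
    instead of being guessed.
    """
    match = PUB_RE.match(pub_number)
    if not match or match.group("country") != "US":
        return "unknown"
    kind = match.group("kind")
    if kind.startswith("A"):
        return "application"
    if kind.startswith("B"):
        return "grant"
    return "unknown"


def tally(pub_numbers):
    """Count grants and applications separately so they are never summed."""
    return Counter(status_from_kind(p) for p in pub_numbers)
```

Run over the four publication numbers cited in this section, `tally` reports three grants and one application, which keeps the granted core and the not-yet-examined edge visibly apart.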
Still, the structural read is useful for anyone tracking where the field is heading. The granted IP concentrates on the two big per-query levers: which parts of the model to run (MoE) and how much to run on a given input (perplexity routing). The activity at the edges (token pruning, on-device inference, and domain-specific routing) is where the not-yet-granted applications are accumulating. Watch those edges: that is where the next wave of inference-efficiency grants will likely issue, and where today's whitespace will either fill in or prove durable.