Here's what published, with the standing caveat that published is not granted. Application US20250390703A1, "Optimizing Key Value Cache for Large Language Model Inference," published December 25, 2025, lists inventors Bowen Liang, Noam Mordechai Shazeer, and Myle Ott; Shazeer is a foundational name in transformer and attention research. The single CPC code is G06N 3/043.

The mechanism targets one of the dominant costs of LLM serving. When a transformer generates text token by token, it caches the key and value vectors it has already computed for prior tokens (the KV-cache) so it doesn't recompute them at each step. That cache grows linearly with sequence length, and with the number of concurrent requests, so at long contexts it can dominate memory and memory bandwidth and become the bottleneck that limits context length and throughput. As its title indicates, the application is directed at optimizing that cache: managing or shrinking it so that inference is cheaper or longer contexts fit.
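To put rough numbers on that growth, here is a back-of-the-envelope sketch of the cache size for a standard decoder-only transformer. It illustrates the baseline problem only, not the method the application claims. The function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are assumptions chosen for illustration, not figures from the filing.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a plain decoder-only transformer.

    Assumes an unoptimized cache: every layer stores one key vector and one
    value vector per token, per KV head, with no compression or eviction.
    """
    # Factor of 2 covers keys plus values.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * seq_len * batch_size * bytes_per_value


# Hypothetical model configuration (an assumption, not taken from the filing).
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (1_024, 8_192, 65_536):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.2f} GiB per sequence")
```

Under these assumptions each token costs 0.5 MiB of cache, so a single 65,536-token sequence needs about 32 GiB before counting model weights. Any technique that reduces the number of cached heads, the bytes per value, or the number of tokens retained lowers that figure proportionally, which is why this surface attracts so much attention.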

The KV-cache is one of the most contested optimization surfaces in the field right now, because whoever serves long-context models cheaply has a real cost advantage. The inventor list ties this to top-tier attention research, which makes it a notable filing regardless of its eventual scope. Several other 2025 records in the same area — from Intel, HPE, and others — confirm the KV-cache is where the inference-efficiency fight has concentrated.

Because this is a publication, the discipline is strict: the claims as filed reflect what the applicant seeks, while the allowed claims, if any issue, set the enforceable boundary, and they may be narrower than what was filed. There is nothing to assert today.

The takeaway: US20250390703A1 is a high-profile published marker on the KV-cache bottleneck — the memory problem at the center of LLM serving economics — with a marquee inventor, and exactly the kind of filing where waiting for the allowed claims is the responsible move.