Just Issued: Intel Neural-Network Compression Patent

A July 2021 Intel grant compresses neural networks by clustering weights per kernel. What claim 1 covers and why on-device inference matters.

Here's what actually issued. On July 6, 2021, Intel Corporation was granted US11055604B2, "Per kernel Kmeans compression for neural networks," inventors including Yonatan Glesner and Gal Novik. The CPC mix pairs G06N 3/04 (network architecture) with G06F memory codes (G06F 3/0608, 3/0644, 3/0673), which signals a method aimed at storage and memory layout, not just abstract accuracy.

The mechanism is weight clustering. A trained network's weights are numerous and mostly redundant. K-means groups similar weight values into clusters; you then store the cluster centers plus an index per weight instead of every full-precision value. Doing this per kernel — separately for each filter — tunes the compression to local structure rather than applying one codebook globally. The result is a smaller model that fits in less memory and moves fewer bytes.
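As a rough illustration of the idea, and not the patented implementation, the Python sketch below runs plain Lloyd's k-means separately on each kernel's weights. The function names (`kmeans_1d`, `compress_layer`), the evenly spaced initialization, the iteration count, and the random toy layer are all assumptions made for this example; nothing in the patent text summarized here specifies them.

```python
import random

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's k-means on scalar values; returns (centers, indices)."""
    lo, hi = min(values), max(values)
    # Assumption: start from k evenly spaced centers across the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    indices = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each weight goes to its nearest center.
        indices = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, i in zip(values, indices) if i == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = sum(members) / len(members)
    return centers, indices

def compress_layer(kernels, k=16):
    """Cluster each kernel separately: one codebook and one index list per kernel."""
    return [kmeans_1d(kernel, k) for kernel in kernels]

random.seed(0)
# Hypothetical layer: 4 kernels of 27 weights each (3x3x3), with different spreads.
layer = [[random.gauss(0.0, 0.1 * (n + 1)) for _ in range(27)] for n in range(4)]
compressed = compress_layer(layer, k=16)

for n, (centers, indices) in enumerate(compressed):
    approx = [centers[i] for i in indices]
    err = max(abs(a - w) for a, w in zip(approx, layer[n]))
    print(f"kernel {n}: {len(centers)} centers, max abs error {err:.4f}")
```

Each kernel ends up with its own list of 16 centers, so a kernel with a wide weight spread and one with a narrow spread do not have to share levels. A global codebook would correspond to calling `kmeans_1d` once on all weights of the layer concatenated together.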

“Methods and apparatus relating to techniques for incremental network quantization.” (U.S. Patent No. 11,055,604)

The independent claim grounds that idea in hardware. Claim 1 is not a bare algorithm — it is "a general-purpose graphics processing device" comprising an instruction cache, an instruction unit to execute a stream of instructions, a compute block with "a plurality of processing resources," and a shared cache memory that receives data representing one or more layers of a convolutional neural network. The processing resources are then configured to do four specific things: determine a plurality of weights for a CNN layer that comprises a plurality of kernels; organize those weights into clusters "on a per-kernel basis"; compute a center point for the weights, again per kernel, in each cluster; and apply a K-means compression algorithm to each cluster. The per-kernel granularity is in the claim language itself, not just the title — the clustering and the center-point computation are both scoped to the individual kernel, so each filter gets its own codebook tuned to its own weight distribution.

The dependent claims are where the storage payoff becomes explicit, and they line up exactly with those G06F memory CPCs. Claim 2 has the processing resources "encode the plurality of weights as a 1D tensor" — flattening the kernel's weights into a single vector before clustering. Claim 3 determines "an index associated with the center point" for the weights in each cluster, which is the crux of the compression: once a weight has been assigned to a cluster, you no longer store the weight, you store a small index that points at the cluster's center. Claim 5 then stores that index in the shared cache memory — so the compressed representation lives in on-chip cache, exactly where you want it for fast, low-energy inference.
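The flatten, index, and look-up sequence described in claims 2 and 3 can be pictured with a small Python sketch. It is only an illustration under assumptions: the helper names (`flatten`, `encode`, `decode`), the 1 x 2 x 3 kernel, and the four-entry codebook are invented for this example, and nearest-center assignment is assumed as the way a weight gets its index.

```python
def flatten(kernel):
    """Flatten a nested C x H x W kernel into a single 1D list of weights."""
    return [w for channel in kernel for row in channel for w in row]

def encode(flat, centers):
    """Replace each weight by the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: abs(w - centers[c])) for w in flat]

def decode(indices, centers):
    """Recover approximate weights by looking each index up in the codebook."""
    return [centers[i] for i in indices]

# Hypothetical 1 x 2 x 3 kernel and a 4-entry codebook, both made up for illustration.
kernel = [[[0.11, -0.48, 0.52], [0.02, -0.51, 0.09]]]
centers = [-0.5, 0.0, 0.1, 0.5]

flat = flatten(kernel)
indices = encode(flat, centers)
restored = decode(indices, centers)
print(indices)
print(restored)
```

After encoding, only `indices` and `centers` need to be kept; the original weights are approximated on demand by a table lookup, which is the step that makes a small on-chip store sufficient.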

And the claims name a number. Claim 4 states "wherein the plurality of weights are compressed to 4 bits." That is a concrete, disclosed compression target: a 4-bit index per weight in place of a full-precision (typically 32-bit, or at least 16-bit) value. The arithmetic is the whole point: replacing a 32-bit weight with a 4-bit cluster index is an eightfold reduction in per-weight storage (fourfold against a 16-bit weight), before counting the per-kernel table of centers or any other overhead. Claim 6 wraps the same machinery in "an electronic device" that adds a general-purpose processor with one or more cores alongside the graphics processing device, and claims 7 through 10 mirror the 1D-tensor encoding, the per-cluster index, the 4-bit compression, and the cache storage on that device form. The patent therefore covers the technique both as a GPU subsystem and as a system-on-device.
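To make the 4-bit arithmetic concrete, here is a minimal Python sketch that packs two 4-bit indices into each byte and estimates the storage for one kernel. The nibble order, the helper names (`pack_4bit`, `unpack_4bit`, `compressed_bytes`), the 32-bit precision for stored centers, and the example kernel sizes are assumptions for illustration; the claims as summarized here give only the 4-bit target.

```python
def pack_4bit(indices):
    """Pack 4-bit indices (0..15) two per byte, low nibble first."""
    indices = list(indices)
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("every index must fit in 4 bits")
    padded = indices + [0] * (len(indices) % 2)  # pad odd lengths with a zero nibble
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))

def unpack_4bit(packed, count):
    """Inverse of pack_4bit; count drops the padding nibble if one was added."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]

def compressed_bytes(num_weights, k=16, center_bits=32, index_bits=4):
    """Bytes for one kernel: packed indices plus its codebook of k centers."""
    index_bytes = (num_weights * index_bits + 7) // 8
    codebook_bytes = k * center_bits // 8
    return index_bytes + codebook_bytes

indices = [3, 15, 0, 7, 9]
packed = pack_4bit(indices)
assert unpack_4bit(packed, len(indices)) == indices

# Hypothetical kernel sizes, assuming 32-bit original weights and 32-bit centers.
for n in (27, 1152, 4608):
    raw = n * 4
    small = compressed_bytes(n)
    print(f"{n} weights: {raw} B -> {small} B ({raw / small:.2f}x)")
```

Under these assumptions the ratio approaches 8x only as the kernel grows, because every kernel also carries its own 16-entry codebook; for a very small kernel the codebook is a noticeable share of the total.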

Why does this matter strategically? Compression is the gating technology for running models where memory and power are scarce — phones, sensors, embedded silicon. The claimed value is in the storage representation: a per-kernel codebook, a 4-bit index, and the index parked in shared cache. Intel, whose business spans CPUs to edge accelerators, has a direct interest in owning methods that shrink models to fit its hardware, and the claim's framing as a graphics processing device with named caches and processing resources ties the IP to the silicon rather than to an abstract math trick. The G06F memory CPCs underline that the asserted contribution is layout and footprint, not a new accuracy result.

On scope: granted B2, enforceable, but the claims describe per-kernel K-means clustering implemented on a specific processing-device architecture, down to the 1D-tensor encoding, the per-cluster center index, and the 4-bit target. Generic quantization, pruning, and other compression families are not captured, nor is K-means clustering done globally rather than per kernel. Claim 1's per-kernel structure and the device elements set the line.

It is worth being precise about what the abstract calls "incremental network quantization," because the claims reframe it as a clustering problem rather than uniform bit-truncation. Plain quantization rounds every weight to the nearest value on a fixed grid; per-kernel K-means instead learns the grid from the data, placing cluster centers where each kernel's weights actually concentrate. That data-driven codebook is why the 4-bit target in claim 4 can plausibly hold accuracy where naive 4-bit rounding would not: sixteen learned centers per kernel, chosen to minimize within-cluster distortion, sit closer on average to the real weights than sixteen evenly spaced levels would. The per-kernel scoping matters for the same reason: a layer's kernels can have very different weight distributions, so one global codebook fits them all poorly, while a separate set of centers per kernel adapts to each filter's local statistics. The cost is one small table of centers per kernel, but those tables are typically small next to the weight savings, and claim 5 keeps the per-weight indices in shared cache, where the inference engine reaches them with low latency.
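The contrast between a fixed grid and learned centers can be checked numerically with a small Python sketch. This is an illustration under assumptions, not a result from the patent: the toy weight distribution, the helper names (`uniform_grid`, `lloyd`, `mse`), and the choice to start Lloyd's iterations from the uniform grid are all invented for this example, and it measures reconstruction error only, not model accuracy.

```python
import random

def nearest(v, centers):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda c: abs(v - centers[c]))

def mse(values, centers):
    """Mean squared error when each value is replaced by its nearest center."""
    return sum((v - centers[nearest(v, centers)]) ** 2 for v in values) / len(values)

def uniform_grid(values, k=16):
    """k evenly spaced levels spanning the value range (fixed-grid quantization)."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / (k - 1) for i in range(k)]

def lloyd(values, centers, iters=25):
    """Refine centers with Lloyd's k-means iterations, starting from `centers`."""
    centers = list(centers)
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for v in values:
            buckets[nearest(v, centers)].append(v)
        # Move each center to the mean of its bucket; empty buckets stay put.
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers

random.seed(1)
# Hypothetical kernel: most weights near zero, plus a few large outliers.
weights = [random.gauss(0.0, 0.05) for _ in range(500)]
weights += [random.uniform(-1, 1) for _ in range(12)]

grid = uniform_grid(weights)
learned = lloyd(weights, grid)
print(f"uniform 16-level MSE:  {mse(weights, grid):.6f}")
print(f"learned 16-center MSE: {mse(weights, learned):.6f}")
```

Because each Lloyd iteration cannot increase the squared error, centers refined from the uniform grid are never worse than the grid itself on this measure. How much better they are depends on how unevenly the weights are distributed.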

The takeaway: US11055604B2 is a silicon vendor patenting the model-shrinking layer that decides whether a network runs on a small device at all — an unglamorous but decisive piece of the inference stack, claimed as concrete hardware that clusters each kernel's weights, stores 4-bit indices, and keeps them in cache.

