Just Issued: NVIDIA's Speech-Synthesis Patent Claim 1

A granted patent — issued, not merely published — covers a speech vocoder trained by mapping audio to Gaussian values through invertible layers, then run in reverse to synthesize voice. Here is what the independent claim requires, and where it lands in the G10L/G06N landscape.

First, the label, because it changes what the document is. US12670895B2, "Invertible neural network to synthesize audio signals," is a granted patent — kind code B2 — that issued on June 30, 2026 and is assigned to NVIDIA. It is not a published application awaiting examination; it is an issued grant, which means an examiner has allowed the claims and they now define enforceable scope. Read the abstract and you will learn what the disclosure is about; read claim 1 and you will learn what actually issued. Those are different questions, and on a grant the second one is the one that matters.

Here is what the independent claim requires, verbatim.

One or more processors, comprising: circuitry to: train one or more neural networks to infer one or more speech features of one or more voice audio signals to be generated based, at least in part, on training that uses speech of the one or more voice audio signals to train the one or more neural networks to infer speech features, wherein the one or more neural networks are trained by: converting a compact representation of a first audio signal; generating one or more Gaussian values based at least in part on the converted compact representation and the first audio signal; and training the one or more neural networks using the one or more Gaussian values; and generate the one or more voice audio signals with the trained one or more neural networks.

Source: Invertible neural network to synthesize audio signals, US12670895B2

Reading claim 1 element by element

Strip it to its limitations and claim 1 is an apparatus claim ("one or more processors, comprising: circuitry to") that requires four things to happen. The circuitry must (a) train one or more neural networks to infer speech features of the voice audio to be generated; (b) as part of that training, convert a compact representation of a first audio signal; (c) generate one or more Gaussian values based at least in part on that converted representation and the first audio signal, and train the networks using those Gaussian values; and (d) generate the voice audio with the trained networks. Two features are worth flagging for scope. First, the claim is directed to the training-and-generation apparatus as a whole, not to a standalone runtime decoder: the Gaussian-values step and the training step are inside the claim. Second, the compact representation is left generic in claim 1; it is not limited to any particular feature format at the independent level. What that representation is gets pinned down only in the dependents.

The dependents are where the mechanism most readers associate with this patent actually lives. Claim 2 requires that "the compact representation is a mel-spectrogram" — the frequency-over-time feature map that conditions the model. Claim 3 requires the Gaussian values to be "generated using one or more invertible layers of the one or more neural networks," which is the reversibility that the title advertises but claim 1 does not itself recite. Claim 4 narrows further: the invertible layers "include an audio transform that uses dilated convolutions" — the long-context filter arrangement. And claim 5 adds the directional property: the networks "are to be trained in a first direction and to generate inferences in a second direction." So the well-known flow-vocoder characteristics — mel-spectrogram conditioning, invertible coupling layers, dilated convolutions, train-forward-infer-backward — are claimed, but as dependent limitations layered onto a broader independent claim. Anyone reading scope should keep the independent claim and its narrowing dependents distinct.
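To make the "train in a first direction, infer in a second" idea concrete, here is a minimal sketch of one invertible layer in plain Python. It assumes an affine coupling layer, a common construction in flow-based vocoders; the patent text quoted above does not specify this form. The function names (`conditioner`, `forward`, `inverse`, `nll`) and the toy arithmetic inside `conditioner` are hypothetical stand-ins chosen for illustration, not the patented implementation.

```python
import math
import random


def conditioner(x_a, mel):
    """Hypothetical stand-in for the network that predicts a log-scale and a
    shift from the untouched half of the audio and the mel-spectrogram frame.
    Any function works here; invertibility does not depend on its form."""
    log_s = [math.tanh(0.5 * a + 0.1 * m) for a, m in zip(x_a, mel)]
    shift = [0.3 * a - 0.2 * m for a, m in zip(x_a, mel)]
    return log_s, shift


def forward(x, mel):
    """First direction (training): audio samples -> latent values z.
    Also returns log|det J|, which the likelihood objective needs."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, shift = conditioner(x_a, mel)
    z_b = [math.exp(ls) * b + t for ls, b, t in zip(log_s, x_b, shift)]
    return x_a + z_b, sum(log_s)


def inverse(z, mel):
    """Second direction (inference): latent values z -> audio samples."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, shift = conditioner(z_a, mel)  # same inputs as in forward()
    x_b = [(b - t) * math.exp(-ls) for ls, b, t in zip(log_s, z_b, shift)]
    return z_a + x_b


def nll(z, log_det):
    """Negative log-likelihood under a standard Gaussian prior on z,
    up to an additive constant. Minimizing it pushes z toward Gaussian."""
    return 0.5 * sum(v * v for v in z) - log_det


random.seed(0)
audio = [random.uniform(-1.0, 1.0) for _ in range(8)]  # toy "first audio signal"
mel = [random.uniform(0.0, 1.0) for _ in range(4)]     # toy conditioning frame

z, log_det = forward(audio, mel)   # training direction
loss = nll(z, log_det)             # quantity a trainer would minimize
recon = inverse(z, mel)            # running the same layer backward
max_err = max(abs(a - b) for a, b in zip(audio, recon))

# Synthesis: draw Gaussian values and run the layer in the second direction.
z_sample = [random.gauss(0.0, 1.0) for _ in range(8)]
generated = inverse(z_sample, mel)
```

The round trip recovers the input to floating-point precision because the half passed through unchanged lets `inverse` recompute exactly the scale and shift that `forward` used. A real model would stack many such layers and use a learned network, for example one built from dilated convolutions, in place of the toy `conditioner`.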

The independent-claim family and the CPC landing

The grant carries several parallel independent claims, which is worth noting because they cover the same invention in different statutory categories. Claim 1 and claim 18 are directed to "one or more processors"; claims 6 and 23 to a "system"; claim 12 to a "speech synthesis system"; and claim 29 to a "computer-implemented method." Each recites the same core steps — convert a compact representation, generate Gaussian values, train on them, generate voice audio. The categories matter for how a claim reads onto a product versus a process. One dependent is easy to miss and telling about intended use: claim 17 specifies that "the speech synthesis system comprises a vehicle," placing an in-car voice system expressly within the claimed embodiments.

On classification, the patent lands with a lead CPC symbol of G10L 13/047 — within G10L (speech analysis and synthesis), the subgroup for the details of speech synthesis — and adds G06N 3/045 and G06N 3/047 for the neural-network architecture itself. That pairing is the substance in shorthand: a speech-synthesis method (G10L 13) implemented as a specific neural architecture (G06N 3/04x). It sits in the same G06N neighborhood that defines the broader machine-learning patent landscape, but its G10L lead marks it as a vocoder grant specifically, not a general-purpose network.

Context from the same issue date reinforces where this record sits. NVIDIA received a cluster of generative-media grants on June 30, 2026, several sharing the same inventor line. US12670600B2, "Disentanglement of image attributes using a neural network," is directed to unsupervised keypoint learning that reconstructs an input image from pose and appearance. US12670541B2 claims a temporally stable recurrent network for reconstructing motion blur and depth-of-field effects, and US12670543B2 claims an API that indicates neural-network frame-interpolation support. These are separate grants with their own claim sets; the audio-synthesis patent is the one whose independent claim reads as a training-plus-generation apparatus for speech.

For a Just Issued read, the disciplined takeaway is narrow. What issued is an apparatus claim (with matching system and method claims) covering training a neural network by converting a compact audio representation, generating Gaussian values from the converted representation and the first audio signal, training on those values, and generating voice audio. The mel-spectrogram, invertible layers, and dilated convolutions live in the dependents rather than the independent claim. It is granted, not pending, so the scope is defined as allowed rather than as filed. Whether any particular product reads on claim 1 is a question of matching each limitation to an accused implementation, and this brief takes no position on that. The record establishes what the claim requires; that is the fact worth fixing before anyone reasons about scope.
