Just Issued: Salesforce End-to-End Speech Patent (2020)

A February 2020 Salesforce grant trains a speech recognizer end to end with a policy-learning objective. Here is what that combination actually claims.

Here's what actually issued. On February 25, 2020, salesforce.com, inc. was granted US10573295B2, "End-to-end speech recognition with policy learning," naming Yingbo Zhou and Caiming Xiong as inventors. The CPC list combines speech-recognition classes (G10L 15/063, G10L 15/14, G10L 15/16) with classes for neural-network learning (G06N 3/084) and probabilistic models (G06N 7/005).

Two ideas are stacked here. "End-to-end" means the system maps audio directly to text in a single trained model, rather than chaining separate acoustic, pronunciation, and language components — the older, more brittle pipeline. "Policy learning" borrows from reinforcement learning: instead of only minimizing a frame-level error, the model is trained against an objective tied to the quality of the final transcription, treating recognition decisions as actions to optimize.

“The disclosed technology teaches a deep end-to-end speech recognition model, including using multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions.” (U.S. Patent No. 10,573,295)

The independent claim defines the combination tightly. Claim 1 trains the model with "a multi-objective learning criteria" that, at each of "one thousand to millions of backpropagation iterations," combines two specific functions. The first is "a maximum likelihood objective function that modifies the model parameters to maximize a probability of outputting a correct transcription" — the standard supervised target. The second is "a policy gradient function that modifies the model parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions." That phrase — "non-differentiable performance metric" — is the crux. Word error rate, the metric people actually care about, is not differentiable, so you cannot directly backpropagate through it. Policy-gradient reinforcement learning is exactly the tool that lets you optimize a non-differentiable reward, and the claim uses it to pull training toward the real objective while the maximum-likelihood term keeps the model stable and trainable.
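The two terms of claim 1 can be written compactly. The notation below is an assumption made for illustration (the patent states the objectives in words, and the mixing weight λ is hypothetical); it is a sketch of the standard score-function identity, not the patent's own formula. The second line shows why a non-differentiable reward is workable: the gradient falls on the log-probability of the sampled transcription, never on the reward itself.

```latex
% Assumed notation: x = input audio, y = ground-truth transcription,
% \hat{y} = transcription sampled from the model, r(\hat{y}, y) = reward from the
% non-differentiable metric, \lambda = hypothetical mixing weight.
\begin{aligned}
\mathcal{L}(\theta)
  &= \underbrace{-\log P_\theta(y \mid x)}_{\text{maximum likelihood}}
   \;-\; \lambda \, \underbrace{\mathbb{E}_{\hat{y} \sim P_\theta(\cdot \mid x)}
     \big[ r(\hat{y}, y) \big]}_{\text{expected reward}} \\
\nabla_\theta \, \mathbb{E}_{\hat{y} \sim P_\theta(\cdot \mid x)} \big[ r(\hat{y}, y) \big]
  &= \mathbb{E}_{\hat{y} \sim P_\theta(\cdot \mid x)}
     \big[ r(\hat{y}, y) \, \nabla_\theta \log P_\theta(\hat{y} \mid x) \big]
\end{aligned}
```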

The dependent claims name the metrics and the algorithms, leaving little to the imagination. Claim 5 specifies the performance metric "is word error rate (abbreviated WER)," and claim 6 defines the reward function as "1−WER" — a literal formula. Claim 7 offers "character error rate (CER)" as an alternative metric. Claim 3 makes the maximum-likelihood term a "connectionist temporal classification (CTC) objective function," the canonical end-to-end speech loss, and describes how it produces an output transcription by selecting the most probable label per timestep and multiplying softmax probabilities. Claim 4 describes how the policy-gradient term gets its reward: it "independently samples a transcription label for each timestep," concatenates the sampled labels into an output transcription, and measures the difference from ground truth using the performance metric. Claim 9 names the technique as "self-critical sequence training (SCST)" — a specific, published policy-gradient variant that uses the model's own greedy output as the baseline. So the patent is not gesturing at "RL for speech"; it is claiming CTC-plus-SCST with a 1−WER reward, evaluated by per-timestep sampling.
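To make the mechanics of claims 3 through 6 concrete, here is a minimal Python sketch: greedy and sampled per-timestep label selection, a CTC-style collapse, and a 1−WER reward. Every name in it (`BLANK`, `collapse`, `greedy_path`, `sample_path`, `wer`, `reward`) is hypothetical, and the repeat-merging collapse rule is the usual CTC convention rather than language quoted from the claims. Note that WER can exceed 1 when a hypothesis contains many insertions, so this plain 1−WER can go negative; the sketch does not try to resolve that against the claim's "positive reward" wording.

```python
import random

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label


def collapse(labels):
    """CTC-style collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for lab in labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)


def greedy_path(probs, alphabet):
    """Select the most probable label at each timestep."""
    return [alphabet[max(range(len(p)), key=p.__getitem__)] for p in probs]


def sample_path(probs, alphabet, rng):
    """Independently sample one label per timestep from its distribution."""
    return [rng.choices(alphabet, weights=p, k=1)[0] for p in probs]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def reward(reference, hypothesis):
    """Reward in the 1 - WER form named in claim 6."""
    return 1.0 - wer(reference, hypothesis)


if __name__ == "__main__":
    alphabet = [BLANK, "a", "b", " "]
    probs = [[0.1, 0.8, 0.05, 0.05], [0.1, 0.8, 0.05, 0.05],
             [0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
    rng = random.Random(0)
    print("greedy :", collapse(greedy_path(probs, alphabet)))
    print("sampled:", collapse(sample_path(probs, alphabet, rng)))
    print("reward :", reward("the cat sat", "the bat sat"))
```

Swapping `split()` for `list()` inside `wer` would give a character-level variant along the lines of the CER alternative in claim 7.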

One more disclosed detail shows real training craft. Claim 10 specifies that "relative reliance on the maximum likelihood objective function and the policy gradient function shifts during training, with greater emphasis on the maximum likelihood objective function early in training than late in training." That is a scheduling strategy: lean on the stable supervised loss early when the model is weak, then increasingly weight the reward-driven policy gradient as the model becomes good enough for its sampled transcriptions to be informative. Claim 2 adds the per-timestep softmax detail — "a normalized distribution of softmax probabilities over a set of transcription labels, including a blank label" — the blank label being the hallmark of CTC alignment. The system claim (claim 11) wraps all of this into a "deep end-to-end speech recognition processor" with "convolution layers and recurrent layers," fixing the architecture as a convolutional-recurrent stack.
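The weighting shift in claim 10 could be realized in many ways. The sketch below assumes a simple linear ramp; the function names, the ramp shape, and the endpoint values are all hypothetical, since the claim only requires greater emphasis on maximum likelihood early in training than late.

```python
def objective_weights(step, total_steps, pg_start=0.0, pg_end=0.5):
    """Return (ml_weight, pg_weight) for a training step.

    Assumed linear ramp: the policy-gradient weight grows from pg_start to
    pg_end, and the maximum-likelihood weight is whatever remains.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    pg_weight = pg_start + (pg_end - pg_start) * frac
    return 1.0 - pg_weight, pg_weight


def combined_loss(ml_loss, pg_loss, step, total_steps):
    """Blend the two loss values using the scheduled weights."""
    ml_weight, pg_weight = objective_weights(step, total_steps)
    return ml_weight * ml_loss + pg_weight * pg_loss


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, objective_weights(step, 100))
```

A step function or a warm-up period of pure maximum likelihood would satisfy the same claim language; the linear ramp is just the simplest schedule to read.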

The combination is the point. Optimizing end-to-end against a reward-like signal — specifically a 1−WER reward via self-critical sequence training, scheduled to grow over training — aligns the optimization with what users actually care about, accurate transcripts, rather than only a proxy frame loss. For Salesforce, whose research arm produced a steady stream of NLP and speech work in this period, the grant secures a method that ties a modern training objective to a modern convolutional-recurrent architecture, with the metric and the RL variant pinned down in the claims.

On scope, the usual discipline applies. This is a granted B2 and enforceable, but the claims describe a specific end-to-end-plus-policy training method: the maximum-likelihood term (CTC in claim 3), the policy-gradient term (SCST in claim 9) over a non-differentiable WER or CER reward, the per-timestep sampling, and the early-to-late weighting shift. They do not cover end-to-end speech recognition generally, nor reinforcement learning generally. The independent claim, with its "one thousand to millions" of iterations combining the two named objectives, is the boundary.

It is worth pausing on why the "non-differentiable performance metric" framing is the technical core rather than incidental wording. Conventional neural-network training relies on backpropagation, which requires every step from input to loss to be differentiable so gradients can flow. Word error rate breaks that chain: it is computed by an edit-distance alignment between predicted and reference transcripts, a discrete operation with no useful gradient. For years that forced speech systems to optimize a differentiable surrogate — cross-entropy or CTC loss — and simply hope it correlated with WER. The policy-gradient term in claim 1 is what lets the model optimize WER itself: by sampling a transcription (claim 4's per-timestep label sampling), scoring it with the real metric, and treating the score as a reward, the method estimates a gradient for a quantity that has none in the usual sense. The self-critical sequence training of claim 9 sharpens this by using the model's own greedy decode as a baseline, which lowers the variance of that reward estimate so training stays stable. And the curriculum of claim 10 — heavier maximum-likelihood weight early, heavier policy-gradient weight late — reflects a practical truth: a half-trained model's sampled transcripts are too noisy to give a useful reward, so you lean on the supervised loss until the samples are good enough to learn from. Each named element addresses a concrete obstacle to optimizing the metric users actually feel.
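The baseline idea can be shown on a toy scale. The sketch below computes the gradient of a self-critical surrogate loss with respect to per-timestep logits, assuming independent per-timestep categorical distributions as in claim 4. The function names and the surrogate form, -(r_sample - r_greedy) * sum_t log p_t(sampled label), follow the published SCST formulation rather than anything quoted from the patent, so treat it as an illustration under those assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one timestep's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scst_logit_gradient(logits, sampled_ids, sample_reward, greedy_reward):
    """Gradient of the self-critical surrogate loss w.r.t. the logits.

    Loss     = -(sample_reward - greedy_reward) * sum_t log p_t[sampled_ids[t]]
    dLoss/dz = -(advantage) * (one_hot(sampled) - softmax(z)) at each timestep.
    The rewards are plain numbers (e.g. 1 - WER); no gradient passes through them.
    """
    advantage = sample_reward - greedy_reward
    grads = []
    for z, chosen in zip(logits, sampled_ids):
        p = softmax(z)
        grads.append([-advantage * ((1.0 if k == chosen else 0.0) - p_k)
                      for k, p_k in enumerate(p)])
    return grads


if __name__ == "__main__":
    logits = [[0.2, 1.5, -0.3], [1.0, 0.1, 0.4]]
    # The sampled transcript scored better than the greedy one.
    for row in scst_logit_gradient(logits, [1, 2], 0.8, 0.6):
        print([round(g, 4) for g in row])
```

When the sampled transcript scores higher than the greedy one, a gradient-descent step raises the logits of the sampled labels; when it scores lower, the step pushes them down; when the two rewards tie, the update vanishes.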

The takeaway: US10573295B2 is a good example of how 2020-era speech IP blended sequence modeling with reinforcement-style objectives — CTC paired with a 1−WER policy gradient via self-critical sequence training — and of how a software company's research lab converts published techniques into enforceable, method-specific grants.
