Just Issued: Microsoft Reinforcement-Learning Patent

A December-2022 Microsoft grant on a reinforcement-learning agent that generalizes across tasks. What claim 1 covers.

Here's what actually issued. On December 13, 2022, Microsoft Technology Licensing, LLC was granted US11526812B2, "Generalized reinforcement learning agent," with inventors including Katja Hofmann and Sam Devlin from Microsoft's game-AI research. The single CPC code is G06N 20/20 (ensemble machine learning), a tight classification that itself hints at the approach.

Reinforcement learning trains an agent to act by trial and error, maximizing a reward signal. The hard, valuable problem is generalization: an agent that masters one task or environment usually fails to transfer to a new one. The title's word "generalized" points squarely at that — a method aimed at producing an agent whose learned behavior carries across tasks rather than overfitting to a single one.

“An apparatus has a memory storing a reinforcement learning policy with an optimization component and a data collection component.” (U.S. Patent No. 11,526,812)

The independent claim names the specific structural idea behind that generalization, and it is more precise than the title. Claim 1's apparatus stores a reinforcement learning policy that has two named parts: "an optimization component" and "a data collection component." The key element is "a regularization component configured to apply regularization selectively between" those two components. That selectivity is the invention. The policy that collects experience (explores the environment, gathers data) and the policy that is optimized (the one being driven toward high reward) are regularized differently — and the dependent claims make the asymmetry explicit. Claim 3 applies "no regularization to the data collection component and applies regularization to the optimization component." Claim 4 applies "more regularization to the optimization component than to the data collection component." The intuition this captures: heavily regularizing the optimization target keeps it from overfitting to one task (which is what kills generalization), while leaving data collection unregularized keeps exploration rich. Splitting regularization across the two roles is precisely how the claim attacks the transfer problem.
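To make the asymmetry concrete, here is a minimal sketch of the claim 3 pattern, assuming a toy linear softmax policy. The class name, method names, and the choice to add Gaussian noise to the logits are all illustrative assumptions; the patent text quoted above does not prescribe this implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SelectivelyRegularizedPolicy:
    """Toy linear policy. All names here are hypothetical, not from the patent."""

    def __init__(self, weights, noise_std, seed=0):
        self.weights = weights      # one weight row per action
        self.noise_std = noise_std  # strength of the stochastic regularizer
        self.rng = random.Random(seed)

    def _logits(self, obs, regularize):
        logits = []
        for row in self.weights:
            value = sum(w * x for w, x in zip(row, obs))
            if regularize:
                # Assumed form of stochastic regularization: Gaussian noise
                # added to each action's logit.
                value += self.rng.gauss(0.0, self.noise_std)
            logits.append(value)
        return logits

    def data_collection_probs(self, obs):
        # Data collection component: no regularization (the claim 3 pattern).
        return softmax(self._logits(obs, regularize=False))

    def optimization_probs(self, obs):
        # Optimization component: regularization applied.
        return softmax(self._logits(obs, regularize=True))


policy = SelectivelyRegularizedPolicy(
    weights=[[1.0, 0.0], [0.0, 1.0]], noise_std=0.5
)
obs = [0.2, 0.8]
rollout_probs = policy.data_collection_probs(obs)  # used to gather experience
update_probs = policy.optimization_probs(obs)      # used when computing the loss
```

The two methods share the same weights, so the only difference between the roles is whether the regularizer is switched on. The claim 4 variant ("more regularization" rather than none) could be sketched by giving each role its own noise level instead of a boolean flag.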

The processor in claim 1 then runs a reinforcement learning process with a transfer test baked into the claim language: it triggers execution of an agent "with respect to a first task," observes the agent's observation space and action, updates the policy "taking into account the regularization" by computing a loss function — and then triggers the agent "according to the updated reinforcement learning policy and with respect to a second task, wherein the second task is different from the first task." So the claim does not merely assert generalization as a goal; it requires the same learned policy to be deployed on a second, different task. That cross-task deployment is an element of the independent claim, which is why this reads as transfer IP rather than vanilla RL.
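The sequence of steps in claim 1 can be sketched on a deliberately tiny problem. The example below assumes a stateless two-action bandit (so the "observation" is trivial), a REINFORCE-style update, and an L2 penalty on the logits standing in for the regularization term. None of these choices come from the patent; the task definitions, function names, and hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_agent(logits, reward_by_action, rng):
    """Trigger the agent once on a task and return (action, reward)."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return action, reward_by_action[action]


def update_policy(logits, action, reward, lr=0.5, beta=0.1):
    """One gradient-descent step on the loss
    -reward * log pi(action) + (beta / 2) * sum(logit ** 2).
    The second term is an assumed stand-in for the regularization."""
    probs = softmax(logits)
    new_logits = []
    for i, (logit, p) in enumerate(zip(logits, probs)):
        indicator = 1.0 if i == action else 0.0
        new_logits.append(logit + lr * (reward * (indicator - p) - beta * logit))
    return new_logits


rng = random.Random(0)
first_task = [0.0, 1.0]   # hypothetical first task: only action 1 is rewarded
second_task = [0.2, 0.7]  # hypothetical second task with different rewards

logits = [0.0, 0.0]
for _ in range(50):
    # Run on the first task, record the action, update with regularization.
    action, reward = run_agent(logits, first_task, rng)
    logits = update_policy(logits, action, reward)

# Deploy the same updated policy on the second, different task.
action_2, reward_2 = run_agent(logits, second_task, rng)
```

The point of the sketch is the shape of the loop rather than the learning algorithm: the policy updated on the first task is the one executed on the second, with no retraining in between.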

The math is in the claims too, which is unusual and worth noting. Claim 7 specifies the reinforcement learning process as computing "a loss of the policy LACIB plus a first weight λv times a loss LACV of the critic minus a second weight λH times a heuristic entropy bonus of the policy plus a Lagrangian multiplier hyperparameter β times a regularization term." That is a fully written-out actor-critic objective: a policy loss, a weighted critic loss, an entropy bonus that encourages exploration, and — the distinctive term — a Lagrangian-multiplier-controlled regularization term. The Lagrangian multiplier β is how the "selective regularization" gets dialed: it is the knob that sets how hard the optimization component is constrained. Claim 8 confirms the process "is an actor-critic process," and claims 5 and 6 describe the regularization methods concretely — restricting "a capacity of the machine learning model," selecting its architecture, or "stochastic regularization whereby noise is added to the machine learning model."
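As a rough illustration of how the four terms in claim 7 combine, the sketch below writes the sum as a plain function. The function name, argument names, and numeric values are assumptions chosen for illustration; only the structure (policy loss, plus weighted critic loss, minus weighted entropy bonus, plus β times a regularization term) follows the quoted claim language.

```python
def total_loss(policy_loss, critic_loss, entropy_bonus, regularization_term,
               lambda_v, lambda_h, beta):
    """Combine the terms in the order the claim 7 language lists them."""
    return (
        policy_loss
        + lambda_v * critic_loss       # first weight times the critic loss
        - lambda_h * entropy_bonus     # second weight times the entropy bonus
        + beta * regularization_term   # Lagrangian multiplier times regularizer
    )


# Made-up per-batch values, purely to show the arithmetic.
terms = dict(policy_loss=0.8, critic_loss=0.4,
             entropy_bonus=0.6, regularization_term=0.2)

loss = total_loss(**terms, lambda_v=0.5, lambda_h=0.01, beta=0.1)
loss_without_reg = total_loss(**terms, lambda_v=0.5, lambda_h=0.01, beta=0.0)
```

With these made-up numbers the regularized loss is about 1.014, and setting `beta` to zero drops it to about 0.994. That 0.02 gap is the contribution of the regularization term, which is what β scales.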

The dependent claims also reveal the intended deployment surface, and it is broad. Claim 9 has the agent be "a physical entity" operating in a first and then a different second physical environment — robotics transfer. Claim 10 makes the agent "a chat bot" applying a skill across different situations. Claim 11 makes it "a player in a computer game," and claim 12 lists the agent as any of "a robotic vacuum cleaner, a manufacturing robot arm, a chat bot, an avatar in a video game." That enumeration ties the abstract regularization trick to concrete products — from factory arms to game characters to conversational agents.

Why does Microsoft hold this? Its research has long used games as testbeds for RL, and generalizable agents are the bridge from game-playing demos to useful, transferable decision-making systems. Owning a method for cross-task generalization — specifically the selective-regularization, Lagrangian-controlled actor-critic formulation in these claims — is owning a piece of the path from narrow RL to broadly capable agents, and the dependent claims stake that path across robots, bots, and games.

On scope, the tight CPC and the disciplined claim drafting point the same way. The patent is granted (B2) and enforceable, but the claims describe a specific approach: an optimization/data-collection split with regularization applied selectively between the two components (asymmetrically, per claims 3 and 4), an actor-critic loss with a named Lagrangian regularization term (claim 7), and required deployment on a second, different task. "Generalized reinforcement learning" in the title is not a claim to all of transfer learning or all of RL; the selective-regularization mechanism is what is fenced. Read the allowed claim language, especially the λ/β loss in claim 7.

The takeaway: US11526812B2 is a method-specific RL grant from a lab with deep game-AI roots, addressing the field's central generalization problem by regularizing the optimization and data-collection halves of the policy differently — and a reminder that a narrow CPC like G06N 20/20 is itself a clue to how the claims are framed.

