Start with what published, because published is not granted. Application US20230076095A1, "Synthetic Data Generation for Training of Natural Language Understanding Models," was published March 9, 2023, and is assigned to Microsoft Technology Licensing, LLC, with inventors including Jianfeng Gao. The CPC codes sit in the spoken-language class: G10L 15/18, G10L 15/22, G10L 15/083.
The mechanism is manufacturing training data. Natural-language-understanding models — the components that parse intent and slots in a user utterance — are hungry for labeled examples, which are expensive to collect. This application's approach is to generate synthetic training data: produce varied, labeled examples programmatically so the NLU model sees enough diversity to generalize without a massive human-annotation effort.
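To make "produce varied, labeled examples programmatically" concrete, here is a minimal sketch of one common way to do it: template filling with automatic intent and slot labels. This is a generic illustration and not the method claimed in the application, whose claims are not reproduced here. The intents, templates, slot values, and function names (`TEMPLATES`, `SLOT_VALUES`, `fill`, `generate`) are all hypothetical.

```python
import itertools
import random

# Hypothetical templates: each intent maps to patterns with {slot} placeholders.
TEMPLATES = {
    "book_flight": [
        "book a flight from {origin} to {destination}",
        "i need to fly to {destination} from {origin}",
    ],
    "set_alarm": [
        "wake me up at {time}",
        "set an alarm for {time}",
    ],
}

# Hypothetical slot vocabularies used to fill the placeholders.
SLOT_VALUES = {
    "origin": ["seattle", "boston"],
    "destination": ["denver", "austin"],
    "time": ["6 am", "noon"],
}


def is_placeholder(token):
    """True for whitespace-delimited tokens of the form {slot_name}."""
    return token.startswith("{") and token.endswith("}")


def fill(template, assignment):
    """Fill one template and return (tokens, BIO tags) of equal length."""
    tokens, tags = [], []
    for tok in template.split():
        if is_placeholder(tok):
            name = tok[1:-1]
            value_tokens = assignment[name].split()
            tokens.extend(value_tokens)
            # First token of a slot value gets B-, the rest get I-.
            tags.extend([f"B-{name}"] + [f"I-{name}"] * (len(value_tokens) - 1))
        else:
            tokens.append(tok)
            tags.append("O")
    return tokens, tags


def generate(templates, slot_values, limit=None, seed=0):
    """Enumerate every template/slot-value combination as a labeled example."""
    examples = []
    for intent, patterns in templates.items():
        for pattern in patterns:
            names = [t[1:-1] for t in pattern.split() if is_placeholder(t)]
            for combo in itertools.product(*(slot_values[n] for n in names)):
                tokens, tags = fill(pattern, dict(zip(names, combo)))
                examples.append({"intent": intent, "tokens": tokens, "tags": tags})
    # Shuffle deterministically so a truncated sample still mixes intents.
    random.Random(seed).shuffle(examples)
    return examples if limit is None else examples[:limit]


if __name__ == "__main__":
    for example in generate(TEMPLATES, SLOT_VALUES, limit=3):
        print(example["intent"], list(zip(example["tokens"], example["tags"])))
```

The labels come for free here because the generator knows which intent and slot produced each token. The obvious weakness is diversity: a handful of templates yields repetitive phrasing, which is the gap that more elaborate synthetic-data methods try to close.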
This is the same theme that recurs across the efficiency-and-data side of AI IP: the data pipeline is often where models quietly succeed or fail, and methods that produce useful synthetic data are valuable precisely because real labeled data is the binding constraint. For Microsoft, whose assistants and enterprise products depend on robust NLU, owning a synthetic-data method protects the unglamorous step that feeds the models.
Because this is a publication, the framing is intent, not entitlement. The claims as filed describe what Microsoft seeks; the allowed claims, if a grant issues, set the enforceable scope. Until then there is nothing to assert, and the title is a label, not a monopoly.
The takeaway: US20230076095A1 is a published marker on the data-generation side of the NLU stack. It is a reminder that for every headline model architecture, the labs are also filing on the synthetic-data plumbing that trains it, and that an A1 document is a signal of direction, not a right.