Landscape Report: Vision-Language AI Patents (G06V) | AlgorithmClaims

Multimodal AI rests on a cluster of vision-language grants spread across several assignees. Mapping it shows a shared mechanism and a contested field.

Across the portfolio, vision-language is the opposite of mixture-of-experts: where MoE's foundational grants concentrate in one assignee, the vision-language cluster is spread across several. That distribution is the headline finding, and it tells you this subfield was contested early by multiple major labs rather than anchored by one originating patent.

Start with the standout record. Salesforce's US12462592B2, "Systems and methods for a vision-language pretraining framework" (granted November 4, 2025; inventors Junnan Li and Chu Hong Hoi), claims a pretraining framework — the objectives a model optimizes so it learns a joint image-text understanding rather than two disconnected skills. Its CPC spread runs across G06V 20/70, G06V 10/764, G06V 10/774, and several G06F text codes: a vision grant with language machinery attached, which is exactly what "vision-language" should look like in the classifications.

“Embodiments described herein provide a multimodal vision-language model. The multimodal vision-language model contains a Generalist Multimodal Transformer capable of complete multiple tasks using the same set of parameters learning from pre-training.” (U.S. Patent No. 12,462,592)

Next to it, Microsoft's US12518512B2, "Training vision models with unified contrastive learning" (granted January 6, 2026; inventors Yuan, Li, Yang, Xiao), claims the contrastive approach: pulling matching image-text pairs together in the representation space and pushing mismatched ones apart. Its CPC codes are G06V 10/774 and G06T 9/00. Google's US12387510B2, "Instance level scene recognition with a vision language model" (granted August 12, 2025), applies a vision-language model to recognizing specific instances in a scene, under CPC G06V 20/70, G06V 20/41, and G06V 10/764. Three assignees, three angles, one mechanism: align modalities in a shared space, then specialize.
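For readers who want the contrastive idea in concrete terms, here is a minimal sketch of a generic symmetric image-text contrastive loss (often called InfoNCE). It is not the method claimed in any of the grants above. The function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the sketch assumes no embedding is the zero vector.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i matches text i."""
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: the correct text for image i is column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy batch (hypothetical values): matched pairs point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
# Same texts in swapped order, so every "matching" pair is actually a mismatch.
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

The loss is low when each image sits closest to its own text and high when the pairing is wrong, which is the "pull together, push apart" behavior described above.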

Reading the cluster by CPC rather than by company is what makes the pattern legible. Each of the three carries at least one code in G06V 10/7xx or 20/7x, the recognition and training subgroups of the vision class, and the Salesforce grant adds text-handling G06F codes alongside. That shared classification footprint is the structural evidence that these are variations on a common technique, not unrelated inventions that happen to mention images and words.
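The CPC-first reading can be sketched in a few lines. The records below include only the G06V and G06T codes named in this report: the Salesforce grant's G06F codes are left out because the report does not list them individually. The helper names are hypothetical.

```python
from collections import defaultdict

# CPC codes as named in this report; not a complete classification record.
PATENTS = {
    "US12462592B2": ["G06V 20/70", "G06V 10/764", "G06V 10/774"],  # Salesforce
    "US12518512B2": ["G06V 10/774", "G06T 9/00"],                  # Microsoft
    "US12387510B2": ["G06V 20/70", "G06V 20/41", "G06V 10/764"],   # Google
}

def group_by_cpc(patents):
    # Invert the mapping: CPC code -> set of patent numbers carrying it.
    by_code = defaultdict(set)
    for number, codes in patents.items():
        for code in codes:
            by_code[code].add(number)
    return by_code

def shared_codes(patents, min_records=2):
    # Keep only codes that appear on at least `min_records` grants.
    return {code: sorted(numbers)
            for code, numbers in group_by_cpc(patents).items()
            if len(numbers) >= min_records}

overlap = shared_codes(PATENTS)
for code, numbers in sorted(overlap.items()):
    print(code, numbers)
```

On these three records, the shared codes are G06V 10/764, G06V 10/774, and G06V 20/70, each appearing on two of the three grants.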

The analyst's caveats, as always. These counts and clusters are filing-date sensitive: vision-language is one of the fastest-growing areas in AI IP, and new grants and applications publish constantly, so any snapshot ages quickly. And holding a grant in this cluster does not establish market position — patent counts measure rights, not products or revenue. A landscape map shows where the rights sit; it does not adjudicate who practices the technique best.

The whitespace question follows naturally from the map. With pretraining frameworks (Salesforce), contrastive training (Microsoft), and applied scene recognition (Google) each spoken for by a granted record, later entrants are pushed toward the edges — efficiency variants, on-device deployment, domain-specific recognition. Where the granted cluster is dense, the open ground is at its margins. That is where the next wave of vision-language applications will likely be filed.

