Just Issued: Salesforce Vision-Dialogue Transformer Patent

A January-2023 Salesforce grant unifies vision and dialogue in one BERT-based transformer. Multimodal IP at the claim level.

Here's what actually issued. On January 24, 2023, Salesforce.com, Inc. was granted US11562147B2, "Unified vision and dialogue transformer with BERT," naming inventors Yue Wang, Chu Hong Hoi, and Shafiq Joty. The CPC codes pair dialogue and language classes (G06F 40/35, G06F 40/284) with the vision-side recognition class G06K 9/6217 and the neural-network learning class G06N 3/08.

The mechanism is multimodal unification. Vision-and-dialogue tasks — think discussing the contents of an image across a back-and-forth conversation — traditionally needed separate models for seeing and for talking. This grant unifies them in one transformer built on BERT, so a single model jointly handles the visual input and the dialogue turns. Fusing modalities in one architecture is the direction the field moved as multimodal models became central.

“A visual dialogue model receives image input and text input that includes a dialogue history between the model and a current utterance by a human user.” (U.S. Patent No. 11,562,147)

Reading the independent claim shows how concrete the unification is. Claim 1 describes a visual dialogue neural network language model that receives an image input and a text input, where the text input is itself two things bundled together: a dialogue history between the model and the user, and the user's current utterance. From those inputs and a transformer encoder network, the model generates what the claim calls a "unified contextualized representation" — a single representation that includes a token-level encoding of both the image and the text. That word "unified" is doing real work: rather than running a vision tower and a language tower in parallel and stitching their outputs together late, the claim folds both modalities into one token sequence the transformer attends over jointly.

The claim then walks through the encoding stack. From the unified representation, a set of "visual encoding layers" produces an "encoded visual dialogue input" that carries a position-level encoding and a segment-type encoding — the same positional and segment machinery BERT uses for text, here extended to mark visual versus textual spans. The system claim (claim 11) is more explicit about the fusion order: it generates an encoded image input with token, position, and segment encodings, generates an encoded text input with the same, then concatenates the two into a single input sequence before a pre-trained language model with self-attention mask layers produces the response. Notably, the position-level encoding for the image identifies "a spatial level ordering of frames" and the "spatial ordering of spatial regions within each frame" — so the model is told not just that a token is visual, but where in the image, and in which frame, it sits.
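To make the fusion order concrete, here is a minimal sketch of how an image and a dialogue might be packed into one input sequence with parallel token, position, and segment ids. It is an illustration under stated assumptions, not the patented implementation: the names `FusedInput`, `build_fused_input`, `IMAGE_SEGMENT`, and `TEXT_SEGMENT` are hypothetical, image regions are stood in for by string labels rather than detector features, and positions are plain sequence indices rather than the frame-and-region ordering the claim describes.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical segment ids; the patent only says a segment-type encoding exists.
IMAGE_SEGMENT = 0
TEXT_SEGMENT = 1


@dataclass
class FusedInput:
    tokens: List[str]     # token-level view of image regions and words
    positions: List[int]  # position id per token (simplified to sequence index)
    segments: List[int]   # marks each token as visual or textual


def build_fused_input(region_labels: List[str],
                      history_turns: List[str],
                      utterance: str) -> FusedInput:
    """Concatenate image-region tokens and dialogue tokens into one sequence."""
    tokens = ["[CLS]"]
    segments = [IMAGE_SEGMENT]

    # Image part: one placeholder token per detected region (an assumption;
    # a real model would carry a feature vector per region).
    for label in region_labels:
        tokens.append(f"[IMG:{label}]")
        segments.append(IMAGE_SEGMENT)
    tokens.append("[SEP]")
    segments.append(IMAGE_SEGMENT)

    # Text part: dialogue history followed by the current utterance.
    for turn in history_turns + [utterance]:
        for word in turn.split():
            tokens.append(word.lower())
            segments.append(TEXT_SEGMENT)
        tokens.append("[SEP]")
        segments.append(TEXT_SEGMENT)

    positions = list(range(len(tokens)))
    return FusedInput(tokens, positions, segments)


fused = build_fused_input(["r0", "r1"], ["is it sunny ?", "yes"], "any people ?")
for tok, pos, seg in zip(fused.tokens, fused.positions, fused.segments):
    print(pos, seg, tok)
```

The point of the sketch is the shape of the data: one list of tokens, with the segment ids alone telling the encoder which spans came from the image and which from the conversation.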

The most distinctive disclosed mechanism is the dual self-attention mask. Claim 1 produces the answer prediction using either "a first self-attention mask associated with discriminative settings" or "a second self-attention mask associated with generative settings." In the generative case, a first subset of mask settings is set to zero values that allow each token in the context sequence to attend to every other, which gives bidirectional context. Claim 4 sets a second, different subset to non-zero values governing attention to tokens "ahead of a subject token in an answer sequence." Read with the usual additive-mask convention, where zero permits attention and a large negative value blocks it, those non-zero entries keep an answer token from seeing what comes after it, which is the autoregressive, left-to-right constraint generation needs. The same encoder, in other words, is switched between fully visible discriminative attention and partially masked generative attention by toggling mask values. That is how one BERT-style encoder covers both the "pick the right answer from candidates" task and the "write the answer" task without two separate models.
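The two mask modes are easier to see as matrices. The sketch below builds both under the common additive-mask convention (0.0 means "may attend", negative infinity means "blocked"), which is an assumption here since the claims speak only of zero and non-zero values. The function names `discriminative_mask` and `generative_mask` are hypothetical, and the generative layout follows the familiar sequence-to-sequence pattern rather than anything the patent text spells out entry by entry.

```python
from typing import List

NEG_INF = float("-inf")  # assumed "non-zero" value that blocks attention


def discriminative_mask(seq_len: int) -> List[List[float]]:
    """Fully visible mask: every token may attend to every other token."""
    return [[0.0] * seq_len for _ in range(seq_len)]


def generative_mask(context_len: int, answer_len: int) -> List[List[float]]:
    """Context is bidirectional; answer tokens cannot see later answer tokens.

    Row i is the attending token, column j is the token being attended to.
    """
    n = context_len + answer_len
    mask = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Allowed if j is a context token, or j is not ahead of i.
            # Context rows therefore never see answer columns.
            if j < context_len or j <= i:
                mask[i][j] = 0.0
    return mask


def show(mask: List[List[float]]) -> str:
    """Render a mask with '.' for allowed and 'x' for blocked entries."""
    return "\n".join("".join("." if v == 0.0 else "x" for v in row) for row in mask)


print(show(discriminative_mask(5)))
print()
print(show(generative_mask(context_len=3, answer_len=2)))
```

Swapping one matrix for the other is the only change between the two settings in this sketch; the encoder weights and the input sequence stay the same.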

The dependent claims fill in the training and ranking detail. Claim 3 masks a random subset of tokens (including special tokens) and replaces them with a mask token using masked language modeling — classic BERT pretraining, applied across the fused image-text sequence. Claim 5 adds a next-sentence-prediction operation that judges whether an "appended answer" in the text input is correct, and claim 7 uses the final hidden vector of a special token as the head for that binary classification. Claims 8 through 10 describe a ranking module that emits "dense annotations" — relevance scores over multiple answer candidates — then combines and normalizes those scores into a probability distribution to fine-tune the ranking. The abstract calls this "dense annotation fine tuning" to "increase accuracy of the answer prediction." None of this is hand-waving about multimodality; it is a specific recipe for encoding, attending to, generating, and ranking answers in a visual conversation.
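The ranking step in claims 8 through 10 can be pictured as turning per-candidate relevance scores into a target distribution and nudging the model's scores toward it. The sketch below is one plausible reading, not the claimed procedure: `normalize`, `softmax`, and `ranking_loss` are hypothetical names, sum-normalization of the relevance scores and a cross-entropy loss are assumptions, and the numbers are made up for illustration.

```python
import math
from typing import List


def normalize(scores: List[float]) -> List[float]:
    """Scale non-negative relevance scores so they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("need at least one positive relevance score")
    return [s / total for s in scores]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the model's candidate scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def ranking_loss(model_logits: List[float], relevance: List[float]) -> float:
    """Cross-entropy between the normalized relevance target and the model's
    softmax distribution over answer candidates (an assumed loss choice)."""
    target = normalize(relevance)
    predicted = softmax(model_logits)
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Illustrative dense annotations for four answer candidates.
relevance = [1.0, 0.5, 0.5, 0.0]
print(normalize(relevance))
print(ranking_loss([2.0, 1.0, 1.0, -3.0], relevance))   # ordering agrees
print(ranking_loss([-3.0, 1.0, 1.0, 2.0], relevance))   # ordering disagrees
```

A model whose scores order the candidates the way the annotations do gets the smaller loss, which is the behavior fine-tuning on dense annotations is meant to reward.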

The CPC footprint is the tell. The simultaneous presence of a vision code and dialogue/language codes flags genuine multimodality rather than a vision model with a text label bolted on. For Salesforce, whose research lab produced a steady stream of multimodal and dialogue work in this period, the grant secures a unified-architecture method right as visual conversation — answering questions about an image over several turns — was becoming a real product surface. The value is in the architecture choice: one encoder, one token stream, two attention modes.

On scope: granted B2, enforceable, but the claims describe a specific unified vision-dialogue transformer with named components — the unified contextualized representation, the position and segment encodings spanning image and text, and the discriminative/generative mask pair. They do not cover multimodal models generally, nor BERT, nor every visual-dialogue system. The independent claims set the line, and that line runs through the dual-mask, single-encoder design, not the broad idea of mixing pictures and words.

The takeaway: US11562147B2 is early multimodal IP arriving in granted form — a single-transformer method that fuses seeing and talking into one token sequence and flips between discriminative and generative attention with a mask, held by a software company whose research consistently fed its product roadmap.
