On June 25, 2026, a patent application was published that takes a job long handled by hand-tuned signal-processing metrics, deciding whether an image is any good, and hands it to a vision language model. Titled "Evaluating Image Data For One Or More Imperfections Using A Vision Language Model," the application is assigned to NVIDIA Corporation and published as US20260179370A1. Before anything else, the label that matters: this is a published application, not a granted patent. It records what NVIDIA filed and asked an examiner to consider, not anything it can yet enforce. With that fixed, the question worth asking is what the application is actually directed to.
Read at the level of claim 1, the disclosed approach is a four-step pipeline. A system obtains image data for at least one image that includes one or more imperfections; it determines a query (a text prompt) based on that image, configured to make a vision language model (VLM) generate an output indicating the imperfections; it provides both the image data and the query to the VLM so the model produces that output; and it provides data on a degree of quality of the image, derived from the model's output. Claim 1 is written in apparatus form ("one or more processors comprising one or more circuits to…"), with parallel independent claims recasting the same operations as a system (claim 11) and as a method (claim 20). The core move is the same across all three: instead of computing a quality score with a fixed algorithm, the system asks a language-capable vision model, in natural language, what is wrong with the picture.
"In various examples, systems and methods are disclosed that relate to the evaluation of image data for imperfections using a vision language model (VLM). For example, a system can be configured to obtain image data associated with at least one image that includes one or more imperfections. The system can determine a query based at least on the at least one image that is usable to prompt the VLM when processing the image data. In some embodiments, the system can provide the image data and the query to the VLM to cause the VLM to generate the output in accordance with the query, indicating whether one or more imperfections are present in the at least one image."
Source: Evaluating Image Data For One Or More Imperfections Using A Vision Language Model, US20260179370A1
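The four steps of claim 1 can be sketched in a few lines of Python. This is only an illustration of the claimed sequence, not code from the application: the `VLM` callable, `build_query`, `parse_defects`, `evaluate_image`, the stub model, and the scoring rule (fraction of listed defect types not reported) are all assumptions made here for the example. The defect vocabulary is the list the article quotes from claim 4.

```python
from dataclasses import dataclass
from typing import Callable, List

# Defect vocabulary as quoted from claim 4 in the article.
DEFECT_TYPES = ["noise", "blur", "distortion", "artifacts",
                "chromatic aberration", "vignetting", "aliasing"]

# Hypothetical VLM interface: (image bytes, text query) -> text output.
VLM = Callable[[bytes, str], str]


@dataclass
class QualityReport:
    detected: List[str]  # defect types found in the model's output
    score: float         # 1.0 = none of the listed defects reported


def build_query(image: bytes) -> str:
    """Step 2: determine a query based on the image.

    Assumption: only the byte length is used, as a stand-in for real
    image features such as resolution or sensor type.
    """
    return (f"The image payload is {len(image)} bytes. Which of these "
            f"imperfections are present: {', '.join(DEFECT_TYPES)}? "
            "Answer with a comma-separated list, or 'none'.")


def parse_defects(output: str) -> List[str]:
    """Naive keyword match; a negation such as 'no blur' would still match."""
    text = output.lower()
    return [d for d in DEFECT_TYPES if d in text]


def evaluate_image(image: bytes, vlm: VLM) -> QualityReport:
    # Step 1 is the caller obtaining `image`.
    query = build_query(image)        # step 2: determine the query
    output = vlm(image, query)        # step 3: provide image + query to the VLM
    detected = parse_defects(output)  # step 4: derive quality data from the output
    score = 1.0 - len(detected) / len(DEFECT_TYPES)  # assumed scoring rule
    return QualityReport(detected, score)


def fake_vlm(image: bytes, query: str) -> str:
    """Stub standing in for a real model so the sketch runs offline."""
    return "blur, noise"


report = evaluate_image(b"\x00" * 16, fake_vlm)
print(report)
```

Swapping `fake_vlm` for a call to an actual model would leave the rest of the flow unchanged. The keyword parser and linear score are the weakest assumptions here; the claim language says only that the quality data is derived from the model's output.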
What the dependent claims narrow it to
The dependent claims fill in the kinds of imperfection and the prompting strategy, and they are where the scope gets concrete. Claim 4 enumerates the degradations the VLM can be prompted to flag — "noise, blur, distortion, artifacts, chromatic aberration, vignetting, or aliasing" — which is a roster of classic camera and rendering defects rather than semantic errors. Claims 5 and 6 are directed to multi-channel images, where a pixel carries a first and second channel associated with different portions of the electromagnetic spectrum, and the query is built from one channel or the other. Claims 7 and 8 are directed to using image annotations: the second channel carries annotation data, and the query is configured to prompt the VLM to flag mismatches "indicative of segmentation errors" — a tell that one intended use is inspecting labeled training data, not just photographs. Claim 9 adds an iterative loop: a first query detects a first imperfection, then a second query is generated from that result to probe for further imperfections, and the quality score is determined from the second output. Claims 2 and 3 are directed to determining a context from a feature of the image — including context carried over from a prior image in a sequence — and conditioning the query on that context. The picture that emerges from the dependents is a system that adapts its prompt to what it is looking at and can interrogate an image in more than one pass.
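The iterative loop of claim 9 and the context conditioning of claims 2 and 3 can be illustrated the same way. The sketch below is an assumption-laden toy, not the application's method: `first_query`, `second_query`, `two_pass_score`, the scripted stub model, and the scoring rule are hypothetical names and choices introduced for this example. What it preserves from the claims as described above is the ordering: the second query is generated from the first output, and the score is computed from the second output.

```python
from typing import Callable, List, Optional, Tuple

# Defect vocabulary as quoted from claim 4 in the article.
DEFECT_TYPES = ["noise", "blur", "distortion", "artifacts",
                "chromatic aberration", "vignetting", "aliasing"]

# Hypothetical VLM interface: (image bytes, text query) -> text output.
VLM = Callable[[bytes, str], str]


def first_query(context: Optional[str] = None) -> str:
    """First pass; optionally conditioned on context (claims 2 and 3)."""
    prefix = f"Context: {context}. " if context else ""
    return prefix + "Name the most prominent imperfection in this image, or say 'none'."


def second_query(first_output: str) -> str:
    """Second pass, generated from the first result (claim 9)."""
    return (f"A first pass reported: {first_output!r}. "
            f"List any further imperfections among: {', '.join(DEFECT_TYPES)}.")


def two_pass_score(image: bytes, vlm: VLM,
                   context: Optional[str] = None) -> Tuple[float, List[str]]:
    out1 = vlm(image, first_query(context))
    out2 = vlm(image, second_query(out1))
    # Per the claim as described, the score is based on the second output.
    found = [d for d in DEFECT_TYPES if d in out2.lower()]
    score = 1.0 - len(found) / len(DEFECT_TYPES)  # assumed scoring rule
    return score, found


calls: List[str] = []


def scripted_vlm(image: bytes, query: str) -> str:
    """Stub model: records each query and returns a canned answer per pass."""
    calls.append(query)
    return "blur" if len(calls) == 1 else "vignetting and aliasing"


score, found = two_pass_score(b"\x00" * 16, scripted_vlm,
                              context="night driving sequence")
print(score, found)
```

The `context` string stands in for whatever feature the system extracts from the current image or carries over from a prior frame; how that context is computed is left open here.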
The CPC class, and where it lands
The classification is the tell for the landscape. The application's main CPC classification is G06V 10/82, the subgroup within computer vision (G06V) for arrangements that use neural networks for image or video recognition and understanding. That placement is consistent with the disclosure: the claimed contribution is not a new optical metric or a hardware imaging change but a neural-network-based recognition method, where the network in question is a vision language model and the recognition target is the presence and type of an imperfection. It sits in the same G06V neighborhood where the AI sector's vision and multimodal filings cluster, and it reads as a generative-AI take on the long-standing problem of automated visual inspection: rather than a bespoke classifier trained per defect, a general VLM is prompted to describe defects in language and the language is mapped to a quality score.
Claim 10 makes the breadth of intended deployment explicit, listing the systems the processors may be "comprised in" — among them a control or perception system for an autonomous or semi-autonomous machine, a robot, an aerial system, a medical system, a system for generating synthetic data, a system performing generative-AI operations, and a system implemented at least partially in a data center. That kind of omnibus environment clause is standard NVIDIA claim drafting; read narrowly, it signals the intended fields of use — autonomous-machine perception, synthetic-data generation, and data-center AI — without itself narrowing what the independent claim covers. The operative scope still lives in the four steps of claim 1.
Read against NVIDIA's own recent record, the application does not publish alone. This week's drop and the surrounding cluster carry a run of related NVIDIA applications. US20260179374A1 and US20260178914A1 sit adjacent in the same publication run, and US20260179239A1 falls in the same numerical neighborhood. Slightly earlier in the sequence are US20260177400A1, US20260175877A1, and US20260178890A1. Taken together, the surrounding records are consistent with NVIDIA's broader filing pattern in computer vision and generative AI: the same assignee that supplies the accelerators these models run on is filing on the methods that run on them.
What the application claims, then, is narrower and more specific than "use AI to check images." It is a VLM-prompting pipeline: obtain an image with possible imperfections, generate a query from the image, run the image and query through a vision language model, and return a quality score derived from the model's output. The dependent claims pin the defect types to a defined list, add multi-channel and annotation-aware prompting, and allow an iterative second pass. Whether the claims that ultimately issue track the published language is a question for prosecution, and the scope an examiner allows may be narrower than the disclosure reads. For the purpose of reading the record as filed, the coverage is plain on its face: a generative-AI visual-inspection method, classified under G06V 10/82, that asks a vision language model what is wrong with a picture and turns the answer into a number.