US-12579713-B2

Text-guided image editing by learning guidance scales via reinforcement learning

Published: March 17, 2026

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

Certain aspects of the present disclosure provide techniques and apparatus for improved machine learning. In an example method, a first latent tensor generated during a first iteration of processing data using a denoising backbone of a diffusion machine learning model is accessed. A guidance scale is generated based on processing the first latent tensor using a guidance machine learning model. A second latent tensor is generated during a second iteration of processing data using the denoising backbone based on the first latent tensor and the first guidance scale, and an output from the diffusion machine learning model is generated based at least in part on the second latent tensor.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A processing system comprising:

. The processing system of, wherein:

. The processing system of, wherein the one or more processors are configured to further execute the processor-executable instructions and cause the processing system to:

. The processing system of, wherein:

. The processing system of, wherein the first guidance scale is generated based further on processing a time step embedding corresponding to the first iteration using the guidance machine learning model.

. The processing system of, wherein:

. A processor-implemented method of image generation, comprising:

. The processor-implemented method of, wherein:

. The processor-implemented method of, further comprising:

. The processor-implemented method of, wherein:

Detailed Description

Complete technical specification and implementation details from the patent document.

Aspects of the present disclosure relate to machine learning.

A wide variety of machine learning models have been trained for a similarly vast assortment of tasks in recent years. For example, generative models (e.g., generative adversarial models (GANs), diffusion models, and the like) have been trained to generate new output data (e.g., images or text) based on input prompts. In some cases, generative models have been trained to enable input editing based on various prompts. For example, some models are able to receive an input image (e.g., a picture of a sailboat) and a textual prompt indicating how to edit or transform the image (e.g., “make the sail green”). The generative image editing model can generate an edited image that is similar to the reference image, but modified in accordance with the prompt (e.g., an image of a sailboat with green sails).

Certain aspects of the present disclosure provide a processor-implemented method, comprising: accessing a first latent tensor generated during a first iteration of processing data using a denoising backbone of a diffusion machine learning model; generating a first guidance scale based on processing the first latent tensor using a guidance machine learning model; generating a second latent tensor during a second iteration of processing data using the denoising backbone based on the first latent tensor and the first guidance scale; and generating an output from the diffusion machine learning model based at least in part on the second latent tensor.

Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.

The following description and the related drawings set forth in detail certain illustrative features of one or more aspects.

To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.

Aspects of the present disclosure provide apparatuses, methods, processing systems, and non-transitory computer-readable mediums for providing improved machine learning.

In some aspects of the present disclosure, machine learning models for text-guided image editing are provided. In such tasks, a machine learning model is provided with a reference image and a textual prompt or instruction. The model is tasked with generating an output image that preserves the original image while also fulfilling the textual instruction. Although text-based image editing is used in some examples, aspects of the present disclosure are readily applicable to a wide variety of other generative tasks, such as video editing, audio editing, editing inputs based on image and/or audio prompts (in addition to or instead of textual prompts), and the like.

Many guided diffusion architectures (e.g., classifier-free diffusion models) rely on scale hyperparameters to determine the influence of the guidance. Such scales (referred to in some aspects as guidance scales) are generally used to determine or control the amount of influence the prompt is given. For example, suppose a reference image depicting a marble statue is provided as input, as well as a textual prompt such as “turn the statue into a cyborg.” In some aspects, the guidance scale(s) define how much weight to give the prompt, as compared to the image. For example, low scales (e.g., low weight for the prompt) may cause the output to be very similar to the reference image, with minimal editing. In contrast, high scales (e.g., high weight for the prompt) may cause the output to be very faithful to the textual prompt, potentially sacrificing or losing substantial details from the original reference image.
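The effect of a guidance scale can be illustrated with the commonly used classifier-free guidance combination for image editing, in which one scale weights the reference image and another weights the prompt. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `guided_noise`, the three-prediction formulation, and the toy values are all hypothetical.

```python
from typing import List

Vector = List[float]


def guided_noise(eps_uncond: Vector, eps_image: Vector, eps_full: Vector,
                 image_scale: float, text_scale: float) -> Vector:
    """Blend three noise predictions using two guidance scales.

    eps_uncond: prediction with no conditioning
    eps_image:  prediction conditioned on the reference image only
    eps_full:   prediction conditioned on the reference image and the prompt
    (Assumed formulation; the source does not specify the combination rule.)
    """
    return [
        u + image_scale * (i - u) + text_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Toy predictions: the prompt only affects the second component.
eps_uncond = [0.0, 0.0]
eps_image = [1.0, 0.0]
eps_full = [1.0, 2.0]

# A text scale of zero ignores the prompt; a larger scale amplifies it.
low_text = guided_noise(eps_uncond, eps_image, eps_full, image_scale=1.0, text_scale=0.0)
high_text = guided_noise(eps_uncond, eps_image, eps_full, image_scale=1.0, text_scale=3.0)
```

With the text scale at zero the result equals the image-only prediction, while raising the text scale moves the result further along the direction contributed by the prompt.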

In some systems, the scales are manually defined as hyperparameters (e.g., based on trial and error). For example, a user (e.g., a data scientist) may iteratively set the scale(s) to a given value (or set of values) and generate output(s). By visually evaluating each such output (generated using different scales), the user may subjectively select which guidance scale value(s) the user prefers. Such systems do not allow or enable any objective way to determine the optimal (or at least improved) guidance scales, let alone to determine appropriate scales on a per-sample basis. However, generation quality (e.g., the quality of the generated images) is highly sensitive to these scale hyperparameters, particularly when there are multiple forms of guidance.

In some aspects of the present disclosure, guidance scales can be dynamically generated for a given sample (e.g., based on the input image and/or prompt). In some aspects, the guidance can be varied as often as each iteration or time step (e.g., generating new guidance scales for each iteration of a denoising backbone), or more sparsely through the generation process of a sample (e.g., generating a single set of guidance scales that are used at each iteration).

In some aspects, guidance scales can be generated using a relatively small machine learning model (e.g., a neural network), referred to herein as a “guidance machine learning model,” that uses various data as input. Generally, the particular inputs may vary depending on the particular implementation. For example, in some aspects, the guidance machine learning model may process data such as latent tensor(s) being denoised, an embedding of the input reference image, an embedding of the input text prompt, a time step embedding indicating which iteration is being performed, and the like.
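As a rough illustration of such a guidance machine learning model, the sketch below concatenates the listed inputs and maps them to non-negative scales with a single linear layer. Everything here is an assumption made for illustration: the class `GuidanceModel`, the sinusoidal `timestep_embedding`, the softplus output, and the untrained random weights are hypothetical and are not taken from the disclosure.

```python
import math
import random
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable softplus; keeps the predicted scales non-negative."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def timestep_embedding(step: int, dim: int = 4) -> List[float]:
    """Sinusoidal encoding of the iteration index (an assumed encoding)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (k / half)) for k in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


class GuidanceModel:
    """Hypothetical one-layer guidance model: input features -> guidance scales."""

    def __init__(self, num_features: int, num_scales: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Untrained weights; in practice these values would be learned.
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(num_features)]
                        for _ in range(num_scales)]
        self.bias = [1.0] * num_scales

    def __call__(self, latent: Sequence[float], prompt_emb: Sequence[float],
                 image_emb: Sequence[float], step_emb: Sequence[float]) -> List[float]:
        features = [*latent, *prompt_emb, *image_emb, *step_emb]
        if len(features) != len(self.weights[0]):
            raise ValueError("unexpected number of input features")
        return [
            softplus(sum(w * f for w, f in zip(row, features)) + b)
            for row, b in zip(self.weights, self.bias)
        ]


# Two scales, e.g. one for the prompt and one for the reference image.
model = GuidanceModel(num_features=3 + 2 + 3 + 4, num_scales=2)
scales = model(latent=[0.2, -0.1, 0.4], prompt_emb=[0.5, 0.5],
               image_emb=[0.1, 0.0, -0.3], step_emb=timestep_embedding(5))
```

The model is deliberately tiny relative to a denoising backbone, which reflects the "relatively small" framing in the text.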

The accompanying figure depicts an example workflow for performing inverse diffusion using diffusion machine learning models with dynamic guidance scales, according to some aspects of the present disclosure. In some aspects, the workflow is performed by a machine learning system (e.g., a computing system configured to perform machine learning operations).

In the illustrated example, a prompt (e.g., a textual prompt) and a reference image are processed using a diffusion model (also referred to in some aspects as a diffusion machine learning model) to generate a generated image. In some aspects, the prompt comprises natural language text indicating how the reference image should be modified or edited. For example, the reference image may depict a sailboat in the ocean, and the prompt may include “change the sails to blue.” In the illustrated example, the generated image generally corresponds to the reference image, as modified based on the prompt. For example, the generated image may depict the sailboat with blue sails. As discussed above, in some aspects, the guidance scale(s) used by the diffusion model may affect the generated image. For example, low guidance scales for the prompt may result in a generated image that is highly similar to the reference image (e.g., the same sailboat, with the sails somewhat more blue), while high guidance scales may result in a generated image that closely follows the prompt (e.g., a sailboat with blue sails, but where other features such as details of the boat, the ocean, the background, and/or the like, may be changed).

As illustrated, the diffusion model may generally use two main operations to generate the generated image: a forward pass and a reverse or inverted pass. Generally, during the forward pass, the diffusion model iteratively adds noise to the reference image. In some aspects, noise is added until the reference image effectively contains random (e.g., Gaussian) noise. During the reverse pass (also referred to as the inverse pass and/or the denoising operation), the diffusion model iteratively removes the noise, conditioned based on the prompt, to yield the generated image. In some aspects, this reverse pass is performed using a denoising backbone of the diffusion model. As used herein, a “denoising backbone” refers to one or more components of a diffusion machine learning model that are used to denoise latent tensors to recover or generate a target output (e.g., an image). For example, one or more trained components (e.g., components that transform latent tensors based on parameters having values learned during a training phase) may be used to iteratively remove noise and/or construct signal in the latent tensor based on various conditioning (e.g., based on textual prompts) to generate outputs.

Specifically, in the illustrated example, the prompt and reference image are first processed by an embedding operation, which generates a prompt embedding and an image embedding (depicted as a latent tensor A), respectively. The embedding operation may generally correspond to a trained component (e.g., an operation that uses parameters having values learned during training) that generates embeddings for input data. For example, the embedding operation may project the input to a latent space, where each embedding is a relatively high dimension tensor (e.g., a vector having a relatively large number of values) in the latent space. In some aspects, the prompt and reference image may be processed using separate embedding operations (e.g., a first component trained to generate text embeddings and a second component trained to generate image embeddings).

In the illustrated example, the embedding operation generates a latent tensor A. In some aspects, the latent tensor A corresponds to the reference image. That is, the latent tensor A may be the embedding of the reference image. As illustrated, the latent tensor A is then processed by a noising operation A to generate a latent tensor B. In some aspects, the noising operation A generally corresponds to adding at least some amount of noise to the latent tensor A (e.g., perturbing or changing one or more values in the latent tensor A). In some aspects, the noising operation A adds random Gaussian noise. In some aspects, the noising operation A is a trained component (e.g., adding noise based on parameters having values learned during training).

As illustrated, the latent tensor B is then processed using another noising operation B to generate another latent tensor having more noise than the latent tensor B. Generally, as discussed above, this forward diffusion process iteratively adds noise over multiple iterations (also referred to as time steps in some aspects) until a noisy latent tensor N is generated by the final noising operation N. As indicated by the ellipses, the diffusion model may use any number of iterations. Although depicted as discrete noising operations A-N for conceptual clarity, in some aspects, the noising operations may use shared parameters. That is, the diffusion model may use the same noising operation to iteratively process the latent tensors for N iterations. In some aspects, this use of noising operation(s) to generate the latent tensor N may be referred to as the forward diffusion process, as discussed above.

More generally, the forward diffusion process may be defined using P(x_{t+1} | x_t). That is, the latent tensor in a given iteration x_{t+1} (at iteration t+1) may be generated based on processing the latent tensor from the prior iteration (x_t) using a noising operation. By repeating this noising process for some number of iterations, the latent tensor N is generated. In some aspects, the latent tensor N corresponds to or comprises random (e.g., Gaussian) noise. By convention, the output of the final iteration of the forward diffusion process (e.g., the noising operation N) is referred to as the T-th output (e.g., after adding noise at time step T−1), and the first iteration (e.g., the noising operation A) is referred to as the 0-th iteration (e.g., adding noise at time step 0).
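The iterative forward process can be sketched as a loop that perturbs the previous latent at each time step. The sketch assumes a variance-preserving Gaussian noising step of the kind commonly used in diffusion models; the source does not specify the noising rule, and the names `noising_step` and `forward_diffusion` and the `betas` schedule are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def noising_step(x_t: Vector, beta: float, rng: random.Random) -> Vector:
    """Sample the next latent from the previous one (assumed Gaussian step)."""
    keep = math.sqrt(1.0 - beta)   # fraction of the existing signal retained
    spread = math.sqrt(beta)       # standard deviation of the added noise
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x_t]


def forward_diffusion(x_0: Vector, betas: List[float], seed: int = 0) -> List[Vector]:
    """Return every latent from the initial one to the final noisy one."""
    rng = random.Random(seed)
    latents = [list(x_0)]
    for beta in betas:
        # The same noising operation is reused at every iteration.
        latents.append(noising_step(latents[-1], beta, rng))
    return latents


x_0 = [1.0, -0.5, 0.25]          # stand-in for the reference-image latent
betas = [0.1, 0.2, 0.3, 0.4]     # hypothetical noise schedule, one value per step
trajectory = forward_diffusion(x_0, betas)
```

Keeping the whole trajectory, rather than only the final latent, is convenient when later steps need to revisit intermediate latents.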

As illustrated, the latent tensor N is then processed using a denoising operation N, as well as by a guidance component N. The guidance component N processes the latent tensor N to generate a set of one or more guidance scales (e.g., values for the guidance scale(s)), which are provided to the denoising operation N. The denoising operation N processes the latent tensor N to generate a new (relatively denoised) latent tensor based at least in part on the guidance scale(s).

In the illustrated example, the denoising operation N further receives the embedding (generated based on the prompt) as input to generate the denoised latent tensor. Although not depicted in the illustrated example, in some aspects, each subsequent denoising operation may similarly receive, as input, the embedding of the prompt. For example, as discussed above, the guidance scales may indicate how much the latent tensor should reflect the prompt, as compared to how much the latent tensor should reflect the reference image. In some aspects, where multiple prompts are used, the guidance scales may indicate how much each prompt affects the output latent tensor. For example, suppose the prompt includes multiple text prompts such as “add trees to the background,” “make it nighttime,” and “delete the red car.” In some aspects, the guidance scales may indicate a weight for each of these prompts, and these weights may all be different.

In some aspects, the denoising operation N and the guidance component N are trained components (e.g., operations performed using parameters having values learned during a training operation). Although not depicted in the illustrated example, in some aspects, the guidance component N may receive additional inputs to generate the guidance scales. For example, the guidance component N may evaluate inputs such as the prompt (or the embedding), the reference image (or the embedding of the reference image, such as the latent tensor A), the guidance scale(s) used during a prior iteration of the denoising backbone (if any), and/or the like. In some aspects, the guidance component N may be referred to as a guidance machine learning model.

As illustrated, a subsequent denoised latent tensor is then processed by a guidance component B, which generates guidance scales that are input to a denoising operation B. The denoising operation B processes the input latent tensor and the guidance scales to generate a new latent tensor B. Further, the latent tensor B is processed by a guidance component A, which generates a new set of guidance scales. These new guidance scales are used by the denoising operation A, along with the latent tensor B, to generate a latent tensor A.

Generally, as discussed above, this reverse diffusion process iteratively removes noise over multiple iterations (also referred to as time steps in some aspects) until the denoised latent tensor A is generated. As indicated by the ellipses, the diffusion model may use any number of iterations. Although depicted as discrete denoising operations A-N for conceptual clarity, in some aspects, the denoising operations may use shared parameters. That is, the diffusion model may use the same denoising operation to iteratively process the latent tensors for N iterations. In some aspects, the denoising operation(s) may be referred to as the denoising backbone, as discussed above.

More generally, the reverse diffusion process may be defined using q_θ(x_{t−1} | x_t). That is, the latent tensor in a given iteration x_{t−1} (at iteration t−1) may be generated based on processing the latent tensor from the prior iteration (x_t) using a denoising operation (q_θ) which uses parameters θ (e.g., trained parameters having values learned during training). By repeating this denoising process for some number of iterations, the latent tensor A is generated. By convention, the final output of the reverse diffusion process (e.g., the denoising operation A) is referred to as the 0-th output (e.g., after removing noise at time step 1), and the first iteration (e.g., the denoising operation N) is referred to as the T-th iteration (e.g., removing noise at time step T).

Additionally, in some aspects, the guidance machine learning model may be defined using Π_Θ(λ | x_t). That is, the guidance scales λ for a given iteration may be generated based on processing the latent tensor from the prior iteration (x_t) using the guidance machine learning model (Π), which uses parameters Θ (e.g., trained parameters having values learned during training). By repeating this guided denoising process for some number of iterations, the latent tensor A is generated.

In some aspects, as discussed above, each guidance component may further evaluate additional data, such as the embedding of the reference image, the embedding of the prompt, an embedding of the current iteration or time step, the guidance scale(s) from the prior iteration, and/or the like. For example, the guidance component A may process the latent tensor B, the text prompt embedding, the reference image embedding (e.g., the latent tensor A), the time step embedding for the 0-th iteration (e.g., an embedding indicating that the 0-th iteration is currently being performed), and/or the guidance scales generated by the guidance component B, to generate guidance scales for the denoising operation A in the final iteration.

In some aspects, as discussed above, multiple guidance scales may be generated for any given iteration. For example, the guidance components may generate a separate scale for each prompt (if multiple prompts are used), a scale for the reference image, and the like. Although the illustrated example depicts generating new guidance scales for each iteration of processing data using the denoising backbone, in some aspects, the machine learning system may generate guidance scales more sparsely. For example, a set of guidance scales may be generated based on processing the latent tensor N using a guidance machine learning model, and these guidance scales may then be used for multiple (e.g., for all of the) denoising iterations.
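The difference between per-iteration and sparse guidance-scale generation can be sketched with a toy denoising loop. This is only an illustration under stated assumptions: `toy_noise_predictions` and `toy_guidance` are hypothetical stand-ins for the denoising backbone and the guidance machine learning model, and the update rule is a simplified one chosen for readability.

```python
from typing import Callable, List, Tuple

Vector = List[float]

REFERENCE = [0.0, 0.0]       # stand-in for the reference-image latent
PROMPT_TARGET = [1.0, 1.0]   # stand-in for a latent that satisfies the prompt


def toy_noise_predictions(x: Vector) -> Tuple[Vector, Vector]:
    """Toy backbone: noise estimates that pull x toward each target."""
    eps_ref = [xi - r for xi, r in zip(x, REFERENCE)]
    eps_prompt = [xi - p for xi, p in zip(x, PROMPT_TARGET)]
    return eps_ref, eps_prompt


def toy_guidance(x: Vector, step: int) -> float:
    """Toy guidance model: the scale depends on the current latent.

    `step` is accepted so a time step embedding could be used; it is ignored here.
    """
    mean_abs = sum(abs(xi) for xi in x) / len(x)
    return 1.0 / (1.0 + mean_abs)


def denoise(x_T: Vector, num_steps: int, guidance_fn: Callable[[Vector, int], float],
            per_iteration: bool = True, step_size: float = 0.5) -> Tuple[Vector, List[float]]:
    x = list(x_T)
    # Sparse mode: one scale from the noisiest latent, reused at every iteration.
    fixed_scale = None if per_iteration else guidance_fn(x, num_steps)
    used_scales = []
    for t in range(num_steps, 0, -1):
        scale = guidance_fn(x, t) if fixed_scale is None else fixed_scale
        eps_ref, eps_prompt = toy_noise_predictions(x)
        # The scale sets how far the step leans toward the prompt prediction.
        eps = [r + scale * (p - r) for r, p in zip(eps_ref, eps_prompt)]
        x = [xi - step_size * e for xi, e in zip(x, eps)]
        used_scales.append(scale)
    return x, used_scales


x_T = [2.0, -2.0]
_, dynamic_scales = denoise(x_T, num_steps=4, guidance_fn=toy_guidance, per_iteration=True)
_, sparse_scales = denoise(x_T, num_steps=4, guidance_fn=toy_guidance, per_iteration=False)
```

In the per-iteration mode the scale changes as the latent is denoised, whereas the sparse mode calls the guidance function once and therefore costs less per sample.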

In the illustrated example, the latent tensor A is processed by a decoding operation to generate the generated image. The decoding operation may generally correspond to a trained component (e.g., an operation that uses parameters having values learned during training) that generates images based on input latents. For example, the decoding operation may project the latent tensor from the latent space to the image space.

In some aspects, the workflow uses a pre-trained diffusion model. That is, some portions of the diffusion model (e.g., the embedding operation, the noising operation(s), the denoising operation(s), and/or the decoding operation) may be pre-trained components (e.g., components of a pre-trained classifier-free diffusion model). Rather than manually defining the guidance scales, the guidance component(s) may then be trained to generate guidance scales for the denoising process.

In some aspects, the guidance component(s) A-N are trained using one or more diffusion loss functions. For example, in some aspects, the inputs to the guidance component(s) A-N include the latent tensor generated for the previous time step or iteration (e.g., the latent tensor B generated by the denoising operation B is used as input to the guidance component A). In some aspects, the guidance component inputs further include the text conditioning (e.g., the embedding). As discussed above, the guidance component uses these inputs to generate or predict the guidance scalar value(s), which are then consumed by the pre-trained diffusion model (e.g., the denoising operation during the current iteration) to generate the next denoised latent output. In some embodiments, the final generated output from the diffusion model (e.g., the generated image) can then be compared against the ground-truth output (e.g., the target edited image) and a loss value can be computed using standard diffusion loss. This loss may then be used to refine the parameters of the guidance components.

A further figure depicts an example workflow for merging trajectories using dynamic guidance scales in a denoising backbone of a diffusion model, according to some aspects of the present disclosure. In some aspects, the workflow is used by a machine learning system, such as the machine learning system discussed above.

In some aspects, the workflow depicts the forward and reverse diffusion process for a diffusion model, such as the diffusion model discussed above. In the illustrated example, a diffusion model (e.g., a pre-trained model) may be utilized to pass an image through the forward process of diffusion, creating a sequence of latent tensors from x_0 (the initial latent tensor) to x_T (the final noised latent tensor). In some aspects, the noise maps (e.g., the maps used to add noise to the interim latent tensors at each iteration) are then used to create inverse noise maps z for each iteration. These inverse noise maps can then be utilized until a “skip” step threshold is reached, as discussed in more detail below. Then, the latent tensors may be edited by injecting a new edit caption to generate a new diffusion direction.

Specifically, the initial latent tensor (which may correspond to an image embedding for a reference image, such as the latent tensor A for the reference image discussed above) is processed by a first noising operation A (e.g., the noising operation A discussed above) to generate an interim latent tensor A. The latent tensor A is then processed using a noising operation B to generate a latent tensor B, which is processed using a noising operation C to generate a latent tensor C. The latent tensor C is processed using a noising operation D to generate a latent tensor D, which is processed using a noising operation E to generate a latent tensor E. The latent tensor E is processed using a noising operation F to generate a latent tensor F, which is processed using a noising operation G to generate the final noised latent tensor (e.g., the latent tensor N discussed above).

As discussed above, in some aspects, each noising operation A-G may correspond to using a single noising component iteratively. Although seven noising iterations are depicted, in some aspects, the machine learning system may use any number of noising iterations, as discussed above. In some aspects, the number of noise iterations may be a hyperparameter of the diffusion model.

In the illustrated workflow, the machine learning system then performs the reverse diffusion process. As discussed above, in some systems, the text prompt(s) may be used to condition this reverse diffusion process for one or more iterations. However, in the illustrated example, the machine learning system may “skip” this conditioning for one or more iterations.

Specifically, in the illustrated example, the final noised latent tensor is processed using a denoising operation A (e.g., the denoising operation N discussed above) to generate or recover the latent tensor F. In some aspects, the denoising operation A corresponds to using the denoising backbone but not conditioning the diffusion using the prompt text (e.g., to recover the original reference image). In some aspects, the denoising operation A corresponds to removing the noise that was added during the noising operation G (e.g., based on the noise map from that iteration). In some aspects, rather than generating the latent tensor F, the machine learning system may instead store the latent tensor F during the forward diffusion, and retrieve this stored latent tensor F during the reverse diffusion process.

Further, as illustrated, the latent tensor F is processed using a denoising operation B to yield the latent tensor E, and the latent tensor E is processed using a denoising operation C to yield the latent tensor D. In some aspects, as discussed above, the denoising operations B and C may correspond to processing the latents with a denoising process without conditioning based on the prompt, may correspond to applying the inverted noise maps generated during the forward pass, and/or may correspond to retrieving the latent tensors E and D from storage or memory. That is, in some aspects, the denoising operations B and C may perform the denoising operations without using the prompt (e.g., by using an empty string, rather than the prompt text).

In the illustrated example, the denoising iteration that receives the latent tensor D as input serves as a first skip step, where the diffusion trajectory is split. Specifically, as illustrated, the latent tensor D is processed using a denoising operation H to generate a latent tensor J. As illustrated, this latent tensor J diverges from the original trajectory of the forward pass, and represents the conditioning of the denoising. For example, the denoising operation H may be performed by conditioning the denoising based on the text prompt, as discussed above. Advantageously, by only beginning the prompt conditioning in this interim iteration (rather than for the entire denoising backbone), the machine learning system may generate output images that are more similar to the reference image, preserving the original structure and features.

As illustrated along this trajectory, the latent tensor J is processed using a denoising operation I to generate a latent tensor K, which is then processed using a denoising operation J to generate a latent tensor L. The latent tensor L is then processed using a denoising operation K to generate a latent tensor B (labeled x″). In some aspects, some (or all) of the denoising operations H, I, J, and/or K may be performed as discussed above, using the prompt text(s) to condition the denoising in order to generate an edited image.

In the illustrated workflow, the latent tensor D is also processed using a denoising operation D to yield the latent tensor C. As discussed above, in some aspects, this denoising operation D follows the original trajectory of the forward pass, and is generally performed without conditioning the denoising using the text prompt. As illustrated, this iteration serves as a second skip step for the denoising. Specifically, as illustrated, the latent tensor C is processed using a denoising operation E to generate a latent tensor H. This latent tensor H also diverges from the original trajectory of the forward pass, and represents the conditioning of the denoising. For example, the denoising operation E may be performed by conditioning the denoising based on the text prompt, as discussed above.

As illustrated along this trajectory, the latent tensor H is processed using a denoising operation F to generate a latent tensor I, which is then processed using a denoising operation G to generate a latent tensor A (labeled x′). In some aspects, some (or all) of the denoising operations E, F, and/or G may be performed as discussed above, using the prompt text(s) to condition the denoising in order to generate an edited image.

Generally, the latent tensors A and B may correspond to different edits or revisions to the reference image (e.g., different diffusion trajectories), where the differences may be caused by different skip steps, different conditioning prompts, or a combination of different prompts and different skip steps. In some aspects, the latent tensors A and B may each be processed by a decoder (e.g., the decoding operation discussed above) to generate output images (e.g., the generated image discussed above).

Although not illustrated in the depicted example, in some aspects, one or more of the denoising operations may also use guidance scales generated dynamically based on one or more of the latent tensors, as discussed above. For example, the denoising operation H may include use of guidance scales that are generated based (at least in part) on the latent tensor D (e.g., using a guidance machine learning model, such as the guidance component discussed above).

In some aspects, a denoising trajectory (also referred to in some aspects as an inverse trajectory) may be defined based on the “skip step” (e.g., which iteration the machine learning system begins to condition the denoising based on the text prompt). For example, one trajectory may have a skip step at the N-th iteration, while another trajectory has a skip step at the M-th iteration. Generally, skip steps nearer to the beginning of the denoising process result in outputs that are closer to the prompt and/or further from the reference image, as compared to skip steps nearer to the end of the denoising process. In some aspects, a trajectory may additionally or alternatively be defined based on the conditioning that is used. For example, one trajectory may correspond to conditioning the denoising based on a first text prompt, while a second trajectory corresponds to conditioning the denoising based on a second text prompt. Generally, aspects of the present disclosure can be used to merge or combine any number of trajectories, regardless of how and when those trajectories diverged in the denoising process.
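The relationship between the skip step and the output can be illustrated with a toy trajectory. In this sketch, the unconditioned iterations are modeled by retrieving stored forward-pass latents, and each conditioned iteration pulls the latent toward a prompt target. The function `edit_with_skip`, the `pull` factor, and the hand-written latents are hypothetical and chosen only to make the trend visible.

```python
import math
from typing import List

Vector = List[float]


def edit_with_skip(stored_latents: List[Vector], prompt_target: Vector,
                   skipped: int, pull: float = 0.5) -> Vector:
    """Run `skipped` unconditioned iterations, then condition on the prompt.

    stored_latents[t] is the latent after t noising steps (index 0 is clean).
    """
    total_steps = len(stored_latents) - 1
    if not 0 <= skipped <= total_steps:
        raise ValueError("skipped must lie between 0 and the number of steps")
    # Skipped iterations simply follow the original trajectory back.
    t = total_steps - skipped
    x = list(stored_latents[t])
    # Remaining iterations are conditioned (toy rule: move toward the target).
    for _ in range(t):
        x = [xi + pull * (p - xi) for xi, p in zip(x, prompt_target)]
    return x


def distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Toy forward trajectory for a reference latent at the origin.
stored = [[0.0, 0.0], [0.1, -0.1], [0.2, -0.2], [0.3, -0.3], [0.4, -0.4]]
prompt_target = [1.0, 1.0]

early = edit_with_skip(stored, prompt_target, skipped=0)  # condition from the start
late = edit_with_skip(stored, prompt_target, skipped=3)   # condition only at the end
```

In this toy setting, `early` ends closer to the prompt target and `late` stays closer to the reference latent, matching the trend described above.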

In some aspects, the guidance model(s) can be used to combine different inverse trajectories via compositionality based on different scaling factors. That is, the guidance model may be used to generate guidance scales that are used to aggregate latent tensors from different trajectories. For example, the latent tensors A and B may be combined using dynamically generated guidance scales (e.g., generated by the guidance model based on input such as the latent tensors A-B) to weight the latent tensors A and B. This combined latent tensor may then be decoded (e.g., using the decoding operation discussed above) to generate an output image.

As another example, in some aspects, different trajectories may be merged for a given time step or iteration using the guidance scales, and the resulting merged latent tensor can then be used as input to the next time step or iteration. For example, the latent tensors C and J may be merged using dynamic guidance scales, and the resulting aggregated latent tensor can be used as the current latent for the next iteration. In some aspects, at the next time step or in the next iteration, the merged or aggregated latent tensor can be used to compute or generate new latent tensor(s) corresponding to different trajectories, and these new latent tensors can then be similarly merged (using dynamic guidance scales) before proceeding to the following iteration.

In this way, different latent tensors may be generated in a given iteration (e.g., using different conditioning based on different text prompts or sub-prompts), and these latents can be merged using dynamic guidance scales. The merged tensor can then be used to generate new latents (corresponding to new trajectories) for the next iteration, and so on until the final iteration is completed.
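The per-iteration merging described above can be sketched as follows. The normalized weighted sum in `merge_latents` is an assumption, since the source only states that guidance scales are used to weight the latent tensors; the toy update rule and the names `merged_denoising` and `scale_fn` are likewise hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def merge_latents(latents: List[Vector], scales: List[float]) -> Vector:
    """Combine latents from different trajectories (assumed normalized weighting)."""
    total = sum(scales)
    if total <= 0.0:
        raise ValueError("guidance scales must sum to a positive value")
    weights = [s / total for s in scales]
    return [
        sum(w * latent[i] for w, latent in zip(weights, latents))
        for i in range(len(latents[0]))
    ]


def merged_denoising(x: Vector, prompt_targets: List[Vector],
                     scale_fn: Callable[[List[Vector]], List[float]],
                     num_steps: int, pull: float = 0.5) -> Vector:
    """Branch per prompt, merge with dynamic scales, and repeat each iteration."""
    for _ in range(num_steps):
        # One branch per prompt (toy rule: move toward that prompt's target).
        branches = [
            [xi + pull * (p - xi) for xi, p in zip(x, target)]
            for target in prompt_targets
        ]
        # scale_fn stands in for the guidance model and sees the branch latents.
        scales = scale_fn(branches)
        x = merge_latents(branches, scales)
    return x


merged_once = merge_latents([[0.0, 2.0], [4.0, 6.0]], [1.0, 3.0])
merged_final = merged_denoising(
    x=[0.0, 0.0],
    prompt_targets=[[2.0, 0.0], [0.0, 2.0]],
    scale_fn=lambda branches: [1.0] * len(branches),  # equal weighting for the demo
    num_steps=2,
)
```

With equal scales the merged latent moves toward the midpoint of the two prompt targets; unequal scales would bias it toward one trajectory.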

A further figure is a flow diagram depicting an example method for generating images using diffusion models and guidance machine learning models, according to some aspects of the present disclosure. In some aspects, the method is performed by a machine learning system, such as the machine learning system discussed above.

At a first block, the machine learning system accesses a reference image (e.g., the reference image discussed above) and a prompt (e.g., the prompt discussed above). In some aspects, as discussed above, the reference image is used as the base or starting point of the desired model output, while the prompt indicates the desired changes, modifications, or edits. For example, as discussed above, the reference image may depict objects such as animals, and the prompt may include natural language textual data requesting that modifications such as the number, size, color, or other visual aspect of the animals be changed. As used herein, “accessing” data may generally include receiving, requesting, retrieving, obtaining, collecting, generating, or otherwise gaining access to the data. For example, the machine learning system may access the reference image and prompt as input from a user.

Patent Metadata

Filing Date

Unknown

Publication Date

March 17, 2026

Inventors

Unknown
