US-20250348674-A1

Distributing Prompt Processing in Generative Artificial Intelligence Models

Published: November 13, 2025

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

Certain aspects of the present disclosure provide techniques and apparatus for generating responses to large input prompts using a generative artificial intelligence model. An example method generally includes receiving an input prompt for processing using a generative artificial intelligence model. The input prompt is partitioned into a plurality of sub-prompts based on contextual information associated with tokens in the input prompt. A response to the input prompt is generated using the generative artificial intelligence model based on the plurality of sub-prompts and the contextual information associated with the tokens in the input prompt. The generated response is output.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A processing system, comprising:

. The processing system of, wherein:

. The processing system of, wherein the respective breadth metric comprises an indication of whether the respective token corresponds to a global concept in the input prompt or one or more local concepts in the input prompt.

. The processing system of, wherein to partition the tokens in the input prompt, the one or more processors are configured to cause the processing system to partition the tokens into a set of tokens corresponding to the global concept and one or more sets of tokens corresponding to the one or more local concepts in the input prompt.

. The processing system of, wherein:

. The processing system of, wherein the generative artificial intelligence model includes a gating mechanism configured to route the plurality of sub-prompts to different layers of the generative artificial intelligence model based on the contextual information associated with the tokens in the input prompt.

. The processing system of, wherein the gating mechanism comprises an attention layer in the generative artificial intelligence model, the attention layer comprising:

. The processing system of, wherein the generated response comprises an image depicting one or more objects specified by the input prompt.

. The processing system of, wherein the generative artificial intelligence model comprises a text-to-image diffusion model configured to generate an image output from a textual input.

. A processor-implemented method for machine learning, comprising:

. The method of, wherein:

. The method of, wherein the respective breadth metric comprises an indication of whether the respective token corresponds to a global concept in the input prompt or one or more local concepts in the input prompt.

. The method of, wherein partitioning the tokens in the input prompt comprises partitioning the tokens into a set of tokens corresponding to the global concept and one or more sets of tokens corresponding to the one or more local concepts in the input prompt.

. The method of, wherein:

. The method of, wherein the generative artificial intelligence model includes a gating mechanism configured to route the plurality of sub-prompts to different layers of the generative artificial intelligence model based on the contextual information associated with the tokens in the input prompt.

. The method of, wherein the gating mechanism comprises an attention layer in the generative artificial intelligence model, the attention layer comprising:

. The method of, wherein the generated response comprises an image depicting one or more objects specified by the input prompt.

. The method of, wherein the generative artificial intelligence model comprises a text-to-image diffusion model configured to generate an image output from a textual input.

. A processing system, comprising:

Detailed Description

Complete technical specification and implementation details from the patent document.

Aspects of the present disclosure relate to generative artificial intelligence models.

Generative artificial intelligence models can be used in various environments to generate a response to an input prompt (also referred to as a query or an input). For example, generative artificial intelligence models can be used in chatbot applications in which large language models (LLMs) are used to generate an answer, or at least a response, to an input prompt. Other examples of generative artificial intelligence models include latent diffusion models, in which a model generates an image or stream of images (e.g., video content) from an input text description of the content of the desired image or stream of images; decision transformers, in which future actions are predicted based on sequences of prior actions within a given environment; and the like. These models may be used, for example, in autonomous driving, image capture, and image display applications (e.g., extended reality, augmented reality, and/or virtual reality applications) to generate image outputs used within these applications.

Certain aspects of the present disclosure provide a method for generating responses to large input prompts using a generative artificial intelligence model. The method generally includes receiving an input prompt for processing using the generative artificial intelligence model. The input prompt is partitioned into a plurality of sub-prompts based on contextual information associated with tokens in the input prompt. A response to the input prompt is generated using the generative artificial intelligence model based on the plurality of sub-prompts and the contextual information associated with the tokens in the input prompt. The generated response is output.

Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those further described herein; and a processing system comprising means for performing the aforementioned methods as well as those further described herein.

The following description and the related drawings set forth in detail certain illustrative features of one or more aspects.

To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the drawings. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.

Aspects of the present disclosure provide apparatus, methods, processing systems, and computer-readable mediums for distributing the processing of large input prompts in generative artificial intelligence models to accurately generate an output reflecting the input prompt.

Generally, generative artificial intelligence models generate a response to a prompt input into the model. For example, generative artificial intelligence models can generate images or other visual content depicting one or more objects specified in an input prompt provided to the generative artificial intelligence model. While generative artificial intelligence models can generate visual content, generative artificial intelligence models may not accurately generate visual content including the requested content in the input prompt. For example, generative artificial intelligence models may tokenize an input prompt into a predefined number of tokens which can be used by the generative artificial intelligence models to generate the output requested by the prompt. While the use of the predefined number of tokens to represent a query may allow for generative artificial intelligence models to generate an output that accurately reflects what was requested in small prompts (e.g., prompts that do not include a large number of objects to render or conditions for rendering), the use of the predefined number of tokens to represent a query may not result in an accurate output for larger prompts. Such degradation may occur, for example, because the entirety of the prompt is processed in each layer of the generative artificial intelligence model and in the same manner during each iteration of processing the prompt through the generative artificial intelligence model.

A large input prompt may request that a generative artificial intelligence model generate, for example, an image or other visual content including a variety of objects and apply a variety of transformations to the visual content. Generally, objects in the generated image may be local concepts. For example, some objects may be located in the foreground of an image, while other objects may be located in the background of the image. Objects may also have spatial relationships with each other which may be specified in the input prompt. In contrast, modifiers specified in the input prompt may be local or global concepts. Some modifiers may apply to specific objects or specific portions of an image (e.g., foreground content, background content, etc.), while other modifiers may be global concepts that apply to the generated image as a whole. Thus, processing the tokens in the large input prompt in the same manner regardless of whether the tokens are associated with local or global concepts or specific timing relationships may cause the outputs generated by generative output models to be inaccurate vis-à-vis the input prompt.

Aspects of the present disclosure provide techniques and apparatus for accurately generating responses to large input prompts by generative artificial intelligence models. To do so, aspects of the present disclosure decompose a prompt into a plurality of sub-prompts which may be processed independently. These sub-prompts may, for example, include tokens which are logically related to each other (e.g., according to contextual information associated with these tokens) so that the generative artificial intelligence model can process these sub-prompts independently (e.g., using different layers of the generative artificial intelligence model, at different times, etc.). By doing so, aspects of the present disclosure may allow for generative artificial intelligence models to accurately generate outputs that reflect what is specified in the input prompt even as the size and complexity of the input prompt increases.

An accompanying drawing illustrates a generative artificial intelligence model that generates responses to a large input prompt using a gating mechanism that partitions the large input prompt into a plurality of sub-parts, according to aspects of the present disclosure. Generally, as discussed, a large input prompt may be an input prompt into a generative artificial intelligence model that specifies an output to be generated according to one or more applicable conditions and one or more objects to be included in the output. As illustrated, the generative artificial intelligence model includes a tokenizer, a large language model, a gating mechanism, and an image generator.

To generate an image from a large input prompt, which is generally a text string specifying the content of an output generated by the generative artificial intelligence model, the tokenizer generates a set of tokens representing the large input prompt. The set of tokens may be a one-dimensional array including a plurality of tokens derived from the large input prompt. In some aspects, tokens in the set of tokens may represent words or portions of words in the large input prompt. Within the one-dimensional array, the ordering of tokens may reflect the ordering of words in the large input prompt, such that a correlation may exist between tokens in the set of tokens and words or portions of words in the input prompt.

The set of tokens may be provided as input into the large language model, which may be an a priori trained model and may be frozen, and the gating mechanism, which may be a learnable machine learning model that adapts to data processed by the generative artificial intelligence model, in order to partition the large input prompt into a plurality of sub-prompts (collectively referred to as the “sub-prompts”). In some aspects, the gating mechanism may be configured to generate the sub-prompts based on a time embedding or other temporal contextual information identifying a portion of the image generating process which is ongoing in the generative artificial intelligence model. By doing so, the gating mechanism can generate sub-prompts that are relevant to generating different objects in the image at each stage of the image generation process. Generally, these different stages may correspond, for example, to different layers of the model implemented by the image generator and may correspond to different resolutions or receptive fields in an image generated by the image generator.

In some aspects, the partitioning of the large input prompt represented by the set of tokens may additionally or alternatively be performed by the gating mechanism based on the output of a large language model trained to generate contextual information about the tokens in the set of tokens, which the gating mechanism can use as input. In some aspects, the large language model can generate contextual information for each token in the set of tokens.

The contextual information may, for example, be spatial contextual information identifying an area of the output to be generated by the image generator in which an object represented by a token is to be located, temporal contextual information identifying temporal dependencies associated with different objects included in the output generated by the image generator, and the like. Spatial contextual information may, for example, indicate whether a token is associated with a local concept or a global concept and thus an area of a latent image (e.g., an image from a previous round of inferencing generated by the generative artificial intelligence model) to be modified by the image generator. Generally, local concepts correspond to objects which involve processing in a portion of the image output generated by the image generator and may have varying degrees of granularity. For example, local concepts may be organized into foreground and background content. In another example, local concepts may be organized into different spatial areas with relationships to other spatial areas in the image output generated by the image generator. Global concepts correspond to objects or modifications which involve processing the image output in its entirety. For example, global concepts may include a style to be applied by the image generator to the image output in its entirety, simulations of photographic filters on the image output, or the like.

Temporal contextual information identified by the large language model may include information identifying a temporal stage in the inferencing process at which the image generator is to process tokens in the set of tokens. Generally, tokens relating to objects that do not have spatial relationships to other objects specified in the input prompt may be associated with temporal contextual information identifying that these tokens can be processed earlier in the inferencing process than other tokens. Tokens relating to objects that do have spatial relationships to other objects specified in the input prompt may be associated with temporal contextual information identifying the objects which the image generator is to generate prior to processing these tokens. Finally, tokens relating to globally applicable changes to the image generated by the image generator may be associated with temporal contextual information identifying that these tokens are to be processed at the end of the inferencing process.
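As a non-limiting illustration, the ordering described above can be sketched in Python. The function name `assign_stages`, the dependency dictionary, and the example prompt contents are hypothetical and do not appear in the disclosure; in the disclosure this information is produced by the large language model rather than supplied by hand. Objects without spatial dependencies receive the earliest stage, dependent objects are placed after the objects they depend on, and global modifiers are placed last.

```python
def assign_stages(objects, depends_on, global_modifiers):
    """Return {name: stage}; lower stages are processed earlier.

    objects          -- names of objects named in the prompt (hypothetical input)
    depends_on       -- {object: [objects that must be generated first]}
    global_modifiers -- names of image-wide changes, always scheduled last
    """
    stages = {}

    def stage_of(name, seen=()):
        if name in stages:
            return stages[name]
        if name in seen:
            raise ValueError(f"cyclic spatial dependency at {name!r}")
        deps = depends_on.get(name, [])
        # No spatial dependencies: earliest stage. Otherwise: one stage
        # after the latest object this one depends on.
        stage = 0 if not deps else 1 + max(stage_of(d, seen + (name,)) for d in deps)
        stages[name] = stage
        return stage

    for obj in objects:
        stage_of(obj)

    # Globally applicable changes go at the end of the inferencing process.
    last = max(stages.values(), default=-1) + 1
    for modifier in global_modifiers:
        stages[modifier] = last
    return stages


stages = assign_stages(
    objects=["table", "vase", "flowers"],
    depends_on={"vase": ["table"], "flowers": ["vase"]},
    global_modifiers=["watercolor style"],
)
print(stages)
```

In this example, "table" has no dependencies and is scheduled first, "vase" and "flowers" follow in dependency order, and the style modifier is scheduled after every object.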

The contextual information generated by the large language model may be provided as input into the gating mechanism, which, as illustrated, decomposes the set of tokens representing the input prompt into a plurality of sub-prompts including subsets of the set of tokens. Generally, the gating mechanism decomposes the set of tokens into the plurality of sub-prompts based on the contextual information identified by the large language model for the tokens in the set of tokens. In some aspects in which the large language model generates spatial contextual information, the gating mechanism can generate sub-prompts based on shared spatial information for tokens in the set of tokens representing the input prompt. For example, the gating mechanism can generate sub-prompts for tokens associated with local concepts and tokens associated with global concepts. In another example, the gating mechanism can generate sub-prompts for tokens associated with foreground content and tokens associated with background content. In some aspects in which the large language model generates temporal contextual information, the gating mechanism can generate sub-prompts based on a stage in the inferencing process at which different objects are to be generated or different modifications are to be applied to the image.

The sub-prompts may be input into the image generator, along with a Gaussian noise image and a time embedding, for use in generating an image output of the generative artificial intelligence model. The image generator may be, for example, a generative artificial intelligence model, such as a text-to-image diffusion model (e.g., a U-Net model), including a plurality of layers. Different layers in the image generator may be used to generate content in different spatial areas of the image, starting with the Gaussian noise image and progressively denoising the image to result in an image including the objects specified in the input prompt and in the style specified in the input prompt.

In some aspects, the sub-prompts may be routed to and processed by different layers in the image generator. Generally, the processing of the sub-prompts by different layers in the image generator allows for different portions of the image output generated by the image generator to be processed according to the time embedding, which identifies a step in the inferencing process (e.g., a diffusion step in which the image generator denoises a latent image to generate an image including the objects and effects specified in the input prompt) in which the image is being processed, and the area to be affected by processing the tokens included in the sub-prompts.
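As a non-limiting illustration, routing by inferencing step and target layer might be sketched as follows. The `SubPrompt` record, the layer-group names, and the step ranges are assumptions introduced only for this sketch; the disclosure does not specify how routing decisions are represented, and in the disclosure they are produced by the learned gating mechanism rather than by fixed ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubPrompt:
    tokens: tuple      # tokens belonging to this sub-prompt
    layer: str         # hypothetical name of the layer group that should see it
    first_step: int    # first inferencing step (inclusive) at which it is active
    last_step: int     # last inferencing step (inclusive) at which it is active


def route(sub_prompts, step):
    """Return {layer: [tokens]} for the sub-prompts active at this step."""
    routed = {}
    for sp in sub_prompts:
        if sp.first_step <= step <= sp.last_step:
            routed.setdefault(sp.layer, []).extend(sp.tokens)
    return routed


sub_prompts = [
    SubPrompt(("mountain", "lake"), "low_res", 0, 29),    # background layout
    SubPrompt(("red", "boat"), "high_res", 10, 49),       # foreground detail
    SubPrompt(("oil", "painting"), "all_res", 40, 49),    # global style, late
]

early = route(sub_prompts, step=5)
late = route(sub_prompts, step=45)
print(early)
print(late)
```

At step 5 only the background sub-prompt is routed; at step 45 the foreground and global-style sub-prompts are routed to their respective layer groups.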

Another drawing illustrates the architecture of the gating mechanism which, as discussed above, partitions a large input prompt into one or more sub-prompts, according to aspects of the present disclosure.

As illustrated, the gating mechanism may be an attention-based neural network which generates sub-prompts as a set of masked tokens from a set of tokens representing a tokenized version of the large input prompt provided as input into the generative artificial intelligence model. As illustrated, the gating mechanism includes a first layer (also referred to as a first projection layer) configured to project the set of tokens into query data and a second layer (also referred to as a second projection layer) configured to project the contextual information associated with the inferencing process and/or the set of tokens into key and value data. As discussed above, the contextual information may include spatial contextual information identifying portions of the image generated by the generative artificial intelligence model which are affected by different tokens in the set of tokens representing the large input prompt and/or temporal contextual information identifying an inferencing stage currently being executed by the generative artificial intelligence model or temporal relationships between different objects or effects specified in the large input prompt.

The query data generated by the first layer and the key and value data generated by the second layer may be fed into an attention block for processing. The attention block generally uses the query data generated from the set of tokens and the key and value data generated from the contextual data to determine which tokens are relevant to a specific inferencing round or portion of an image being processed by the generative artificial intelligence model. The output of the attention block may be a probability value associated with each token in the set of tokens identifying a likelihood of that token being relevant to a specific inferencing round or portion of an image being processed by the generative artificial intelligence model. The probability values for each token may be processed by a nonlinear layer (e.g., illustrated as a softmax layer, though other nonlinear functions, such as a sigmoid function, may also be contemplated for the nonlinear layer) to generate one or more masks to apply to the set of tokens to generate sub-prompts for processing. The masks may be combined with the set of tokens (e.g., via a multiplication block) to generate a set of masked tokens. In some aspects, the sum of the values identified in a mask may be 1, with relevant tokens being associated with higher values and non-relevant tokens being associated with zero or near-zero values. By combining a mask with the set of tokens using the multiplication block, the resulting masked tokens may include a plurality of zero or near-zero values for tokens that are not relevant to a specific sub-prompt (e.g., portion of an input prompt being processed during a given inferencing round in the generative artificial intelligence model) and non-zero values for tokens that are relevant to a specific sub-prompt. In some aspects, the nonlinear layer may include a rounding function which converts probability values above a threshold level (which may be defined a priori) to values of one and probability values below the threshold level to zero, so that the resulting masked tokens generated by multiplying the set of tokens by the mask include either zero-valued tokens or tokens with values identical to the corresponding tokens in the set of tokens.
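As a non-limiting illustration, the data flow through the gating mechanism can be sketched in plain Python. Everything named here is an assumption made for the sketch: the function `gate`, the weight matrices `w_q`, `w_k`, `w_v`, the output vector `w_out` used to reduce each attended vector to a single score, the single attention head, and the toy two-dimensional inputs. The disclosure does not give dimensions, weights, or the exact form of the reduction to a per-token value.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gate(tokens, context, w_q, w_k, w_v, w_out, threshold=None):
    """Return (mask, masked_tokens) for one sub-prompt.

    tokens  -- token vectors, projected to queries
    context -- contextual-information vectors, projected to keys and values
    """
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, c) for c in context]
    values = [matvec(w_v, c) for c in context]
    scale = math.sqrt(len(keys[0]))

    scores = []
    for q in queries:
        # Scaled dot-product attention of one token over the context.
        weights = softmax([dot(q, k) / scale for k in keys])
        attended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        # Assumed reduction of the attended vector to one score per token.
        scores.append(dot(w_out, attended))

    # Nonlinear layer: softmax across tokens, so the mask sums to 1.
    mask = softmax(scores)
    if threshold is not None:
        # Optional rounding: values above the threshold become 1, others 0.
        mask = [1.0 if m > threshold else 0.0 for m in mask]

    # Multiplication block: scale each token vector by its mask value.
    masked = [[m * x for x in t] for m, t in zip(mask, tokens)]
    return mask, masked


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [[1.0, 0.0], [0.0, 1.0]]

soft_mask, soft_tokens = gate(tokens, context, identity, identity, identity, [1.0, 0.0])
hard_mask, hard_tokens = gate(
    tokens, context, identity, identity, identity, [1.0, 0.0], threshold=0.35
)
print(soft_mask)
print(hard_mask)
```

With the soft mask, every token keeps a scaled copy of its value; with the thresholded mask, each token is either zeroed out or passed through unchanged, matching the rounding variant described above.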

A further drawing illustrates example operations for generating an output to a large input prompt using a generative artificial intelligence model including a gating mechanism that partitions the input prompt into a plurality of sub-parts, according to aspects of the present disclosure. The operations may be performed by a device on which a generative artificial intelligence model can be deployed, such as a smartphone, a tablet computer, a laptop computer, a desktop, a server, a cloud compute instance hosted in a distributed computing environment, or the like.

As illustrated, the operations begin with receiving an input prompt for processing using a generative artificial intelligence model.

The operations proceed with partitioning the input prompt into a plurality of sub-prompts based on contextual information associated with tokens in the input prompt.

In some aspects, the contextual information includes breadth metrics associated with the tokens in the input prompt. To partition the input, a respective breadth metric may be associated with a respective token from the tokens in the input prompt using a language model. The tokens in the input prompt may be partitioned based on respective breadth metrics associated with respective tokens from the tokens in the input prompt. In some aspects, the respective breadth metric comprises an indication of whether the respective token corresponds to a global concept in the input prompt or one or more local concepts in the input prompt. The tokens in the input prompt may be partitioned into a set of tokens corresponding to the global concept and one or more sets of tokens corresponding to the one or more local concepts in the input prompt.
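As a non-limiting illustration, partitioning by breadth metric can be sketched as follows. The representation of a breadth metric as a `(scope, concept)` pair, the function name `partition_by_breadth`, and the example labels are assumptions for this sketch; in the disclosure the metric is associated with each token using a language model, which is replaced here by hand-written labels.

```python
def partition_by_breadth(tokens, breadth):
    """Split tokens into one global set and one set per local concept.

    breadth[i] is a hypothetical (scope, concept) pair for tokens[i]:
    ("global", None) for image-wide concepts, ("local", name) otherwise.
    """
    global_tokens = []
    local_tokens = {}
    for token, (scope, concept) in zip(tokens, breadth):
        if scope == "global":
            global_tokens.append(token)
        else:
            local_tokens.setdefault(concept, []).append(token)
    return global_tokens, local_tokens


tokens = ["red", "boat", "calm", "lake", "oil", "painting"]
breadth = [
    ("local", "boat"), ("local", "boat"),
    ("local", "lake"), ("local", "lake"),
    ("global", None), ("global", None),
]

global_tokens, local_tokens = partition_by_breadth(tokens, breadth)
print(global_tokens)
print(local_tokens)
```

Each resulting set can then be treated as one sub-prompt: one for the global concept and one for each local concept.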

In some aspects, the breadth metrics may correspond to the spatial contextual information discussed above. Local concepts may be, for example, concepts in which a portion of an image that is less than the entirety of the image is to be modified by processing the associated tokens in the generative artificial intelligence model. Global concepts may be, in contrast, concepts involving the modification of the entirety of the image.

In some aspects, the contextual information may include temporal embeddings associated with the tokens in the input prompt. The input prompt may be partitioned into the plurality of sub-prompts by partitioning the tokens in the input prompt into groups of temporally related tokens based on the temporal embeddings.
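As a non-limiting illustration, grouping temporally related tokens might be sketched with a nearest-centroid rule. This rule, the function name `group_by_temporal_embedding`, the stage centroids, and the two-dimensional embeddings are all assumptions introduced for the sketch; the disclosure does not state how temporal embeddings are compared or grouped.

```python
def group_by_temporal_embedding(tokens, embeddings, centroids):
    """Assign each token to the stage whose centroid is nearest.

    embeddings[i] is the (hypothetical) temporal embedding of tokens[i];
    centroids maps a stage name to a reference embedding for that stage.
    """
    groups = {name: [] for name in centroids}
    for token, emb in zip(tokens, embeddings):
        nearest = min(
            centroids,
            # Squared Euclidean distance between embedding and centroid.
            key=lambda name: sum((e - c) ** 2 for e, c in zip(emb, centroids[name])),
        )
        groups[nearest].append(token)
    return groups


tokens = ["mountain", "boat", "sepia"]
embeddings = [[0.9, 0.1], [0.6, 0.3], [0.1, 0.8]]
centroids = {"early": [1.0, 0.0], "late": [0.0, 1.0]}

groups = group_by_temporal_embedding(tokens, embeddings, centroids)
print(groups)
```

Each non-empty group corresponds to one sub-prompt of temporally related tokens.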

In some aspects, the contextual information may include temporal embeddings associated with the output generation process. Generally, these temporal embeddings may correspond to a step in the output generation process which is currently being executed.

The operations proceed with generating a response to the input prompt using the generative artificial intelligence model based on the plurality of sub-prompts and the contextual information associated with the tokens in the input prompt.

In some aspects, the generative artificial intelligence model includes a gating mechanism configured to route the plurality of sub-prompts to different layers of the generative artificial intelligence model based on the contextual information associated with the tokens in the input prompt. The gating mechanism may be, for example, an attention layer or other attention-based neural network. The gating mechanism generally includes a first projection block that projects the contextual information to key and value data; a second projection block that projects the tokens in the input prompt to query data; a multi-head attention block that generates an attention output based on the key data, the value data, and the query data; and a nonlinear projection layer that generates an attention mask based on the attention output, the attention mask being combined with the tokens in the input prompt to generate a masked set of tokens as an output of the gating mechanism.

In some aspects, the generative artificial intelligence model comprises a text-to-image diffusion model configured to generate an image output from a textual input.

The operations proceed with outputting the generated response.

In some aspects, the generated response may be an image depicting one or more objects specified by the input prompt.

A further drawing depicts an example processing system for processing large input prompts using a generative artificial intelligence model, such as described herein.

The processing system includes a central processing unit (CPU), which in some examples may be a multi-core CPU. Instructions executed at the CPU may be loaded, for example, from a program memory associated with the CPU or may be loaded from a memory partition (e.g., of a memory).

The processing system also includes additional processing components tailored to specific functions, such as a graphics processing unit (GPU), a digital signal processor (DSP), a neural processing unit (NPU), and a connectivity component.

An NPU, such as the NPU of the processing system, is generally a specialized circuit configured for implementing control and arithmetic logic for executing machine learning algorithms, such as algorithms for processing artificial neural networks (ANNs), deep neural networks (DNNs), random forests (RFs), and the like. An NPU may sometimes alternatively be referred to as a neural signal processor (NSP), tensor processing unit (TPU), neural network processor (NNP), intelligence processing unit (IPU), vision processing unit (VPU), or graph processing unit.

NPUs, such as the NPU of the processing system, are configured to accelerate the performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, a plurality of NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples such NPUs may be part of a dedicated neural-network accelerator.

NPUs may be optimized for training or inference, or in some cases configured to balance performance between both. For NPUs that are capable of performing both training and inference, the two tasks may still generally be performed independently.

NPUs designed to accelerate training are generally configured to accelerate the optimization of new models, which is a highly compute-intensive operation that involves inputting an existing dataset (often labeled or tagged), iterating over the dataset, and then adjusting model parameters, such as weights and biases, in order to improve model performance. Generally, optimizing based on a wrong prediction involves propagating back through the layers of the model and determining gradients to reduce the prediction error.

NPUs designed to accelerate inference are generally configured to operate on complete models. Such NPUs may thus be configured to input a new piece of data and rapidly process this new piece through an already trained model to generate a model output (e.g., an inference).

In some implementations, the NPU is a part of one or more of the CPU, the GPU, and/or the DSP. These may be located on a user equipment (UE) in a wireless communication system or another computing device.

In some examples, the connectivity component may include subcomponents, for example, for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., Long-Term Evolution (LTE)), fifth generation (5G) connectivity (e.g., New Radio (NR)), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. The connectivity component may be further coupled to one or more antennas.

The processing system may also include one or more sensor processing units associated with any manner of sensor, one or more image signal processors (ISPs) associated with any manner of image sensor, and/or a navigation processor, which may include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.

The processing system may also include one or more input and/or output devices, such as screens, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and the like.

In some examples, one or more of the processors of the processing system may be based on an ARM or RISC-V instruction set.

The processing system also includes the memory, which is representative of one or more static and/or dynamic memories, such as a dynamic random access memory, a flash-based static memory, and the like. In this example, the memory includes computer-executable components, which may be executed by one or more of the aforementioned processors of the processing system.

In particular, in this example, the memory includes a prompt receiving component, a prompt partitioning component, a response generating component, a response outputting component, and a generative model. The depicted components, and others not depicted, may be configured to perform various aspects of the methods described herein.

Generally, the processing system and/or components thereof may be configured to perform the methods described herein.

Patent Metadata

Filing Date: Unknown

Publication Date: November 13, 2025

Inventors: Unknown
