
US-20260073484-A1

Generating a Digital Video Utilizing a Set of Anchor Tokens to Denoise Noised Tokens

Published: March 12, 2026

Assignee: not available in USPTO data we have

Inventors: Tobias Hinz, Lior Shapira, Lakshya Lnu, Kevin Duarte, Ali Aminian

Technical Abstract

The present disclosure relates to systems, methods, and non-transitory computer-readable media that generate a digital video based on denoised tokens. In particular, the disclosed systems generate a set of image tokens from a digital image that is part of an image-to-video request. Furthermore, the disclosed systems generate a set of anchor tokens from the set of image tokens by adding a timestep embedding to the set of image tokens that indicates that the set of anchor tokens are fully denoised. Further, the disclosed systems generate combined tokens from the set of anchor tokens and noised tokens that are generated from noise. Moreover, the disclosed systems generate denoised tokens by using a diffusion transformer model to process the combined tokens. Further, from the denoised tokens, the disclosed systems generate the digital video that includes the digital image.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-implemented method comprising: generating, from an image-to-video request comprising a digital image, a set of image tokens from the digital image; generating a set of anchor tokens from the set of image tokens by adding a timestep embedding to the set of image tokens that indicates that the set of anchor tokens are fully denoised; generating combined tokens from the set of anchor tokens and noised tokens that are generated from noise; generating, utilizing a diffusion transformer model to process the combined tokens, denoised tokens; and generating a digital video comprising at least a portion of the digital image based on the denoised tokens.

2. The computer-implemented method of claim 1, further comprising: receiving, from a client device at inference time, the image-to-video request to generate the digital video comprising the digital image and the image-to-video request indicates that the digital image is to be portrayed in the digital video, wherein the image-to-video request indicates that the digital image is to be included as a first frame of a sequence of frames, an intermediate frame of the sequence of frames, or a final frame of the sequence of frames.

3. The computer-implemented method of claim 1, wherein generating the set of image tokens comprises: generating, from the digital image of the image-to-video request, an embedding that represents the digital image; and generating, utilizing a tokenization model to break down the digital image into a plurality of image patches, the set of image tokens from the embedding.

4. The computer-implemented method of claim 1, wherein generating the denoised tokens comprises: initializing the noised tokens by sampling a random level of noise from a noise distribution; and removing noise from the noised tokens according to the set of anchor tokens utilizing the diffusion transformer model.

5. The computer-implemented method of claim 1, wherein generating the digital video comprises generating a sequence of frames from the denoised tokens, wherein the digital video includes the digital image as at least one of a portion of a frame of the sequence of frames, one or more keyframes in the sequence of frames, or one or more motion frames in the sequence of frames.

6. The computer-implemented method of claim 1, further comprising training the diffusion transformer model by: generating, from a frame of a sequence of training frames, a training embedding that represents the frame; generating, utilizing a tokenization model, a set of training tokens from the training embedding; and generating, utilizing the tokenization model, noised training tokens from the sequence of training frames that does not include the frame.

7. The computer-implemented method of claim 6, further comprising: generating training anchor tokens by concatenating timestep embeddings to the set of training tokens to indicate that the set of training tokens are fully denoised; and generating, utilizing the diffusion transformer model to process the training anchor tokens and the noised training tokens, denoised training tokens.

8. The computer-implemented method of claim 7, further comprising: generating, utilizing a detokenization model, denoised training embeddings from the denoised training tokens; and comparing the denoised training embeddings with embeddings generated from the sequence of training frames prior to tokenization; and determining a measure of loss from comparing the denoised training embeddings with the embeddings generated from the sequence of training frames prior to tokenization to modify parameters of the diffusion transformer model.

9. The computer-implemented method of claim 1, wherein generating the digital video comprises: generating, from a first pass of the combined tokens through the diffusion transformer model, a conditional token output from the denoised tokens; generating, from a second pass of additional combined tokens through the diffusion transformer model, an unconditional token output from additional denoised tokens; generating a final token output by combining the conditional token output and the unconditional token output; and generating the digital video comprising at least the portion of the digital image based on the final token output.

10. A system comprising: a memory component; and one or more processing devices coupled to the memory component, the one or more processing devices to perform operations comprising: generating a set of image tokens from a digital image as part of an image-to-video request; generating, from a first pass of combined tokens through a trained diffusion transformer model, a conditional token output, wherein the combined tokens comprise a set of anchor tokens from the set of image tokens and noised tokens; generating, from a second pass of additional combined tokens through the trained diffusion transformer model, an unconditional token output, wherein the additional combined tokens comprise the set of anchor tokens and additional noised tokens; generating a final token output by combining the conditional token output and the unconditional token output; and generating a digital video comprising at least a portion of the digital image based on the final token output.

11. The system of claim 10, wherein the operations comprise receiving, from a client device at inference time, the image-to-video request to generate the digital video and a text prompt that indicates that the digital image is to be portrayed in the digital video.

12. The system of claim 10, wherein the operations comprise: receiving, from a client device, the image-to-video request that comprises a conditional prompt for the trained diffusion transformer model to include in the digital video and an unconditional prompt for the digital video, wherein the conditional prompt indicates that the digital image is to be portrayed as at least one of an initial frame, an intermediate frame, or a subset of frames in the digital video.

13. The system of claim 10, wherein generating the combined tokens comprises: generating text tokens from a conditional prompt of a text prompt of the image-to-video request; generating an image embedding from the digital image; initializing the noised tokens from a noise distribution; and combining the text tokens, the image embedding, and the noised tokens.

14. The system of claim 13, further comprising generating the set of anchor tokens from the image embedding by: generating, utilizing a tokenization model to break down the digital image into a plurality of image patches, the set of image tokens from the image embedding; and adding a timestep embedding to the set of image tokens from the image embedding, wherein the timestep embedding indicates to the trained diffusion transformer model that the set of image tokens are fully denoised.

15. The system of claim 10, wherein generating the additional combined tokens comprises: generating text tokens from an unconditional prompt of a text prompt of the image-to-video request; generating an image embedding from the digital image; initializing the noised tokens from a noise distribution; and combining the text tokens, the image embedding, and the noised tokens.

16. The system of claim 10, wherein combining the conditional token output and the unconditional token output comprises: interpolating, utilizing a classifier free guidance model, between the conditional token output and the unconditional token output, wherein interpolating comprises a guidance scale that encourages the trained diffusion transformer model to generate the digital video based on the conditional token output; and generating the final token output based on the interpolation of the classifier free guidance model.

17. A non-transitory computer-readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising: generating, from an image-to-video request comprising a digital image, a set of image tokens from the digital image; generating a set of anchor tokens from the set of image tokens by adding a timestep embedding to the set of image tokens that indicates that the set of image tokens are fully denoised; generating combined tokens from the set of anchor tokens and noised tokens that are generated from noise; generating, utilizing a diffusion transformer model to process the combined tokens, denoised tokens; and generating a digital video comprising at least a portion of the digital image based on the denoised tokens.

18. The non-transitory computer-readable medium of claim 17, wherein the operations further comprise: receiving, from a client device at inference time, the image-to-video request to generate the digital video comprising the digital image, wherein the image-to-video request indicates that the digital image is to be included as at least one of a portion of a frame of a sequence of frames, one or more keyframes in the sequence of frames, or one or more motion frames in the sequence of frames.

19. The non-transitory computer-readable medium of claim 17, wherein the operations further comprise training the diffusion transformer model by adding noise to a subset of frames of a sequence of frames of a training video, wherein the subset of frames does not include one or more frames that are anchor frames in the training video.

20. The non-transitory computer-readable medium of claim 17, wherein generating the digital video comprises: generating, from a first pass of the combined tokens through the diffusion transformer model, a conditional token output from the denoised tokens, wherein the combined tokens comprise the set of anchor tokens and the noised tokens; generating, from a second pass of additional combined tokens through the diffusion transformer model, an unconditional token output from additional denoised tokens, wherein the additional combined tokens comprise the set of anchor tokens and additional noised tokens; generating a final token output by combining the conditional token output and the unconditional token output; and generating the digital video comprising at least the portion of the digital image based on the final token output.

Detailed Description

Complete technical specification and implementation details from the patent document.

This application claims the benefit of and priority to U.S. Provisional Application No. 63/693,660, filed Sep. 11, 2024. The aforementioned application is hereby incorporated by reference in its entirety.

Recent years have seen significant advancement in hardware and software platforms for performing generative tasks. Indeed, systems provide a variety of ways to generate static images and dynamic videos. For instance, systems create distinct architectures for generating content in different modalities. Specifically, systems tailor architecture for creating digital videos from various input prompts.

One or more embodiments described herein provide benefits and/or solve one or more problems in the art with systems, methods, and non-transitory computer-readable media that implement an artificial intelligence (hereinafter referred to as AI) architecture for executing image-to-video requests. For example, the disclosed systems generate a set of image tokens for a digital image that is part of an image-to-video request and further generate a set of anchor tokens from the set of image tokens by adding a timestep that indicates that the set of image tokens are fully denoised. Furthermore, in some embodiments, the disclosed systems generate denoised tokens by using a diffusion transformer model to process the set of anchor tokens and noised tokens. Moreover, in some embodiments, the disclosed systems generate a digital video based on the denoised tokens, and the digital video includes at least a portion of the digital image that was part of the image-to-video request.

Additional features and advantages of one or more embodiments of the present disclosure are outlined in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.

One or more embodiments described herein include a frame anchoring technique for generating high-quality and accurate digital videos based on an image-to-video request. Existing systems that perform generative tasks suffer from a variety of issues related to accuracy, efficiency, and operational flexibility. Specifically, existing systems suffer from computational inaccuracies. For example, existing systems perform video generation; however, when performing generative tasks, they fail to accurately include content within a generated digital video that is specified in a prompt. For instance, existing systems utilize architecture that fails to fully consider information specified in a video generation request. Thus, existing systems utilize generative architecture that generates inaccurate digital videos.

Existing systems generate video from a user-provided prompt; however, they often generate content that lacks strong image/video and/or text semantic alignment (e.g., existing systems generate inaccurate digital videos). Furthermore, existing systems often generate low-quality frames and/or compromised frames that fail to capture the subject of the request.

In addition, existing systems typically struggle to accurately generate a video from an input image. For example, existing systems generate videos based on an input image but fail to include the input image as a coherent frame within the video. As such, existing systems produce inaccurate and lower-quality videos when performing image-to-video tasks.

Related to accuracy issues, existing systems suffer from computational inefficiencies. Specifically, existing systems typically require prompting and re-prompting of the system to generate a satisfactory digital video. In some instances, even with the prompting and re-prompting of existing systems, existing systems fail to accurately generate digital videos. As such, existing systems consume excessive computational resources and time to perform generative tasks.

Moreover, conventional systems suffer from further inefficiencies by using complicated transformer-based architectures. Specifically, in order for conventional systems to generate video and image content, conventional systems typically require domain specific complexity for the model architecture to capture all the domain specific data. Accordingly, conventional systems require a lot of time and resources to run a model that generates content across domains. Furthermore, related to the accuracy and computational efficiency issues, existing systems suffer from operational inflexibilities. Specifically, existing systems suffer from implementing rigid generative models that inaccurately and inefficiently generate digital videos.

In some embodiments, a generative AI digital visual system overcomes disadvantages of existing systems. In one or more embodiments, the generative AI digital visual system implements a frame anchoring technique for generating digital videos by transforming a digital image (e.g., part of an image-to-video request) into a set of image tokens and further into a set of anchor tokens. For instance, the generative AI digital visual system transforms the set of image tokens into a set of anchor tokens by adding a timestep embedding that indicates that the set of image tokens are fully denoised.

Accordingly, the generative AI digital visual system uses a diffusion transformer model to process the set of anchor tokens (which are not noised) and noised tokens to generate denoised tokens. In particular, the set of anchor tokens act as a guide in removing noise from the noised tokens, such that the generative AI digital visual system 102 generates a digital video that contains the digital image as a frame. In other words, the generative AI digital visual system leverages the set of anchor tokens for a diffusion transformer model to fully use the information present in the digital image (e.g., the condition or anchor image) to remove noise from noised tokens.
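The anchoring step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: plain lists stand in for tensors, and the sinusoidal `timestep_embedding`, the convention that timestep 0 means "fully denoised", and every function name are assumptions made for illustration. The diffusion transformer itself is omitted; the sketch only shows how un-noised anchor tokens and noise-initialized tokens might be placed in one sequence.

```python
import math
import random

FULLY_DENOISED_T = 0.0  # assumption: timestep 0 means "no noise remains"


def timestep_embedding(t, dim):
    """Toy sinusoidal embedding of a scalar timestep (dim must be even)."""
    half = dim // 2
    freqs = [10000 ** (-i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def make_anchor_tokens(image_tokens):
    """Add the 'fully denoised' timestep embedding to every image token."""
    emb = timestep_embedding(FULLY_DENOISED_T, len(image_tokens[0]))
    return [[x + e for x, e in zip(tok, emb)] for tok in image_tokens]


def init_noised_tokens(count, dim, rng):
    """Initialize tokens for the frames to be generated from Gaussian noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def combine_tokens(anchor_tokens, noised_tokens):
    """Concatenate along the sequence axis so one model pass sees both."""
    return anchor_tokens + noised_tokens


rng = random.Random(0)
image_tokens = [[0.5, -0.5, 1.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
anchors = make_anchor_tokens(image_tokens)
combined = combine_tokens(anchors, init_noised_tokens(3, 4, rng))
print(len(combined))  # 2 anchor tokens followed by 3 noised tokens
```

In this sketch the anchor tokens are never perturbed, so a model attending over `combined` would see the image content intact alongside the tokens it has to denoise.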

In one or more embodiments, the generative AI digital visual system 102 uses the set of anchor tokens to support an initial frame of the generated digital video. Specifically, the generative AI digital visual system 102 uses the set of anchor tokens to ensure that the digital image ends up as the initial frame of a generated digital video. In some embodiments, the generative AI digital visual system 102 uses the set of anchor tokens to support a final frame, an intermediate frame, any subset of frames, or any portion of frames within the generated digital video. In particular, in some embodiments, the generative AI digital visual system 102 uses the set of anchor tokens to ensure that keyframes or motion frames of a generated digital video are the digital image provided as part of the image-to-video request.

In one or more embodiments, the generative AI digital visual system 102 trains a diffusion transformer model using the frame anchoring technique. Specifically, the generative AI digital visual system receives a training digital video that includes a sequence of frames. For instance, the generative AI digital visual system transforms the sequence of frames into image tokens and adds noise to image tokens of the sequence of frames except for a frame that is used to condition the diffusion transformer model (e.g., an anchor frame or condition image) to improve image to video generation during training and inference time. Accordingly, at inference time, the generative AI digital visual system demonstrates improved generative capabilities for generating digital visual content using artificial intelligence systems.
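The training-time noising described above can be sketched as follows. This is a hedged illustration only: the linear mix between clean values and Gaussian noise is an assumed noise schedule (the text does not specify one), and `noise_training_frames`, `anchor_index`, and `noise_level` are hypothetical names.

```python
import random


def noise_training_frames(frame_tokens, anchor_index, noise_level, rng):
    """Noise every frame's tokens except the anchor frame's.

    frame_tokens: list of frames, each a list of token vectors.
    noise_level:  value in [0, 1]; the linear mix below is an assumption.
    Returns the (partly) noised frames and a per-frame anchor flag.
    """
    noised_frames, is_anchor = [], []
    for index, frame in enumerate(frame_tokens):
        if index == anchor_index:
            # The anchor (condition) frame is left fully denoised.
            noised_frames.append([list(tok) for tok in frame])
            is_anchor.append(True)
        else:
            noised_frames.append([
                [(1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0)
                 for x in tok]
                for tok in frame
            ])
            is_anchor.append(False)
    return noised_frames, is_anchor


rng = random.Random(1)
frames = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
noised, is_anchor = noise_training_frames(frames, 0, 0.8, rng)
print(is_anchor)
```

A training loop would then ask the model to recover the clean tokens for the non-anchor frames and compute a loss against the originals, which this sketch leaves out.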

In one or more embodiments, at inference time, the generative AI digital visual system uses a trained diffusion transformer model to perform high-quality and accurate image-to-video generation. Specifically, once the diffusion transformer model is trained, the generative AI digital visual system receives an image-to-video request and performs multiple passes over a diffusion transformer model (e.g., a first pass and a second pass). For instance, the generative AI digital visual system generates a conditional token output (e.g., an output that is based on the conditional aspects of the image-to-video request) and an unconditional token output (e.g., an output that is based on unconditional aspects of the image-to-video request). Furthermore, the generative AI digital visual system combines the conditional token output and the unconditional token output to generate a final token output and further generates the digital video from the final token output.
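Combining the two passes can be illustrated with the commonly used classifier-free guidance formula, `uncond + scale * (cond - uncond)`. Whether the disclosed system uses exactly this form is an assumption; the text only says the conditional and unconditional token outputs are combined. The function name and the list-of-lists token layout are likewise hypothetical.

```python
def combine_with_guidance(cond_tokens, uncond_tokens, guidance_scale):
    """Blend two token outputs; a larger scale favors the conditional pass."""
    return [
        [u + guidance_scale * (c - u) for c, u in zip(cond_tok, uncond_tok)]
        for cond_tok, uncond_tok in zip(cond_tokens, uncond_tokens)
    ]


conditional = [[1.0, 2.0]]    # stand-in output of the first (conditional) pass
unconditional = [[0.0, 1.0]]  # stand-in output of the second (unconditional) pass
final = combine_with_guidance(conditional, unconditional, 2.0)
print(final)
```

With a scale of 1.0 the result equals the conditional output, and with 0.0 it equals the unconditional output, so the scale controls how strongly the request's conditions steer the final tokens.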

As mentioned above, the generative AI digital visual system overcomes deficiencies of existing systems. For example, the generative AI digital visual system improves computational accuracy relative to existing systems. As mentioned above, existing systems suffer from inaccurately including content within a generated digital video. In contrast, the generative AI digital visual system generates a set of anchor tokens (from a set of image tokens of a digital image) to use as a guide in removing noise from noised tokens. In doing so, the generative AI digital visual system more accurately includes content specified in an image-to-video request. For instance, by keeping the set of image tokens (for the digital image, i.e., the anchor image) fully denoised, the generative AI digital visual system uses a diffusion transformer model to fully consider the information present in the digital image (e.g., the condition or anchor image) to remove noise from noised tokens.

Moreover, in contrast to existing systems which lack a strong image/video and/or text semantic alignment with an image-to-video request, in some embodiments, the generative AI digital visual system improves semantic alignment by identifying the digital image (e.g., anchor image) to generate a set of anchor tokens from and further uses the set of anchor tokens to generate the denoised tokens. In other words, the generative AI digital visual system more accurately considers an image-to-video prompt to generate a higher quality digital video and higher quality frames within the generated digital video.

Furthermore, in contrast to existing systems which typically fail to include an input image (part of a visual prompt) as a coherent frame within a generated digital video, in some embodiments, the generative AI digital visual system receives an image-to-video request that includes a digital image (e.g., the digital image is indicated as a condition of generating the digital video) and generates a set of anchor tokens from the digital image. Specifically, as mentioned, the generative AI digital visual system uses the set of anchor tokens as a guide to remove noise from noised tokens and to create a digital video that includes the digital image as a coherent frame.

Moreover, in some embodiments, the generative AI digital visual system further improves upon accuracy of existing systems by implementing unique training measures. Specifically, the generative AI digital visual system uses a training video that includes a sequence of frames and adds noise to image tokens of the sequence of frames except for image tokens of an anchor frame (e.g., a digital image). In doing so, the generative AI digital visual system optimizes parameters of a diffusion transformer model to learn to remove noise from noised tokens according to a set of anchor tokens. Thus, the generative AI digital visual system improves upon accuracy of existing systems by implementing the frame anchoring technique during training.

Furthermore, in some embodiments, the generative AI digital visual system improves upon accuracy by using spatial-temporal positional encodings. Specifically, the generative AI digital visual system adds spatial-temporal positional encodings to noised tokens to further guide a diffusion transformer model in removing noise from noised tokens.
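One way to picture spatial-temporal positional encodings is a sinusoidal code built from a token's frame index and its patch row and column. The sketch below is an assumption for illustration, not the encoding the disclosure uses: the three-way split of the embedding, the sinusoidal form, and all names are hypothetical.

```python
import math


def sincos(position, dim):
    """Interleaved sine/cosine code for one coordinate (dim must be even)."""
    half = dim // 2
    code = []
    for i in range(half):
        freq = 10000 ** (-i / half)
        code.append(math.sin(position * freq))
        code.append(math.cos(position * freq))
    return code


def spatiotemporal_encoding(frame, row, col, dim):
    """Give a third of the channels each to time, row, and column.

    dim must be divisible by 6 in this toy layout.
    """
    part = dim // 3
    return sincos(frame, part) + sincos(row, part) + sincos(col, part)


def add_positional_encodings(tokens, positions):
    """Add each token's (frame, row, col) encoding to the token vector."""
    return [
        [x + p for x, p in zip(tok, spatiotemporal_encoding(*pos, len(tok)))]
        for tok, pos in zip(tokens, positions)
    ]


tokens = [[0.0] * 6, [0.0] * 6]
positions = [(0, 0, 0), (1, 0, 0)]  # same patch location, consecutive frames
encoded = add_positional_encodings(tokens, positions)
print(encoded[0])
```

Because the temporal channels differ between the two tokens while the spatial channels match, a model can tell that they share a patch location but belong to different frames.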

In one or more embodiments, the generative AI digital visual system improves efficiency relative to existing systems. For example, existing systems suffer from requiring prompting and re-prompting to generate a satisfactory digital video. In contrast, the generative AI digital visual system conserves computational resources by generating a digital video that accurately conforms with an image-to-video request, thus reducing the number of prompts and re-prompts from a client device.

In addition, in one or more embodiments, the generative AI digital visual system utilizes a single stream diffusion transformer model that simplifies the complexity and streamlines the efficiency of generating digital videos from digital images. Specifically, the generative AI digital visual system feeds input data that includes the set of anchor tokens through the diffusion transformer model (without additional modulation layers or adaptive layer normalization layers, hereinafter referred to as adaLN layers) and the diffusion transformer model considers the set of anchor tokens as a guide in removing noise from noised tokens. As such, in one or more embodiments, the generative AI digital visual system reduces the time and resources needed to generate digital video from a digital image.

As also mentioned above, in one or more embodiments, the generative AI digital visual system improves upon operational flexibility relative to existing systems. For example, the generative AI digital visual system provides dynamic flexibility to generate a diverse range of digital videos. For instance, the generative AI digital visual system allows for an image-to-video request that accurately includes a digital image as one or more frames in a generated digital video. Furthermore, the generative AI digital visual system improves flexibility by allowing the image-to-video request to specify whether a visual prompt (e.g., a digital image) is to be included as an initial frame, a final frame, an intermediate frame, a portion of a frame, one or more keyframes, and/or one or more motion frames.

Additional details regarding the generative AI digital visual system will now be provided with reference to the figures. For example, FIG. 1 illustrates a schematic diagram of an exemplary system environment 100 in which a generative AI digital visual system 102 operates. As illustrated in FIG. 1, the system environment 100 includes server(s) 104, a digital image system 106, a network 108, and a client device 110. Additionally, FIG. 1 illustrates that the digital image system 106 includes the generative AI digital visual system 102 and the generative AI digital visual system 102 further includes a frame anchoring system 105.

Although the system environment 100 of FIG. 1 is depicted as having a particular number of components, the system environment 100 is capable of having a different number of additional or alternative components (e.g., a different number of servers, client devices, or other components in communication with the generative AI digital visual system 102 via the network 108). Similarly, although FIG. 1 illustrates a particular arrangement of the server(s) 104, the network 108, and the client device 110, various additional arrangements are possible.

The server(s) 104, the network 108, and the client device 110 are communicatively coupled with each other either directly or indirectly (e.g., through the network 108 discussed in greater detail below in relation to FIG. 15). Moreover, the server(s) 104 and the client device 110 include one or more of a variety of computing devices (including one or more computing devices as discussed in greater detail in relation to FIG. 15).

As mentioned above, the system environment 100 includes the server(s) 104. In one or more embodiments, the server(s) 104 process input for an image-to-video request or for training one or more artificial intelligence models or generating a video from an image-to-video request. In one or more embodiments, the server(s) 104 comprise a data server. In some implementations, the server(s) 104 comprise a communication server or a web-hosting server.

In one or more embodiments, the client device 110 includes computing devices associated with the one or more user accounts that submit image-to-video requests (e.g., media generation requests) for the generative AI digital visual system 102 to generate media (e.g., based on a text prompt and/or a visual prompt). For instance, the generative AI digital visual system 102 trains one or more models (e.g., a diffusion transformer model 107 part of the frame anchoring system 105) from data by using a frame anchoring technique.

In one or more embodiments, the client device 110 includes smartphones, tablets, desktop computers, laptop computers, head-mounted-display devices, or other electronic devices. The client device 110 includes one or more software applications (e.g., the digital image application 112 includes a digital image editing application) for generating content in accordance with the digital image system 106. In one or more embodiments, the digital image application includes a software application hosted on the server(s) 104 accessible by the client device 110 through another application, such as a web browser.

To provide an example implementation, in one or more embodiments, generative AI digital visual system 102 on the server(s) 104 supports the generative AI digital visual system 102 on the client device 110. For instance, in some cases, the digital image system 106 on the server(s) 104 trains the generative AI digital visual system 102 (e.g., trains generative models associated with the frame anchoring system 105, such as the diffusion transformer model 107) to provide to the client device 110 for implementation. In one or more embodiments, the client device 110 obtains (e.g., downloads) the generative AI digital visual system 102 trained on the server(s) 104 for implementation. Once downloaded, the generative AI digital visual system 102 (e.g., which was trained on the server(s) 104) on the client device 110 provides tools for indicating instructions to the generative AI digital visual system 102 to create media (e.g., generate digital videos that include a frame provided by the client device 110).

In alternative implementations, the generative AI digital visual system 102 includes a web hosting application that allows the client device 110 to interact with content and services hosted on the server(s) 104. In other words, the client device 110 interacts with the generative AI digital visual system 102 without downloading the generative AI digital visual system 102. To illustrate, in one or more implementations, the client device 110 accesses a software application supported by the server(s) 104. In response, the generative AI digital visual system 102 on the server(s) 104 provides tools for inputting instructions to generate digital visual content (e.g., a video with video captions and images).

Furthermore, in some implementations, the generative AI digital visual system 102 trains one or more artificial intelligence models by using a diffusion transformer model to generate training embeddings and further utilizes the training embeddings to optimize parameters of a diffusion transformer model (e.g., a diffusion transformer model implemented by the frame anchoring system 105). Moreover, in one or more embodiments, the generative AI digital visual system 102 further generates improved positional encodings that capture spatial and temporal information for image patches in a frame of a sequence of frames and uses the improved positional encodings at inference time and training time (e.g., as data to guide the removal of noise). For instance, the generative AI digital visual system 102 leverages the positional encodings to further improve/optimize the parameters of a diffusion transformer model.

Indeed, in one or more embodiments, the generative AI digital visual system 102 is implemented in whole, or in part, by the individual elements of the system environment 100. For instance, although FIG. 1 illustrates the generative AI digital visual system 102 implemented or hosted on the server(s) 104, different components of the generative AI digital visual system 102 are able to be implemented by a variety of devices within the system environment 100. For example, one or more (or all) components of the generative AI digital visual system 102 are implemented by a different computing device or a separate server from the server(s) 104. Indeed, as shown in FIG. 1, the client device 110 includes the generative AI digital visual system 102. Example components of the generative AI digital visual system 102 will be described below with regard to FIG. 10.

As mentioned above, in certain embodiments, the generative AI digital visual system 102 generates a digital video that includes a digital image from an image-to-video request. FIG. 2 illustrates an overview diagram of the generative AI digital visual system 102 utilizing a diffusion transformer model to generate a digital video from a digital image in accordance with one or more embodiments. For example, FIG. 2 shows an image-to-video request 200 that includes a digital image 202.

In one or more embodiments, the image-to-video request 200 refers to the generative AI digital visual system 102 receiving a request to generate a digital video. Specifically, the generative AI digital visual system 102 receives the image-to-video request 200 in the form of a prompt from a client device to generate media that conforms with the prompt. For instance, the generative AI digital visual system 102 receives the image-to-video request 200 as a visual prompt (e.g., a digital image) and/or a text prompt. To illustrate, the image-to-video request 200 includes specific parameters for the generative AI digital visual system 102, such as creating a digital video based on a provided digital image (e.g., a visual prompt with the digital image 202). Further, the image-to-video request 200 optionally includes a text prompt to generate a digital video where the text prompt specifies conditions (e.g., a conditional prompt), and unconditional prompts (e.g., flexible settings to include in the generated digital video), a format, the subject matter of the media, a style of the media, a mood or theme, and any additional details (e.g., aspect ratio, frames per second, shot size, camera angle, a type of motion such as zooming in or zooming out, etc.).

As mentioned, the image-to-video request includes a visual prompt. In one or more embodiments, a visual prompt refers to a visual input to guide the generative AI digital visual system 102 to generate media. For example, the visual prompt includes the digital image 202. Further, in some instances, the visual prompt further includes a text prompt along with the digital image 202. To illustrate, the generative AI digital visual system 102 receives the visual prompt that includes an image and a text prompt describing the media to be generated.

In one or more embodiments, the digital image 202 includes various pictorial elements. In particular, the pictorial elements include pixel values that define the spatial and visual aspects of the digital image such as text and image objects. For example, the digital image 202 is a rasterized image which includes a grid of pixels. In particular, the rasterized image includes a fixed resolution as determined by a number of pixels within the digital image 202. Furthermore, in some embodiments, the digital image 202 acts as an anchor frame or a condition frame for generating a digital video.

As shown in FIG. 2, the generative AI digital visual system 102 uses a diffusion transformer model 204 (e.g., that in some embodiments includes a transformer block 206) to process the image-to-video request 200 to generate a digital video 208. Specifically, the digital video 208 includes the digital image 202 as one of the frames. In one or more embodiments, the generative AI digital visual system 102 generates a digital video. As is discussed in more detail below, in some embodiments, the generative AI digital visual system 102 utilizes a digital video to train one or more models.

In one or more embodiments, the digital video 208 refers to a form of media that is encoded and stored in a digital format. Specifically, the digital video 208 includes a sequence of frames (e.g., images, keyframes, and/or motion frames) and each frame of the sequence of frames is displayed sequentially. For instance, the digital video 208 includes a specific resolution (480p, 720p, 1080p, 4K, 8K, etc.) which refers to a specific number of pixels being displayed (e.g., a video's resolution defines the clarity and sharpness of the digital video). Further, the digital video 208 includes a frame rate (e.g., a number of frames shown per second in a video e.g., 24 fps, 30 fps, etc.), an aspect ratio (e.g., the width and height dimensions of a frame, such as 16:9 or 4:3), compression (e.g., a file size of the digital video), and audio that goes along with the digital video (e.g., audio files that are synchronized with frames of the digital video).

In one or more embodiments, the digital video 208 includes a sequence of frames. For example, a sequence of frames refers to multiple still images that are displayed in succession to create a perception of motion. Specifically, each frame of a sequence of frames represents a single moment in time and when the sequence of frames is played together, the sequence of frames produces continuous motion and creates the content of the video. In other words, the sequence of frames includes temporal continuity where each frame in the sequence represents a next moment in time and simulates motion when moving from one frame to the next.

In one or more embodiments, the digital video 208 includes an image frame. For example, the image frame refers to a static image that represents content of the digital video 208. Specifically, in one or more embodiments, the generative AI digital visual system 102 treats a first frame (e.g., frame zero) of a sequence of frames as the image frame. In other words, the image frame refers to a first visual element displayed at the start of the video (e.g., a static image at the beginning of the video).

In one or more embodiments, a keyframe refers to an image frame that stores visual data for a beginning or an ending of an action or a position of an object or character. Specifically, a video includes multiple keyframes. In other words, the generative AI digital visual system 102 utilizes keyframes as complete image frames that serve as anchor points for motion. To illustrate, a video includes a sequence of frames, and the sequence of frames includes a keyframe every 16 frames.

In one or more embodiments, the digital video 208 includes at least one motion frame. For example, the generative AI digital visual system 102 utilizes motion frames as intermediate frames between keyframes to store changes or differences from a previous frame. Specifically, the generative AI digital visual system 102 utilizes the motion frames to store information related to changes between successive frames such as a change in position or color of an object from one frame to the next. Further, the generative AI digital visual system 102 utilizes the motion frames at playtime of the digital video 208 in tandem with the keyframes to create a perception of smooth motion from one keyframe to the next keyframe.

As mentioned above, the generative AI digital visual system 102 generates a set of anchor tokens. FIGS. 3A-3B illustrate the generative AI digital visual system 102 generating a set of anchor tokens and text tokens for a conditional prompt and an unconditional prompt in accordance with one or more embodiments. For example, FIGS. 3A-3B show the generative AI digital visual system 102 generating an image embedding 306 from a digital image 300 using an image encoder 302 (e.g., an encoder of a dual-VAE model mentioned below). Specifically, the digital image 300 acts as an anchor digital image/conditioning digital image that is part of an image-to-video request.

In one or more embodiments, the image encoder 302 is a neural network (or one or more layers of a neural network) that extracts features relating to digital images. In some cases, the image encoder 302 refers to a neural network that both extracts and encodes features from the digital image 300. For example, the image encoder 302 includes a particular number of layers including one or more fully connected and/or partially connected layers of neurons that extract image patches from the digital image 300 and encode localized features of the digital image 300. To illustrate, in one or more embodiments, the generative AI digital visual system 102 generates the image embedding 306 that represents a complete frame of a digital image.

In one or more embodiments, the generative AI digital visual system 102 utilizes the image encoder 302 to generate an embedding (e.g., the image embedding 306). In some embodiments, the embedding includes a numerical representation (e.g., a vector) of a digital image 300. For instance, the embedding captures features and properties of the digital image 300. To illustrate, the embedding includes semantic information such as the presence of objects, shapes, and spatial relationships.

Moreover, FIGS. 3A-3B illustrate the generative AI digital visual system 102 utilizing a tokenization model 308 to generate a set of image tokens 310. In one or more embodiments, the generative AI digital visual system 102 transforms the embedding (e.g., image embedding) into image tokens (e.g., visual tokens).

For example, the generative AI digital visual system 102 utilizes the tokenization model 308 to patchify the image embedding 306. Specifically, the tokenization model 308 converts the embedding into smaller patches or grids that are treated as individual tokens for further processing (e.g., adding noise and then denoising). For instance, the generative AI digital visual system 102 utilizes patchification to handle high-dimensional image data efficiently. To illustrate, the generative AI digital visual system 102 flattens each patch of the embedding (e.g., into a single dimension vector), converts the flattened patch into a lower-dimensional representation, and maps the flattened lower-dimensional patch into a fixed-length feature vector.
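The patchify-and-project step described above can be sketched as follows. This is a minimal illustration only, assuming a two-dimensional embedding stored as a list of rows, non-overlapping square patches, and a plain linear projection; the names `patchify` and `project_patch` and the projection weights are hypothetical and do not come from the disclosure.

```python
def patchify(embedding, patch_size):
    """Split a 2-D grid (list of rows) into non-overlapping square patches.

    Each patch is flattened row by row into a one-dimensional list.
    Assumption: the grid height and width are multiples of patch_size.
    """
    height, width = len(embedding), len(embedding[0])
    if height % patch_size or width % patch_size:
        raise ValueError("grid dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = [
                embedding[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(flat)
    return patches


def project_patch(flat_patch, weights):
    """Map a flattened patch to a fixed-length vector with a linear projection.

    weights holds one row per output dimension (hypothetical parameters).
    """
    return [sum(w * x for w, x in zip(row, flat_patch)) for row in weights]


# A 4x4 grid with values 0..15 yields four 2x2 patches.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
patches = patchify(grid, patch_size=2)
tokens = [project_patch(p, weights=[[1, 1, 1, 1], [1, 0, 0, 0]]) for p in patches]
```

In a trained model the projection weights would be learned parameters; the fixed values here only make the data flow visible.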

Accordingly, the generative AI digital visual system 102 treats the flattened fixed-length feature vector as an image token and utilizes the diffusion transformer model to process the image token. Moreover, in some embodiments, the generative AI digital visual system 102 adds positional encodings to each patch (e.g., image token) to encode spatial information about where the patch belongs in a digital image.

In one or more embodiments, the generative AI digital visual system 102 selects a set of image patches from the digital image 300. In particular, the generative AI digital visual system 102 generates the set of image patches by sub-dividing a digital image into smaller regions. For instance, the generative AI digital visual system 102 sub-divides the digital image 300 into patches based on a predetermined resolution (e.g., 256×256), where each patch represents localized regions within the digital image 300. In some embodiments, an image patch of the set of image patches does not share any pixel values with other image patches. In some embodiments, an image patch of the set of image patches overlaps with pixel values of an adjacent image patch. Accordingly, in one or more embodiments, the generative AI digital visual system 102 sub-divides the digital image 300 into image patches where some of the image patches do not overlap with pixel values of other image patches and some of the image patches do overlap with pixel values of other image patches.

As mentioned above, the generative AI digital visual system 102 utilizes the tokenization model 308 to generate image tokens. In one or more embodiments, the set of image tokens 310 refers to image tokens from an embedding for a digital image (e.g., a digital image of a visual prompt). For instance, an image token of the set of image tokens 310 is from an image patch of the digital image 300. In other words, the generative AI digital visual system 102 utilizes the tokenization model 308 to break down the digital image 300 into image patches and further transforms each image patch into an image token. Specifically, the generative AI digital visual system 102 generates the set of image tokens 310 to use as anchor tokens in the denoising process.

As shown in FIGS. 3A-3B, the generative AI digital visual system 102 adds a timestep embedding 313 to the set of image tokens 310. In one or more embodiments, the generative AI digital visual system 102 adds the timestep embedding 313 to tokens (e.g., the set of image tokens 310). For example, the timestep embedding 313 refers to an embedding that represents a specific amount of noise added to a token/set of tokens at a specific timestep. In other words, the generative AI digital visual system 102 generates the timestep embedding 313 corresponding to a first transformer block, a second timestep embedding corresponding to a second transformer block, and a third timestep embedding corresponding to a third transformer block. For instance, the timestep embedding 313 indicates a specific timestep in which noise was added to the noised tokens such that the generative AI digital visual system 102 determines how much noise to remove from a token at a specific transformer block. Moreover, in some embodiments, the generative AI digital visual system 102 adds the timestep embedding 313 to a set of tokens where the timestep embedding indicates that the set of tokens is fully denoised. In other words, in some instances, the timestep embedding indicates to the generative AI digital visual system 102 to not perform any denoising to a specific set of image tokens.

As further illustrated in FIGS. 3A-3B, the generative AI digital visual system 102 adds the timestep embedding 313 to the set of image tokens 310 to generate a set of anchor tokens 315. In one or more embodiments, the set of anchor tokens 315 refers to the timestep embedding 313 added to the set of image tokens 310. Specifically, the set of anchor tokens 315 refers to a conditioning input to the diffusion transformer model.
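One way to picture the anchor-token construction is shown below. This is a hedged sketch, not the disclosed implementation: the disclosure does not state the form of the timestep embedding, so a sinusoidal embedding is assumed here, and the names `timestep_embedding` and `make_anchor_tokens` are hypothetical. The only detail taken from the text is that the embedding added to the image tokens marks them as fully denoised (timestep 0).

```python
import math


def timestep_embedding(timestep, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep (an assumed encoding).

    Returns dim values: dim // 2 sines followed by dim // 2 cosines.
    Assumption: dim is even.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(timestep * f) for f in freqs]
    cosines = [math.cos(timestep * f) for f in freqs]
    return sines + cosines


def make_anchor_tokens(image_tokens, clean_timestep=0):
    """Add the 'fully denoised' timestep embedding to every image token."""
    dim = len(image_tokens[0])
    embedding = timestep_embedding(clean_timestep, dim)
    return [[x + e for x, e in zip(token, embedding)] for token in image_tokens]


image_tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
anchor_tokens = make_anchor_tokens(image_tokens)
```

Because every anchor token carries the same timestep-0 signal, a model trained on this convention can learn to leave those tokens untouched while denoising the others.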

In other words, the set of anchor tokens 315 refers to an anchor/guide that the generative AI digital visual system uses as a guide for denoising/removing noise from noised tokens. As mentioned above, the set of image tokens 310 corresponds to the digital image 300 as part of a visual prompt. Thus, at inference time, the generative AI digital visual system 102 uses the set of anchor tokens 315 to ensure that the output (e.g., the generated digital video) includes the digital image 300 from the visual prompt. Specifically, the generative AI digital visual system 102 denoises noised tokens according to the set of anchor tokens 315. In other words, the generative AI digital visual system 102 anchors its generative mechanism to creating a digital video that includes the fully denoised content (e.g., the set of anchor tokens 315).

As shown in dotted lines in FIG. 3A, in some embodiments, the generative AI digital visual system 102 receives a conditional prompt 304 from a client device. For example, the generative AI digital visual system 102 receives the conditional prompt 304 either implicitly or expressly from an image-to-video request. Specifically, the generative AI digital visual system 102 receives the conditional prompt 304 implicitly by receiving the digital image 300, and the generative AI digital visual system 102 assumes that the digital image 300 is a condition of generating the digital video. In some embodiments, the generative AI digital visual system 102 receives the conditional prompt 304 expressly as part of a text prompt.

In one or more embodiments, the generative AI digital visual system 102 receives the image-to-video request as a text prompt. In particular, the generative AI digital visual system 102 receives a text prompt from a client device that textually describes content to be included within a digital video generated by the generative AI digital visual system 102. For instance, the text prompt describes specific parameters to be included in the media generated by the generative AI digital visual system 102.

In one or more embodiments, the conditional prompt 304 refers to the generative AI digital visual system 102 generating data based on specific input conditions. Specifically, the conditional prompt 304 refers to specific instructions to guide the generation process to produce an output that aligns with the given context. For instance, in some embodiments, the digital image acts as the conditional prompt 304. In other words, the generative AI digital visual system 102 receives the digital image as a condition for generating the digital video (e.g., the digital video must include the digital image as one or more frames).

In one or more embodiments, the generative AI digital visual system 102 receives the conditional prompt 304 as part of an indication by a client device (e.g., checking a box to indicate that the uploaded digital image must be part of the generated digital video). In some embodiments, the generative AI digital visual system 102 receives the conditional prompt as instructions that are part of a text prompt and a visual prompt. For instance, the generative AI digital visual system 102 receives the digital image and further receives text instructions “generate a digital video of a car driving at night towards the city” or “generate a digital video of a car driving at night towards the city, like the image just provided.”

As shown in dotted lines in FIG. 3B, in some embodiments, the generative AI digital visual system 102 further receives an unconditional prompt 316. Similar to the conditional prompt 304, in some embodiments, the generative AI digital visual system 102 receives the unconditional prompt 316 as express or implicit instructions.

In one or more embodiments, the unconditional prompt 316 refers to the generative AI digital visual system 102 generating data without conditioning or guidance. Specifically, the generative AI digital visual system 102 does not rely on specific inputs or prompts to guide the output generation (e.g., the digital video). In other words, the unconditional prompt 316 does not constrain the generative process with external data (e.g., a digital image). To illustrate, the generative AI digital visual system 102 receives the unconditional prompt 316 as part of the image-to-video request (e.g., generate a video with poor resolution and bad aesthetics).

As shown in FIGS. 3A-3B, the generative AI digital visual system 102 further utilizes a text encoder 312 to process the conditional prompt 304 and the unconditional prompt 316. In one or more embodiments, the generative AI digital visual system 102 utilizes the text encoder 312 to process a text prompt. In particular, the text encoder includes a component of a neural network to transform textual data (e.g., the text prompt) into a numerical representation.

For instance, the generative AI digital visual system 102 utilizes the text encoder 312 to transform the text prompt into a text encoding (e.g., text tokens). Further, the generative AI digital visual system 102 utilizes the text encoder in a variety of ways. For instance, the generative AI digital visual system 102 utilizes the text encoder to i) determine the frequency of individual words in the text prompt (e.g., each word becomes a feature vector), ii) determine a weight for each word within the text prompt to generate a text vector that captures the importance of words within a text prompt, iii) generate low-dimensional text vectors in a continuous vector space that represent words within the text prompt, and/or iv) generate contextualized text vectors by determining semantic relationships between words within the text prompt.

In one or more embodiments, the generative AI digital visual system 102 generates text tokens 314 from the conditional prompt 304 and text tokens 320 from the unconditional prompt 316. For example, the generative AI digital visual system 102 utilizes the text encoder 312 to generate a representation of the text prompt for a machine learning task. Specifically, a single text token refers to a word, a sub-word, or a character (e.g., “the,” “on,” “cat,” “t,” “showcasing,” “show,” “casing,” etc.). Furthermore, the generative AI digital visual system 102 generates tokens representing special meaning or purposes such as the beginning or an end of a sentence.

As mentioned above, the generative AI digital visual system 102 utilizes multiple passes through a diffusion transformer model to generate a final output token. FIG. 4 illustrates the generative AI digital visual system 102 performing a first pass and a second pass to generate a final output token in accordance with one or more embodiments. For example, FIG. 4 shows the generative AI digital visual system 102 utilizing a diffusion transformer model 410 to generate a final token output 418.

In one or more embodiments, a machine learning model includes a computer algorithm or a collection of computer algorithms that is trained and/or tuned based on inputs to approximate unknown functions. For example, a machine learning model includes a computer algorithm with branches, weights, or parameters that change based on training data to improve at a particular task. Thus, a machine learning model utilizes one or more learning techniques to improve in accuracy and/or effectiveness. Example machine learning models include various types of decision trees, support vector machines, Bayesian networks, random forest models, or neural networks (e.g., deep neural networks).

Similarly, a neural network includes a machine learning model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. In some instances, a neural network includes an algorithm (or set of algorithms) that implements deep learning techniques that utilize a set of algorithms to model high-level abstractions in data. To illustrate, in some embodiments, a neural network includes a convolutional neural network, a recurrent neural network (e.g., a long short-term memory neural network), a transformer neural network, a generative adversarial neural network, a graph neural network, a diffusion neural network, or a multi-layer perceptron. In some embodiments, a neural network includes a combination of neural networks or neural network components.

In one or more embodiments, the generative AI digital visual system 102 utilizes a diffusion model as the neural network. For example, the diffusion model refers to a generative machine learning model that reconstructs data by removing noise from noised input data. Specifically, the generative AI digital visual system 102 trains the diffusion model to remove noise, compares a denoised representation to a ground truth, and modifies parameters of the diffusion model.

In one or more embodiments, the generative AI digital visual system 102 utilizes a diffusion transformer model. Specifically, the diffusion transformer model refers to a model architecture that leverages principles of diffusion models with a transformer architecture. For example, the diffusion transformer model includes deep learning self-attention mechanisms that process sequential data. For instance, the diffusion transformer model establishes relationships between elements in a sequence using self-attention mechanisms. To illustrate, the generative AI digital visual system 102 utilizes the diffusion transformer model to denoise noised representations (e.g., noised tokens) to reconstruct data and generate media (e.g., video, images, text, etc.).

In one or more embodiments, a first pass refers to the generative AI digital visual system 102 passing an initial input through the diffusion transformer model 410. Specifically, the generative AI digital visual system 102 generates a conditional token output 412 from the first pass.

FIG. 4 illustrates the generative AI digital visual system 102 generating a combined token 401. For example, FIG. 4 shows the combined token 401 including text tokens 400 from the conditional prompt, a set of anchor tokens 402, and noised tokens 404. Specifically, at inference time, the generative AI digital visual system 102 initializes the noised tokens 404.

In one or more embodiments, the generative AI digital visual system 102 initializes the noised tokens 404 from a noise distribution to generate the noised tokens 404. Specifically, at inference time (e.g., runtime), the generative AI digital visual system 102 utilizes the diffusion transformer model 410 to process the noised tokens 404. For instance, the generative AI digital visual system 102 initializes the noised tokens 404 and processes the noised tokens 404 along with the set of anchor tokens 402. For instance, the generative AI digital visual system 102 generates the noised tokens 404 by adding Gaussian noise sampled from a normal distribution with a mean of zero and a specified standard deviation, where the noise distribution ranges from t=0 to t=1000, t=1000 indicates that the token is fully noised, and t=0 indicates that the token is fully denoised.
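Initializing fully noised tokens at inference time can be sketched as below. This is an illustrative example under stated assumptions: the function name `init_noised_tokens`, the token shape, and the seeded generator are hypothetical, while the zero mean, the configurable standard deviation, and the t=0 and t=1000 endpoints follow the description above.

```python
import random

# Timestep endpoints taken from the description above.
T_FULLY_DENOISED = 0
T_FULLY_NOISED = 1000


def init_noised_tokens(num_tokens, dim, std=1.0, seed=None):
    """Sample tokens of pure Gaussian noise with mean zero.

    A seed is accepted only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_tokens)]


# Eight noise tokens of width four, all starting at the fully noised timestep.
noised_tokens = init_noised_tokens(num_tokens=8, dim=4, seed=0)
noised_timesteps = [T_FULLY_NOISED] * len(noised_tokens)
```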

In one or more embodiments, the combined token 401 refers to the generative AI digital visual system 102 combining (e.g., concatenating) the set of anchor tokens 402 and the noised tokens 404 to pass through a diffusion transformer model. Specifically, the combined token 401 refers to the generative AI digital visual system 102 generating a combined input for a first pass through the diffusion transformer model 410. In some embodiments, the generative AI digital visual system 102 also combines the text tokens 400 with the set of anchor tokens 402, and the noised tokens 404 to generate the combined token 401. Specifically, the generative AI digital visual system 102 generates the combined token 401 from text tokens for the conditional prompt of an image-to-video request.
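The concatenation described above can be sketched as a simple sequence join. This is a minimal illustration, assuming tokens are plain lists of floats and that the order is text, then anchor, then noised; the function name `build_combined_token` and the role labels are hypothetical bookkeeping added so the three segments can be told apart afterward.

```python
def build_combined_token(text_tokens, anchor_tokens, noised_tokens):
    """Concatenate text, anchor, and noised tokens into one input sequence.

    Returns the sequence plus a parallel list of role labels. The segment
    order is an assumption made for this sketch.
    """
    sequence = list(text_tokens) + list(anchor_tokens) + list(noised_tokens)
    roles = (
        ["text"] * len(text_tokens)
        + ["anchor"] * len(anchor_tokens)
        + ["noised"] * len(noised_tokens)
    )
    return sequence, roles


text = [[0.1, 0.2]]
anchors = [[1.0, 1.0], [2.0, 2.0]]
noise = [[0.3, -0.7], [1.1, 0.4], [-0.2, 0.9]]
combined, roles = build_combined_token(text, anchors, noise)
```

The same helper serves the second pass if the text tokens from the unconditional prompt are passed in place of the conditional ones while the anchor tokens stay the same.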

As illustrated in FIG. 4, the generative AI digital visual system 102 generates the conditional token output 412 from a first pass through the diffusion transformer model 410. In one or more embodiments, the conditional token output 412 refers to the denoised tokens generated by the generative AI digital visual system 102 processing the combined token 401. Specifically, the generative AI digital visual system 102 uses the combined token 401 to generate denoised tokens according to the set of anchor tokens 402 (e.g., the denoising process is guided by the digital image). Thus, the conditional token output 412 refers to data that indicates the conditional aspects of an image-to-video request.

Furthermore, as shown, the generative AI digital visual system 102 further performs a second pass to generate an unconditional token output 414. In one or more embodiments, a second pass refers to the generative AI digital visual system 102 passing a second input through the diffusion transformer model 410. Specifically, the generative AI digital visual system 102 generates the unconditional token output 414 from the second pass. For example, the generative AI digital visual system 102 generates the unconditional token output 414 from an additional combined token 403.

In one or more embodiments, the additional combined token 403 refers to the generative AI digital visual system 102 combining (e.g., concatenating) the set of anchor tokens 402 and noised tokens 408 for a second pass through the diffusion transformer model 410. In some embodiments, the generative AI digital visual system 102 generates the additional combined token 403 by combining text tokens 406 from the unconditional prompt of the image-to-video request, the set of anchor tokens 402, and the noised tokens 408 (e.g., noised tokens initialized for the combined token 401 or additional noised tokens that contain either the same level of noise as the noised tokens 404 or a different level of noise than the noised tokens 404).

Although the additional combined token 403 is for a second pass and uses the text tokens 406 from the unconditional prompt, in one or more embodiments, the generative AI digital visual system 102 uses the set of anchor tokens 402 on the second pass. In using the set of anchor tokens 402 for both the first pass and the second pass, the generative AI digital visual system 102 improves the accuracy and quality of a generated digital video (e.g., especially if the objective is to include a digital image within the generated digital video).

As shown in FIG. 4, the generative AI digital visual system 102 further utilizes the diffusion transformer model 410 to generate the unconditional token output 414 from the additional combined token 403. Specifically, the generative AI digital visual system 102 generates the unconditional token output 414 based on the unconditional prompt and is further guided in the denoising process by the set of anchor tokens 402. Thus, the unconditional token output 414 indicates unconditional aspects of the image-to-video request.

FIG. 4 further shows the generative AI digital visual system 102 generating a final token output 418 from the conditional token output 412 and the unconditional token output 414. In one or more embodiments, the final token output 418 refers to a combination of the conditional token output 412 and the unconditional token output 414. Specifically, the final token output 418 refers to the generative AI digital visual system 102 using a classifier free guidance model 416 to encourage the diffusion transformer model 410 to generate a digital video based on either the conditional token output 412 or the unconditional token output 414. For instance, the generative AI digital visual system 102 uses the final token output 418 to generate the digital video.

In one or more embodiments, the classifier free guidance model 416 refers to a model that does not rely on an express classifier. Specifically, the generative AI digital visual system 102 uses the classifier free guidance model 416 to steer an output of the diffusion transformer model 410 towards desired characteristics. For instance, the generative AI digital visual system 102 uses a weight to indicate whether to favor the conditional or unconditional output more heavily. In some embodiments, the generative AI digital visual system 102 uses the classifier free guidance model 416 to interpolate between the conditional token output 412 and the unconditional token output 414. Specifically, the interpolation includes a guidance scale that encourages the diffusion transformer model to generate the digital video based on the conditional token output.
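The interpolation between the two outputs can be sketched with the commonly used classifier-free guidance formula, final = unconditional + scale × (conditional − unconditional). The disclosure does not state the exact formula, so this form is an assumption, as are the function name `classifier_free_guidance` and the example values.

```python
def classifier_free_guidance(conditional, unconditional, guidance_scale):
    """Blend conditional and unconditional token outputs elementwise.

    A scale of 0 returns the unconditional output, a scale of 1 returns
    the conditional output, and larger scales push further toward the
    conditional output.
    """
    return [
        [u + guidance_scale * (c - u) for c, u in zip(cond_tok, uncond_tok)]
        for cond_tok, uncond_tok in zip(conditional, unconditional)
    ]


conditional_output = [[2.0, 4.0], [0.0, 1.0]]
unconditional_output = [[1.0, 2.0], [0.0, 3.0]]
final_output = classifier_free_guidance(
    conditional_output, unconditional_output, guidance_scale=3.0
)
```

With a scale above 1 the result extrapolates past the conditional output, which is how a single weight can favor the conditional signal more heavily.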

As discussed above, the generative AI digital visual system 102 utilizes the set of anchor tokens 402 to guide the removal of noise and to anchor a digital image as a certain frame within a generated digital video. In some embodiments, the set of anchor tokens 402 corresponds to an initial frame of a digital video. For instance, the image-to-video request includes a digital image and further indicates that the generated product should include the digital image as the initial frame of the digital video.

In some embodiments, the set of anchor tokens 402 corresponds to an intermediate frame of a digital video. For instance, the image-to-video request includes a digital image and further indicates that the generated product should include the digital image as one or more of the intermediate frames of the video. In some embodiments, the set of anchor tokens 402 corresponds to a final frame of a digital video. For instance, the image-to-video request includes a digital image and further indicates that the generated product should include the digital image as the last or final frame of the video.

In some embodiments, the set of anchor tokens 402 corresponds to a keyframe of a digital video. For instance, the image-to-video request includes a digital image and further indicates that the generated product should include the digital image as one or more keyframes. In some embodiments, the set of anchor tokens 402 corresponds to a motion frame of a digital video. For instance, the image-to-video request includes a digital image and further indicates that the generated product should include the digital image as one or more motion frames.

In some embodiments, the set of anchor tokens 402 corresponds to a portion of a frame of a digital video. For instance, the image-to-video request includes a digital image and further indicates that the generated product should include the digital image as a portion of a frame (e.g., first, intermediate, last, keyframe, or motion frame). In other words, the generative AI digital visual system 102 uses the digital image to encompass a partial frame in the generated digital video and fills in the rest of the frame with additional content.

As mentioned above, the generative AI digital visual system 102 utilizes a diffusion transformer model with a streamlined architecture. FIG. 5 illustrates the generative AI digital visual system 102 utilizing transformer blocks of a diffusion transformer model to generate a conditional token output and an unconditional token output in accordance with one or more embodiments. For example, FIG. 5 shows the generative AI digital visual system 102 processing a combined token 501 that includes text tokens 500 (e.g., from the conditional prompt for generating a conditional token output or from the unconditional prompt for generating an unconditional token output), a set of anchor tokens 502, and noised tokens 504.

As shown in FIG. 5, the generative AI digital visual system 102 utilizes a first transformer block 506 of a diffusion transformer model to process the combined token 501. In one or more embodiments, a transformer block refers to an individual block in a single stream transformer. Specifically, the generative AI digital visual system 102 utilizes a transformer block of a single stream transformer to remove noise from a noised token. For instance, for a single stream transformer with multiple transformer blocks, the generative AI digital visual system 102 utilizes a first transformer block to remove some noise from a noised token to generate an intermediate denoised token.

In one or more embodiments, the generative AI digital visual system 102 utilizes transformer blocks to remove at least some noise from the noised tokens 504. For instance, the generative AI digital visual system 102 generates an intermediate denoised token by using the first transformer block 506. In particular, an intermediate denoised token refers to a token that is still partially noised. Once the generative AI digital visual system 102 utilizes the single stream transformer to remove all the noise from the noised tokens, the generative AI digital visual system 102 generates the denoised tokens (e.g., conditional/unconditional token output 524).

FIG. 5 shows that the first transformer block 506 includes a self-attention layer 508, a combined self-attention layer output 510, a multi-layer perceptron 512, and a multi-layer perceptron output 514. In one or more embodiments, the self-attention layer 508 refers to a layer that captures the importance of different tokens (e.g., words or patches) in a sequence relative to each other. Specifically, the generative AI digital visual system 102 utilizes the self-attention layer 508 to capture relationships and dependencies between tokens (e.g., for both short-range and long-range dependencies).

In other words, the generative AI digital visual system 102 utilizes the self-attention layer 508 to determine how much attention a token should give to another token. To illustrate, the generative AI digital visual system 102 utilizes the self-attention layer 508 to generate three vectors for each token: 1) a query vector (e.g., represents the token seeking information from other tokens), 2) a key vector (e.g., represents the token providing information to other tokens), and 3) a value vector (e.g., represents the actual content of the token).

In one or more embodiments, the generative AI digital visual system 102 utilizes the self-attention layer 508 to generate a self-attention layer output that represents an updated set of intermediate noised tokens (e.g., or denoised tokens) that incorporate information from other noised tokens (e.g., the updated set of noised tokens represents relationships between tokens). In one or more embodiments, the generative AI digital visual system 102 further combines the self-attention layer output with the initial input to the transformer block corresponding to the self-attention layer to generate the combined self-attention layer output 510.
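
To illustrate the query, key, and value computation and the combination of the self-attention layer output with the block input, a minimal Python sketch follows. The single attention head, the scaled dot-product scoring, the softmax weighting, and all names (e.g., `self_attention_with_residual`, `w_q`, `w_k`, `w_v`) are illustrative assumptions; the disclosure does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def self_attention_with_residual(tokens, w_q, w_k, w_v):
    """Single-head self-attention followed by a residual addition.

    Assumption: the value projection keeps the token dimension, so the
    attention output can be added directly to the block input.
    """
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))

    combined = []
    for token, query in zip(tokens, queries):
        # How much attention this token gives to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        attended = [
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(token))
        ]
        # Combine the self-attention output with the input to the block.
        combined.append([x + a for x, a in zip(token, attended)])
    return combined
```

In this sketch, anchor tokens and noised tokens would simply be entries of the same `tokens` list, so every noised token can attend to the anchor tokens within a single stream.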

In one or more embodiments, the generative AI digital visual system 102 utilizes the multi-layer perceptron 512. For example, the multi-layer perceptron 512 refers to an artificial neural network with multiple layers of neurons that are fully connected. Specifically, the multi-layer perceptron 512 includes an input layer, where the input data is fed into the network, hidden layers (e.g., intermediate layers between an input and output layer, where the hidden layers receive input from all the neurons in the previous layer), and an output layer that generates the multi-layer perceptron output 514 (e.g., by combining the combined self-attention layer output with the output from the multi-layer perceptron 512).

FIG. 5 shows the generative AI digital visual system 102 generating a first set of intermediate denoised tokens 516 and utilizing a second transformer block 518 to further generate a second set of intermediate denoised tokens 520. Moreover, FIG. 5 shows the generative AI digital visual system 102 utilizing an Nth transformer block 522 to generate the conditional/unconditional token output 524.

FIG. 6 illustrates the generative AI digital visual system 102 transforming a final token output into a digital video in accordance with one or more embodiments. For example, FIG. 6 shows the generative AI digital visual system 102 utilizing a detokenization model 602 to process a final token output 600.

In one or more embodiments, the generative AI digital visual system 102 transforms the final token output 600 (e.g., denoised tokens) into embeddings by utilizing the detokenization model 602. For example, the generative AI digital visual system 102 utilizes the detokenization model 602 to unpatchify denoised tokens. Specifically, unpatchification involves a reverse process of patchification to reconstruct an image (e.g., a sequence of frames) from a set of denoised tokens (e.g., the final token output 600).

For instance, the generative AI digital visual system 102 rearranges the denoised tokens (e.g., the final token output 600) and combines the rearranged denoised tokens into an initial (original) image structure/frame. In other words, the generative AI digital visual system 102 utilizes the detokenization model 602 to rearrange tokens to resemble embeddings 604 (e.g., an entire frame put together).
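
As a minimal sketch of unpatchification as the reverse of patchification, the following Python example splits a two-dimensional frame into square patches and then reassembles it. The square patch shape, the row-major ordering of patches, and the function names `patchify` and `unpatchify` are assumptions for illustration and are not taken from the disclosure.

```python
def patchify(frame, patch):
    """Split a 2D frame (list of rows) into flattened patch tokens, row-major."""
    height, width = len(frame), len(frame[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [frame[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return tokens


def unpatchify(tokens, height, width, patch):
    """Rearrange flattened patch tokens back into the original frame layout."""
    frame = [[None] * width for _ in range(height)]
    patches_per_row = width // patch
    for index, token in enumerate(tokens):
        top = (index // patches_per_row) * patch
        left = (index % patches_per_row) * patch
        for k, value in enumerate(token):
            frame[top + k // patch][left + k % patch] = value
    return frame
```

For a 4x4 frame and a patch size of 2, `patchify` yields four tokens of four values each, and `unpatchify` restores the original frame from those tokens.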

Furthermore, in some embodiments, the generative AI digital visual system 102 utilizes a decoder 606 to process the denoised tokens (which have been unpatchified) and generates a media item such as a digital video 608. In one or more embodiments, the generative AI digital visual system 102 utilizes the decoder 606 that includes one or more layers (e.g., linear transformation, self-attention layer, softmax layer, etc.) to transform the embeddings 604 into the digital video 608. Specifically, the decoder 606 transforms denoised tokens in the latent space to images/frames in the pixel space. In one or more embodiments, the generative AI digital visual system 102 utilizes one or more decoders of a dual-variational autoencoder model, which is described in application Ser. No. 18/930,665, titled DUAL-VAE FOR MORE EFFICIENT AND EFFECTIVE DIFFUSION MODEL TRAINING, filed on Oct. 29, 2024, which is fully incorporated by reference herein.

As mentioned above, the generative AI digital visual system 102 trains a diffusion transformer model in a manner to more accurately include a digital image in a generated digital video. FIG. 7 illustrates the generative AI digital visual system 102 receiving a training digital video and generating image tokens (e.g., a set of training tokens) from the training digital video in accordance with one or more embodiments. For example, FIG. 7 shows that at training time, the generative AI digital visual system 102 receives a training digital video that includes a sequence of frames (e.g., a sequence of training frames). To illustrate, a sequence of training frames of a training digital video includes a hundred image tokens, where the first ten tokens correspond to a first frame, and the next ninety image tokens correspond to the next nine frames of the sequence of frames (e.g., ten tokens per frame).
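
The hundred-token example above can be sketched in Python as follows. The function name `split_tokens_by_frame` and the flat-list token representation are hypothetical and are used only for illustration.

```python
def split_tokens_by_frame(tokens, tokens_per_frame):
    """Group a flat token sequence into consecutive per-frame token lists."""
    if len(tokens) % tokens_per_frame != 0:
        raise ValueError("token count must be a multiple of tokens_per_frame")
    return [
        tokens[start:start + tokens_per_frame]
        for start in range(0, len(tokens), tokens_per_frame)
    ]


# One hundred tokens at ten tokens per frame gives ten frames; the first
# frame holds tokens 0-9 and the remaining ninety tokens fill nine frames.
frames = split_tokens_by_frame(list(range(100)), tokens_per_frame=10)
```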

FIG. 7 illustrates the generative AI digital visual system 102 utilizing an encoder 708 to generate embedding 710 for a first frame 702, embedding 720 for a second frame 704, and embedding 730 for an Nth frame 706. Furthermore, FIG. 7 shows the generative AI digital visual system 102 utilizing a tokenization model 712 to generate image tokens 714 for the first frame 702, image tokens 724 for the second frame 704, and image tokens 732 for the Nth frame 706.

In one or more embodiments, at training time, the generative AI digital visual system 102 adds noise to the image tokens (e.g., clean tokens corresponding to frames of a training video) over several timesteps. For instance, the generative AI digital visual system 102 adds noise to tokens over a number of timesteps corresponding to a number of transformer blocks (e.g., denoising blocks) in the diffusion transformer model. Specifically, the generative AI digital visual system 102 randomly samples from the noise distribution to determine how much noise (where t ranges from 0 to 1000) to add to the clean image tokens.

In some embodiments, the generative AI digital visual system 102 adds the same amount of noise to all the image tokens during training, while in some embodiments, the generative AI digital visual system 102 varies the amount of noise added to tokens. Specifically, FIG. 7 shows the generative AI digital visual system 102 adding noise to image tokens 724 to generate noised tokens 726 and adding noise to image tokens 732 to generate noised tokens 734 (e.g., noised training tokens). However, as illustrated in FIG. 7, the generative AI digital visual system 102 does not add any noise to the image tokens 714 (e.g., indicated by t=0) as the image tokens 714 are from the first frame 702, which is being used as a condition/anchor frame (e.g., the training anchor tokens).

As further shown, the generative AI digital visual system 102 adds timestep embedding 716 to the image tokens 714, adds timestep embedding 728 to the image tokens 724, and adds timestep embedding 736 to the image tokens 732. As discussed above, the timestep embeddings indicate to the generative AI digital visual system 102 utilizing a diffusion transformer as to the amount of noise added to the image tokens. Thus, as shown in FIG. 7, t=T indicates that the noised tokens 726 and the noised tokens 734 are fully noised.
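
A minimal Python sketch of this training-time noising follows, in which the anchor frame keeps its clean tokens with a timestep of 0 and every other frame is noised at a shared sampled timestep. The linear interpolation between clean values and Gaussian noise, the dictionary layout, and the names `noise_frames` and `MAX_TIMESTEP` are assumptions for illustration; the disclosure does not specify the noise formula.

```python
import random

MAX_TIMESTEP = 1000  # the description above uses t ranging from 0 to 1000


def noise_frames(frame_tokens, anchor_frame=0, timestep=None, seed=None):
    """Noise every frame except the anchor frame and record each timestep.

    frame_tokens: list of frames, each a list of tokens (lists of floats).
    Assumption: noising is a linear blend of clean values and Gaussian noise.
    """
    rng = random.Random(seed)
    t = rng.randint(1, MAX_TIMESTEP) if timestep is None else timestep
    keep = 1.0 - t / MAX_TIMESTEP  # fraction of the clean signal retained

    result = []
    for index, tokens in enumerate(frame_tokens):
        if index == anchor_frame:
            # Anchor tokens receive no noise, which is signaled by t = 0.
            result.append({"timestep": 0, "tokens": [list(tok) for tok in tokens]})
            continue
        noised = [
            [keep * value + (1.0 - keep) * rng.gauss(0.0, 1.0) for value in tok]
            for tok in tokens
        ]
        result.append({"timestep": t, "tokens": noised})
    return result
```

Passing `timestep=MAX_TIMESTEP` in this sketch corresponds to the fully noised case (t=T), in which none of the clean signal is retained for the non-anchor frames.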

As discussed above in context of inference time, the generative AI digital visual system 102 utilizes a set of anchor tokens (e.g., the image tokens 714 with no noise added that act as the training anchor tokens) to remove noise from noised tokens. In one or more embodiments, at training time, the generative AI digital visual system 102 also uses the set of anchor tokens (e.g., the image tokens 714 with no noise added as indicated by the timestep embedding 716) to remove noise from the noised tokens 726 and the noised tokens 734. Specifically, the noised tokens originate from a sequence of frames (e.g., excluding the anchor frame, which is the frame used to create the set of anchor tokens at training time). Accordingly, the generative AI digital visual system 102 uses the set of anchor tokens (e.g., the image tokens 714) at training time to guide the denoising process.

Moreover, FIG. 7 shows that in one or more embodiments, the generative AI digital visual system 102 further utilizes a text prompt 738 at training time. Specifically, FIG. 7 shows the text prompt 738 as “a car driving at night towards the city.” Furthermore, FIG. 7 shows the generative AI digital visual system 102 utilizing a text encoder 740 to generate text tokens 742.

In one or more embodiments, the generative AI digital visual system 102 further staggers the anchor frames across multiple frames of a sequence of frames. Specifically, for a sequence of frames that includes 50 frames, the generative AI digital visual system 102 utilizes the first, the eleventh, the twenty-first, the thirty-first and the forty-first frames as the anchor frames for training purposes. For instance, the generative AI digital visual system 102 optimizes a diffusion transformer model to treat a subset of frames as the anchor frames for removing noise from noised tokens.
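
The 50-frame example above is consistent with selecting every tenth frame starting from the first, which can be sketched in Python as follows. The regular stride and the function name `staggered_anchor_frames` are assumptions inferred from the example rather than requirements of the disclosure.

```python
def staggered_anchor_frames(num_frames, stride):
    """Return 1-based frame numbers used as anchor frames at a fixed stride."""
    return list(range(1, num_frames + 1, stride))


# For 50 frames and a stride of 10, the anchors are frames 1, 11, 21, 31, 41.
anchor_frames = staggered_anchor_frames(num_frames=50, stride=10)
```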

FIG. 8 further illustrates the generative AI digital visual system 102 adding spatial-temporal positional encodings to noised tokens and/or the set of anchor tokens in accordance with one or more embodiments. For example, FIG. 8 shows the generative AI digital visual system 102 generating a frame N embedding 804 for frame N 800 by using an encoder 802. Furthermore, FIG. 8 shows the generative AI digital visual system 102 utilizing a tokenization model 806 to generate image tokens 808 from the frame N embedding 804. Moreover, in some embodiments, the generative AI digital visual system 102 generates a temporal embedding 810 and a spatial embedding 812 for an image token 814 of the image tokens 808.

In one or more embodiments, the spatial embedding 812 refers to a representation of spatial relationships and positions of visual elements within a frame (e.g., an image) of a sequence of frames. Specifically, the spatial embedding 812 includes an indication of where objects/elements in a frame are located, the orientation of objects/elements, the size of objects/elements, and their spatial relationship with different regions of the frame that they are located within. For instance, the generative AI digital visual system 102 utilizes coordinate information (e.g., x-dimension and y-dimension, and in some embodiments a z-dimension) for objects/elements within a frame.

In some embodiments, the spatial embedding 812 indicates absolute position within a frame and in some embodiments, the spatial embedding 812 indicates relative position (e.g., relative to other objects/elements within a frame). In one or more embodiments, the generative AI digital visual system 102 utilizes a centered two-dimensional coordinate map to generate the spatial embedding 812 of the image token 814 with noise 816 added to the image token 814.
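
One possible reading of a centered two-dimensional coordinate map is a grid of (x, y) positions measured from the center of the frame, sketched below in Python. The unnormalized offsets, the (x, y) ordering, and the name `centered_coordinate_map` are assumptions; the referenced application may define the map differently.

```python
def centered_coordinate_map(rows, cols):
    """Return a rows x cols grid of (x, y) offsets from the frame center."""
    row_offset = (rows - 1) / 2
    col_offset = (cols - 1) / 2
    return [
        [(col - col_offset, row - row_offset) for col in range(cols)]
        for row in range(rows)
    ]


# In a 3x3 grid of patches, the middle patch sits at (0.0, 0.0) and the
# corner patches sit one unit away along each axis.
grid = centered_coordinate_map(3, 3)
```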

In one or more embodiments, the temporal embedding 810 refers to a representation of a frame within a sequence of visual frames. Specifically, the generative AI digital visual system 102 utilizes the temporal embedding 810 to capture motion information, action sequences, and transitions between frames within a sequence of frames. In other words, the generative AI digital visual system 102 generates the temporal embedding 810 to create a representation of sequential dependencies between frames of a sequence of frames.

In one or more embodiments, the generative AI digital visual system 102 generates the temporal embedding 810 based on a timestamp and an inverse timestamp. For example, the generative AI digital visual system 102 determines a timestamp for the image token 814 (e.g., for a first frame of a sequence of frames of the video). Specifically, a timestamp of a first frame refers to a specific point in time at which a frame of the image token 814 appears within the overall video or the sequence of frames, relative to the start of the video.

Furthermore, the generative AI digital visual system 102 determines an inverse timestamp, which refers to the difference between the total length of the video and the temporal position (e.g., current position) of the frame of the image token 814 relative to the sequence of frames. Moreover, the generative AI digital visual system 102 combines the timestamp and the inverse timestamp of the image token 814 with the noise 816 to generate the temporal embedding 810.
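
The timestamp and inverse timestamp can be sketched in Python as follows, measuring both in seconds. The frame-rate parameter, the use of seconds as the unit, and the name `timestamp_pair` are assumptions for illustration; how the two values are turned into the temporal embedding is left to the referenced application and is not shown.

```python
def timestamp_pair(frame_index, num_frames, frames_per_second):
    """Return (timestamp, inverse_timestamp) in seconds for a 0-based frame."""
    if not 0 <= frame_index < num_frames:
        raise ValueError("frame_index is outside the sequence of frames")
    timestamp = frame_index / frames_per_second
    total_length = num_frames / frames_per_second
    # Inverse timestamp: total video length minus the frame's position.
    return timestamp, total_length - timestamp


# Frame 25 of a 50-frame video at 25 frames per second sits one second
# after the start and one second before the end.
pair = timestamp_pair(frame_index=25, num_frames=50, frames_per_second=25)
```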

To illustrate, the generative AI digital visual system 102 utilizes the methods discussed in application Ser. No. 18/930,681, titled POSITIONAL EMBEDDING AND TRAINING TECHNIQUES FOR A DIFFUSION MODEL, filed on Oct. 29, 2024, which is fully incorporated by reference herein, to generate the temporal embedding 810 and the spatial embedding 812.

As further shown in FIG. 8, the generative AI digital visual system 102 combines the spatial embedding 812 and the temporal embedding 810 to generate spatial-temporal positional encodings 818. In one or more embodiments, the spatial-temporal positional encodings 818 refer to a data representation of information relating to both spatial relationships and positions of visual elements within a frame and motion information, action sequences, and transitions between frames within a sequence of frames (e.g., sequential dependencies between frames). Specifically, the spatial-temporal positional encodings 818 include a combined data representation that captures information from the visual dimension and the temporal dimension. Accordingly, the generative AI digital visual system 102 utilizes the spatial-temporal positional encodings 818 to remove noise from noised tokens (e.g., the noised set of the image tokens 808) in a high-quality and accurate manner (e.g., to incorporate the context indicated by the data in the spatial-temporal positional encodings 818).

Further, as shown in FIG. 8, the generative AI digital visual system 102 combines/adds the image token 814 with the noise 816 with the spatial-temporal positional encodings 818 (e.g., to generate a combined noised token with spatial-temporal positional encodings). Thus, the generative AI digital visual system 102 processes the image token 814 with the noise 816 and the spatial-temporal positional encodings 818 using a diffusion transformer model to remove noise according to the spatial-temporal positional encodings 818. To reiterate, the generative AI digital visual system 102 utilizes the spatial-temporal positional encodings 818 in tandem with the set of anchor tokens (discussed above) as a guide to remove noise from noised tokens.

In one or more embodiments, the generative AI digital visual system 102 utilizes the spatial-temporal positional encodings 818 (e.g., the combined noised token with spatial-temporal positional encodings) as a foundation for anchoring image token(s) or an entire frame in a generated digital video. Specifically, the spatial-temporal positional encodings 818 are a combined data representation that captures information from the visual dimension and the temporal dimension (e.g., the encoding contains information such as where a visual component should spatially and temporally appear).

As such, the generative AI digital visual system 102 uses the spatial-temporal positional encodings 818 as an anchor to indicate to the diffusion transformer model to include an image token, multiple image tokens, or an entire image frame at specific instances in a generated digital video. For instance, the generative AI digital visual system 102 leverages the spatial-temporal positional encodings 818 to indicate that an image patch (e.g., an image patch corresponding to an image token) should be included in every other frame of a digital video. In other words, the generative AI digital visual system 102 uses the spatial-temporal positional encodings 818 to indicate the space and time of where an image token should be included in a generated digital video.

To further illustrate, the generative AI digital visual system 102 receives a visual prompt (e.g., a digital image) and further receives instructions indicating that a generated digital video should include a specific object (e.g., a car) portrayed in the digital image at the top right corner of every frame of the generated digital video. In response, the generative AI digital visual system 102 generates the spatial-temporal positional encodings 818 indicating the spatial location (e.g., top right corner) and the temporal location (e.g., every frame of a digital video) by adjusting an associated timestep to indicate that the spatial-temporal positional encodings 818 are fully denoised and, thus, are not denoised during the denoising process.

In some instances, the generative AI digital visual system 102 receives a visual prompt and further receives instructions indicating that a generated digital video should include the entire visual prompt (e.g., a digital image) every 3 seconds (e.g., or at the 25 second mark of the digital video) within a generated digital video. In response, the generative AI digital visual system 102 generates the spatial-temporal positional encodings 818 to conform with the received instructions.

Although FIG. 8 discusses generating the spatial-temporal positional encodings 818 in context of training, in one or more embodiments, the generative AI digital visual system 102 also uses the spatial-temporal positional encodings 818 at inference time. Specifically, in response to an image-to-video request from a client device, the generative AI digital visual system 102 generates spatial-temporal positional encodings for a digital image that is part of a visual prompt. Furthermore, the generative AI digital visual system 102 generates spatial-temporal positional encodings for any media attributes indicated by a client device.

For instance, the media attributes include a type of media (e.g., an image or a video), a format of the media, a subject matter of the media, a style of the media, a mood or theme, and any additional details (e.g., aspect ratio, frames per second, shot size, camera angle, a type of motion such as zooming in or zooming out, etc.).

FIG. 9 illustrates the generative AI digital visual system 102 generating a measure of loss and modifying parameters of a diffusion transformer model in accordance with one or more embodiments. As discussed above in FIG. 7, the generative AI digital visual system 102 generates image tokens for a training digital video, adds noise to the image tokens (e.g., a set of training tokens, although the generative AI digital visual system 102 does not add noise to training tokens of an anchor/conditioning frame), and removes noise from the noised tokens (e.g., noised training tokens). Specifically, FIG. 9 shows a set of anchor tokens 902 (e.g., for a conditioning/anchor frame, also known as a set of training anchor tokens), noised tokens 904, and noised tokens 906.

Furthermore, FIG. 9 shows the generative AI digital visual system 102 utilizing a diffusion transformer model 908 (e.g., that includes a transformer block 910) to process the set of anchor tokens 902, the noised tokens 904, and the noised tokens 906. Moreover, FIG. 9 shows the generative AI digital visual system 102 utilizing the diffusion transformer model 908 to generate the denoised tokens 912.

In one or more embodiments, the generative AI digital visual system 102 generates the denoised tokens 912 (e.g., denoised training tokens) from the noised tokens using the diffusion transformer model 908 (e.g., single stream transformer). Specifically, the denoised tokens 912 refer to a clean version of the data from which the added noise has been removed. For instance, over a number of denoising timesteps (e.g., transformer blocks), the generative AI digital visual system 102 utilizes the diffusion transformer model 908 to remove the noise from the noised tokens according to the set of anchor tokens 902.

As further shown, the generative AI digital visual system 102 also uses a detokenization model 914 (e.g., the detokenization model 602 described above in FIG. 6) to generate denoised embeddings 916 (e.g., denoised training embeddings). As shown, the generative AI digital visual system 102 compares the denoised embeddings 916 with embeddings 918. Specifically, the embeddings 918 originate from the generative AI digital visual system 102 initially utilizing an encoder to generate embeddings from a sequence of frames of a training digital video (e.g., as shown in FIG. 7). In other words, the generative AI digital visual system 102 compares the pre-tokenized form of the sequence of frames of a training digital video with the denoised embeddings 916 to determine a level of accuracy of the diffusion transformer model 908.

As shown in FIG. 9, the generative AI digital visual system 102 generates a measure of loss 920 from comparing the denoised embeddings 916 and the embeddings 918. In one or more embodiments, the generative AI digital visual system 102 determines the measure of loss 920 by comparing a similarity between a predicted embedding (e.g., denoised embeddings) and a ground truth embedding. Specifically, the generative AI digital visual system 102 determines a mean squared error (MSE) loss to measure an average squared difference between corresponding elements of a predicted embedding and a ground truth embedding. For instance, the goal of MSE loss is to minimize the error between a prediction and a ground truth.
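
As a minimal Python sketch of the MSE loss described above, the following function averages the squared differences between corresponding elements of a predicted embedding and a ground truth embedding. The list-of-rows representation and the name `mse_loss` are assumptions for illustration.

```python
def mse_loss(predicted, target):
    """Mean squared error between two embeddings given as lists of rows."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same number of rows")
    squared_errors = []
    for predicted_row, target_row in zip(predicted, target):
        if len(predicted_row) != len(target_row):
            raise ValueError("embedding rows must have the same length")
        squared_errors.extend(
            (p - t) ** 2 for p, t in zip(predicted_row, target_row)
        )
    return sum(squared_errors) / len(squared_errors)
```

A loss of zero indicates that the denoised embeddings match the ground truth embeddings exactly, and larger values indicate a larger average squared difference.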

Turning to FIG. 10, additional detail will now be provided regarding various components and capabilities of the generative AI digital visual system 102. In particular, FIG. 10 illustrates an example schematic diagram of a computing device 1000 (e.g., the server(s) 104 and/or the client device 110) implementing the generative AI digital visual system 102 in accordance with one or more embodiments of the present disclosure for components 1000-1012. As illustrated in FIG. 10, the generative AI digital visual system 102 includes a frame anchoring system 105, a diffusion transformer model manager 1002, an image-to-video request manager, an anchor token manager 1006, a digital video manager 1010, and a storage manager 1012.

The diffusion transformer model manager 1002 generates denoised tokens. For example, the diffusion transformer model manager 1002 utilizes a streamlined architecture without additional modulation or conditioning layers to process an input in a single stream manner. Specifically, the diffusion transformer model manager 1002 manages the training and optimization of a diffusion transformer model. For instance, the diffusion transformer model manager 1002 receives a training digital video, generates various embeddings/tokens, and further generates a measure of loss to modify parameters of a diffusion transformer model.

The image-to-video request manager 1004 receives one or more media requests from a client device. For example, the image-to-video request manager 1004 provides a graphical user interface to a client device to input data for an image-to-video request. In one or more embodiments, the image-to-video request manager 1004 provides options for a client device to upload one or more digital images and input text describing an unconditional and/or conditional prompt, and further allows a client device to adjust preset parameters (e.g., digital video parameters such as camera angles, lighting, speed, frame rate, etc.). Moreover, in one or more embodiments, the image-to-video request manager 1004 passes this data to additional components.

The anchor token manager 1006 generates a set of anchor tokens. For example, the anchor token manager 1006 receives a digital image from the image-to-video request manager 1004 and further determines to not add noise to the digital image. Further, the anchor token manager 1006 generates an embedding from the received digital image, and further tokenizes the embedding (e.g., generates a set of image tokens). Moreover, the anchor token manager 1006 adds a timestep embedding to the set of image tokens to indicate that the set of image tokens are fully denoised. In doing so, the anchor token manager 1006 indicates to the diffusion transformer model manager 1002 to use the set of anchor tokens as a guide in removing noise from noised tokens.
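
To illustrate how adding a timestep embedding for t=0 could mark image tokens as fully denoised, a minimal Python sketch follows. The sinusoidal form of the embedding, the `max_period` value, the requirement of an even token dimension, and the names `timestep_embedding` and `make_anchor_tokens` are assumptions; the disclosure does not state which embedding function is used.

```python
import math


def timestep_embedding(timestep, dim, max_period=10000.0):
    """Sinusoidal embedding of a timestep (assumed form; dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(timestep * f) for f in freqs]
    cosines = [math.cos(timestep * f) for f in freqs]
    return sines + cosines


def make_anchor_tokens(image_tokens):
    """Add the t = 0 timestep embedding to every (un-noised) image token."""
    embedding = timestep_embedding(0, len(image_tokens[0]))
    return [
        [value + offset for value, offset in zip(token, embedding)]
        for token in image_tokens
    ]
```

Under these assumptions, the embedding for t=0 is a fixed vector of zeros followed by ones, so every anchor token carries the same marker while the noised tokens would carry the embedding of their own, nonzero timestep.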

The denoised token manager 1008 generates denoised tokens. For example, the denoised token manager 1008 uses a diffusion transformer model to process the set of anchor tokens and noised tokens. Furthermore, the denoised token manager 1008 utilizes multiple transformer blocks of a diffusion transformer model to remove noise from noised tokens to generate a set of denoised tokens.

The digital video manager 1010 generates a digital video. For example, the digital video manager 1010 generates a digital video from denoised tokens. Specifically, the digital video manager 1010 detokenizes denoised tokens and further utilizes a decoder to create a digital video from embeddings (e.g., denoised embeddings). Furthermore, the digital video manager 1010 causes a graphical user interface of a client device to display the generated digital video.

The storage manager 1012 stores various components generated by the generative AI digital visual system 102. For example, the storage manager 1012 stores model parameters (e.g., initial parameters and modified parameters) for a diffusion transformer model, prompts (e.g., visual and textual), generated digital videos in response to prompts, anchor tokens, noised tokens, training digital videos, embeddings, measures of loss, and additional training/initiation data for preparing a diffusion transformer model to generate a digital video from a digital image.

Each of the components 1002-1012 of the generative AI digital visual system 102 can include software, hardware, or both. For example, the components 1002-1012 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices, such as a client device or server device. When executed by the one or more processors, the computer-executable instructions of the generative AI digital visual system 102 can cause the computing device(s) to perform the methods described herein. Alternatively, the components 1002-1012 can include hardware, such as a special-purpose processing device to perform a certain function or group of functions. Alternatively, the components 1002-1012 of the generative AI digital visual system 102 can include a combination of computer-executable instructions and hardware.

Furthermore, the components 1002-1012 of the generative AI digital visual system 102 may, for example, be implemented as one or more operating systems, as one or more stand-alone applications, as one or more modules of an application, as one or more plug-ins, as one or more library functions or functions that may be called by other applications, and/or as a cloud-computing model. Thus, the components 1002-1012 of the generative AI digital visual system 102 may be implemented as a stand-alone application, such as a desktop or mobile application. Furthermore, the components 1002-1012 of the generative AI digital visual system 102 may be implemented as one or more web-based applications hosted on a remote server. Alternatively, or additionally, the components 1002-1012 of the generative AI digital visual system 102 may be implemented in a suite of mobile device applications or “apps.” For example, in one or more embodiments, the generative AI digital visual system 102 can comprise or operate in connection with digital software applications such as ADOBE® FIREFLY, ADOBE® AFTER EFFECTS CC, ADOBE® PREMIERE RUSH, and/or ADOBE® PREMIERE PRO CC.

FIGS. 1-10, the corresponding text, and the examples provide a number of different methods, systems, devices, and non-transitory computer-readable media of the components 1002-1012. In addition to the foregoing, one or more embodiments can also be described in terms of flowcharts comprising acts for accomplishing the particular result, as shown in FIG. 11. FIG. 11 may be performed with more or fewer acts. Further, the acts may be performed in different orders. Additionally, the acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar acts.

FIG. 11 illustrates a flowchart of a series of acts 1100 for generating a digital video based on denoised tokens in accordance with one or more embodiments. While FIG. 11 illustrates acts according to one embodiment, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 11. In some implementations, the acts of FIG. 11 are performed as part of a method. For example, in one or more embodiments, the acts of FIG. 11 are performed as part of a computer-implemented method. Alternatively, a non-transitory computer-readable medium can store instructions thereon that, when executed by at least one processor, cause a computing device to perform the acts of FIG. 11. In one or more embodiments, a system performs the acts of FIG. 11. For example, in one or more embodiments, a system includes at least one memory device. The system further includes at least one server device configured to cause the system to perform the acts of FIG. 11.

The series of acts 1100 includes an act 1102 of generating a set of image tokens from a digital image. Further, the series of acts 1100 includes an act 1104 of generating a set of anchor tokens from a set of image tokens. Further, the series of acts 1100 includes an act 1105 of generating combined tokens from the set of anchor tokens and noised tokens. Moreover, the series of acts 1100 includes an act 1106 of generating denoised tokens from noised tokens and the set of anchor tokens. Further, the series of acts 1100 includes an act 1108 of generating a digital video based on the denoised tokens.

In particular, the act 1102 includes generating, from an image-to-video request comprising a digital image, a set of image tokens from the digital image. Further, the act 1104 includes generating a set of anchor tokens from the set of image tokens by adding a timestep embedding to the set of image tokens that indicates that the set of image tokens are fully denoised. Further, the act 1105 includes generating combined tokens from the set of anchor tokens and noised tokens that are generated from noise. Moreover, the act 1106 includes generating, utilizing a diffusion transformer model to process the combined tokens, denoised tokens. Furthermore, the act 1108 includes generating a digital video comprising at least a portion of the digital image based on the denoised tokens.
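As an illustration only, the anchor-token and combined-token acts can be sketched in a few lines of NumPy. The sinusoidal form of the timestep embedding, the choice of `t = 0` to mean "fully denoised", and combining by concatenation along the sequence axis are all assumptions made for this sketch; the function names are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (an assumed, common choice)."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def make_anchor_tokens(image_tokens, clean_t=0.0):
    """Add the 'fully denoised' timestep embedding to every image token.

    Assumption: timestep 0 denotes a token with no noise left in it.
    """
    dim = image_tokens.shape[-1]
    return image_tokens + timestep_embedding(clean_t, dim)

def combine_tokens(anchor_tokens, noised_tokens):
    """Assumed combination rule: concatenate along the sequence axis."""
    return np.concatenate([anchor_tokens, noised_tokens], axis=0)

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(4, 8))    # 4 tokens from the input image
noised_tokens = rng.normal(size=(12, 8))  # 12 tokens initialized from noise
anchor_tokens = make_anchor_tokens(image_tokens)
combined = combine_tokens(anchor_tokens, noised_tokens)
print(combined.shape)
```

In this sketch the combined sequence is what a diffusion transformer model would receive; the model itself is outside the scope of the example.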

For example, in one or more embodiments, the series of acts 1100 includes receiving, from a client device at inference time, the image-to-video request to generate the digital video comprising the digital image and the image-to-video request indicates that the digital image is to be portrayed in the digital video. In addition, in one or more embodiments, the series of acts 1100 includes wherein the image-to-video request indicates that the digital image is to be included as a first frame of a sequence of frames, an intermediate frame of the sequence of frames, or a final frame of the sequence of frames.

Moreover, in one or more embodiments, the series of acts 1100 includes generating, from the digital image of the image-to-video request, an embedding that represents the digital image. Further, in one or more embodiments, the series of acts 1100 includes generating, utilizing a tokenization model to break down the digital image into a plurality of image patches, the set of image tokens from the embedding.

Moreover, in one or more embodiments, the series of acts 1100 includes initializing the noised tokens by sampling a random level of noise from a noise distribution. Further, in one or more embodiments, the series of acts 1100 includes removing noise from the noised tokens according to the set of anchor tokens utilizing the diffusion transformer model.

Moreover, in one or more embodiments, the series of acts 1100 includes generating a sequence of frames from the denoised tokens, wherein the digital video includes the digital image as at least one of a portion of a frame of the sequence of frames, one or more keyframes in the sequence of frames, or one or more motion frames in the sequence of frames.

Additionally, in one or more embodiments, the series of acts 1100 includes generating, from a frame of a sequence of training frames, a training embedding that represents the frame. Moreover, in one or more embodiments, the series of acts 1100 includes generating, utilizing a tokenization model, a set of training tokens from the training embedding. Further, in one or more embodiments, the series of acts 1100 includes generating, utilizing the tokenization model, noised training tokens from the sequence of training frames that does not include the frame.

Furthermore, in one or more embodiments, the series of acts 1100 includes generating training anchor tokens by concatenating timestep embeddings to the set of training tokens to indicate that the set of training tokens are fully denoised. Moreover, in one or more embodiments, the series of acts 1100 includes generating, utilizing the diffusion transformer model to process the training anchor tokens and the noised training tokens, denoised training tokens.

Moreover, in one or more embodiments, the series of acts 1100 includes generating, utilizing a detokenization model, denoised training embeddings from the denoised training tokens. Further, in one or more embodiments, the series of acts 1100 includes comparing the denoised training embeddings with embeddings generated from the sequence of training frames prior to tokenization.

Moreover, in one or more embodiments, the series of acts 1100 includes determining a measure of loss from comparing the denoised training embeddings with the embeddings generated from the sequence of training frames prior to tokenization to modify parameters of the diffusion transformer model. In addition, in one or more embodiments, the series of acts 1100 includes generating, from a first pass of the combined tokens through the diffusion transformer model, a conditional token output from the denoised tokens. Further, in one or more embodiments, the series of acts 1100 includes generating, from a second pass of additional combined tokens through the diffusion transformer model, an unconditional token output from additional denoised tokens. Moreover, in one or more embodiments, the series of acts 1100 includes generating a final token output by combining the conditional token output and the unconditional token output. Further, in one or more embodiments, the series of acts 1100 includes generating the digital video comprising at least the portion of the digital image based on the final token output.

Further, in one or more embodiments, the series of acts 1100 includes generating a set of image tokens from a digital image as part of an image-to-video request. Moreover, in one or more embodiments, the series of acts 1100 includes generating, from a first pass of combined tokens through a trained diffusion transformer model, a conditional token output, wherein the combined tokens comprise a set of anchor tokens from the set of image tokens and noised tokens. Further, in one or more embodiments, the series of acts 1100 includes generating, from a second pass of additional combined tokens through the trained diffusion transformer model, an unconditional token output, wherein the additional combined tokens comprise the set of anchor tokens and additional noised tokens. Moreover, in one or more embodiments, the series of acts 1100 includes generating a final token output by combining the conditional token output and the unconditional token output. Further, in one or more embodiments, the series of acts 1100 includes generating a digital video comprising at least a portion of the digital image based on the final token output.

Moreover, in one or more embodiments, the series of acts 1100 includes receiving, from a client device at inference time, the image-to-video request to generate the digital video and a text prompt that indicates that the digital image is to be portrayed in the digital video.

Further, in one or more embodiments, the series of acts 1100 includes receiving, from a client device, the image-to-video request that comprises a conditional prompt for the trained diffusion transformer model to include in the digital video and an unconditional prompt for the digital video. Moreover, in one or more embodiments, the series of acts 1100 includes wherein the conditional prompt indicates that the digital image is to be portrayed as at least one of an initial frame, an intermediate frame, or a subset of frames in the digital video.

Moreover, in one or more embodiments, the series of acts 1100 includes generating text tokens from a conditional prompt of a text prompt of the image-to-video request. Further, in one or more embodiments, the series of acts 1100 includes generating an image embedding from the digital image. Moreover, in one or more embodiments, the series of acts 1100 includes initializing the noised tokens from a noise distribution. Further, in one or more embodiments, the series of acts 1100 includes combining the text tokens, the image embedding, and the noised tokens.

Moreover, in one or more embodiments, the series of acts 1100 includes generating the set of anchor tokens from the image embedding by generating, utilizing a tokenization model to break down the digital image into a plurality of image patches, the set of image tokens from the image embedding. Further, in one or more embodiments, the series of acts 1100 includes adding a timestep embedding to the set of image tokens from the image embedding, wherein the timestep embedding indicates to the trained diffusion transformer model that the set of image tokens are fully denoised.

Moreover, in one or more embodiments, the series of acts 1100 includes generating text tokens from an unconditional prompt of a text prompt of the image-to-video request. Further, in one or more embodiments, the series of acts 1100 includes generating an image embedding from the digital image. Moreover, in one or more embodiments, the series of acts 1100 includes initializing the noised tokens from a noise distribution. Further, in one or more embodiments, the series of acts 1100 includes combining the text tokens, the image embedding, and the noised tokens.

Moreover, in one or more embodiments, the series of acts 1100 includes interpolating, utilizing a classifier free guidance model, between the conditional token output and the unconditional token output, wherein interpolating comprises a guidance scale that encourages the trained diffusion transformer model to generate the digital video based on the conditional token output. Further, in one or more embodiments, the series of acts 1100 includes generating the final token output based on the interpolation of the classifier free guidance model.
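The interpolation between the conditional and unconditional token outputs can be illustrated with the commonly used classifier-free guidance rule. The disclosure does not state the exact formula, so the linear form below, the function name, and the sample values are assumptions for illustration.

```python
import numpy as np

def classifier_free_guidance(cond_out, uncond_out, guidance_scale):
    """Commonly used guidance rule (assumed here, not quoted from the disclosure).

    guidance_scale = 0 returns the unconditional output, 1 returns the
    conditional output, and values above 1 push further toward the
    conditional output.
    """
    return uncond_out + guidance_scale * (cond_out - uncond_out)

cond = np.array([1.0, 2.0, 3.0])    # output of the conditional pass
uncond = np.array([0.0, 1.0, 1.0])  # output of the unconditional pass
final = classifier_free_guidance(cond, uncond, guidance_scale=2.0)
print(final)
```

With a scale of 2.0, each element of the final output lies as far beyond the conditional value as the conditional value lies from the unconditional one.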

Moreover, in one or more embodiments, the series of acts 1100 includes receiving, from a client device at inference time, the image-to-video request to generate the digital video comprising the digital image. Further, in one or more embodiments, the series of acts 1100 includes wherein the image-to-video request indicates that the digital image is to be included as at least one of a portion of a frame of a sequence of frames, one or more keyframes in the sequence of frames, or one or more motion frames in the sequence of frames.

Moreover, in one or more embodiments, the series of acts 1100 includes training the diffusion transformer model by adding noise to a subset of frames of a sequence of frames of a training video, wherein the subset of frames does not include one or more frames that are anchor frames in the training video.
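A minimal sketch of this training-time step follows: noise is added to every frame except the anchor frames, and each frame is given a timestep that records whether it is clean. The linear blend used for noising, the single shared noise level, and the convention that timestep 0 means "fully denoised" are assumptions of the sketch, and the function name is hypothetical.

```python
import numpy as np

def noise_non_anchor_frames(frame_tokens, anchor_idx, noise_level, rng):
    """Noise all frames except the anchors.

    frame_tokens: array of shape (frames, tokens_per_frame, dim).
    Returns the partially noised tokens and a per-frame timestep, where
    0.0 marks an anchor (clean) frame. The linear blend is an assumption.
    """
    num_frames = frame_tokens.shape[0]
    is_anchor = np.zeros(num_frames, dtype=bool)
    is_anchor[list(anchor_idx)] = True

    noise = rng.normal(size=frame_tokens.shape)
    blended = (1.0 - noise_level) * frame_tokens + noise_level * noise

    # Keep anchor frames untouched; use the blended version elsewhere.
    noised = np.where(is_anchor[:, None, None], frame_tokens, blended)
    timesteps = np.where(is_anchor, 0.0, noise_level)
    return noised, timesteps

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 3, 4))  # 5 frames, 3 tokens each, dim 4
noised, timesteps = noise_non_anchor_frames(frames, [0], 0.7, rng)
print(timesteps)
```

Here the first frame plays the role of the anchor frame, so it passes through unchanged while the remaining four frames are noised.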

Further, in one or more embodiments, the series of acts 1100 includes generating, from a first pass of the combined tokens through the diffusion transformer model, a conditional token output from the denoised tokens, wherein the combined tokens comprise the set of anchor tokens and the noised tokens. Moreover, in one or more embodiments, the series of acts 1100 includes generating, from a second pass of additional combined tokens through the diffusion transformer model, an unconditional token output from additional denoised tokens, wherein the additional combined tokens comprise the set of anchor tokens and additional noised tokens. Further, in one or more embodiments, the series of acts 1100 includes generating a final token output by combining the conditional token output and the unconditional token output. In one or more embodiments, the series of acts 1100 includes generating the digital video comprising at least the portion of the digital image based on the final token output.

FIG. 12 shows an example of a diffusion model 1200 according to aspects of the present disclosure. In some examples, a diffusion model 1200 describes the operation and architecture of the diffusion transformer model (e.g., single stream diffusion transformer model described above) described with reference to FIG. 14. The diffusion model depicted in FIG. 12 is an example of, or includes aspects of, a generative AI digital visual system 102 as described herein. Specifically, a diffusion transformer model combines principles of diffusion models with principles of transformer models. Accordingly, FIG. 12 shows the generative AI digital visual system 102 initializing a trained diffusion transformer model by leveraging a forward diffusion process to destroy data and then creating media from the destroyed data using a denoising process. In other words, the generative AI digital visual system 102 teaches a diffusion transformer model to create generative content from noise using a forward diffusion process and a denoising process.

As an example, diffusion models are generative models that operate by progressively destroying/noising an input signal and learning to reverse the destroyed data to generate new samples. In particular, diffusion models use a forward diffusion process to add noise over a series of timesteps and a reverse diffusion process to remove noise over a number of timesteps corresponding to the forward number of steps.

As a further example, transformer models are designed for sequence-to-sequence modeling tasks (e.g., tasks that generate based on a sequential order, such as generative tasks that leverage tokens). As mentioned above, transformer models typically include self-attention mechanisms, and multi-layer perceptrons (e.g., feedforward networks to move the data through a transformer architecture). Moreover, transformer architecture typically considers positional information such as the spatial-temporal positional encodings discussed above.

With the context of diffusion models and transformer models discussed above, additional details of how diffusion transformer models meld principles from diffusion models and transformer models together are provided herein. Diffusion transformer models are a class of generative neural networks which can be trained to generate new data with features similar to features found in training data. In particular, diffusion transformer models can be used to generate novel media items such as images, audio files, videos, three-dimensional (3D) models or other digital media items. Diffusion transformer models can be used for various media processing tasks including image super-resolution, generation of media items with perceptual metrics, image inpainting, and media manipulation. In particular, diffusion transformer models differ from existing diffusion model architectures in that they combine transformer architecture with diffusion principles of removing noise from noised tokens. Specifically, the architecture of a diffusion transformer model in the present disclosure includes a self-attention layer and a multi-layer perceptron. In one or more embodiments, the diffusion transformer model does not include conditioning inputs; rather, position encodings and other (clean) tokens (such as anchor tokens) are included with noised tokens as guidance for how a transformer block should remove noise from a noised token.

In one or more embodiments, the diffusion transformer models leverage the architecture of a transformer model to capture long-range dependencies and complex structures in high-dimensional data. Specifically, the diffusion transformer models operate by processing token data of images and text to fully consider the long-range dependencies. Moreover, the diffusion transformer models use the transformer architecture to predict the denoised data at each timestep (e.g., transformer block) and, as discussed above, apply a self-attention mechanism to the noised data to understand how noise should be removed across various noised input tokens.

As discussed in detail above, the generative AI digital visual system 102 utilizes a diffusion transformer model (rather than UNet diffusion architecture) where the generative AI digital visual system 102 leverages encoders (e.g., VAE encoders) to abstract pixel details into latent representations (e.g., embeddings). For instance, the generative AI digital visual system 102 utilizes encoders to abstract pixel data into semantic information which is adaptable for use in a transformer architecture (e.g., a transformer architecture captures global context through attention from the latent representations).

Moreover, in one or more embodiments, rather than injecting diffusion information through an adaLN modulation, the generative AI digital visual system 102 designs a diffusion transformer model in a single stream manner. In other words, the generative AI digital visual system 102 utilizes a diffusion transformer model with inputs flowing in and outputs flowing out in a single stream. Thus, in one or more embodiments, the generative AI digital visual system 102 does not utilize adaLN modulation for conditioning inputs, and directly feeds positional encodings, anchor tokens, and/or other encoding information (e.g., token-level diffusion timestep embedding) into a self-attention layer along with noised tokens.

In one or more embodiments, methods of operating diffusion models include a Denoising Diffusion Probabilistic Model (DDPM) and a Denoising Diffusion Implicit Model (DDIM). In DDPM, the generative process includes reversing a stochastic Markov diffusion process. DDIMs, on the other hand, use a deterministic process so that the same input results in the same output. In some cases, DDIM can reduce the number of timesteps during media generation. Diffusion models may also be characterized by whether the noise is added to the media item itself, or to media features generated by an encoder (i.e., latent diffusion). In a pixel diffusion model, noise is added and removed in pixel space. In a latent diffusion model, the noise is added (and removed) in a latent space of media features rather than in pixel space. Thus, a latent diffusion model generates media features using reverse diffusion, and these media features can be decoded to obtain a synthetic media item.

In one or more embodiments, the generative AI digital visual system 102 utilizes a diffusion process to add noise to data 1205, that is, to media that has been transformed from a pixel space 1210 to a latent space. For instance, the generative AI digital visual system 102 transforms the data 1205 to the latent space, adds noise to the data 1205, and then denoises noised data 1220 using transformer blocks (e.g., removes noise from the noised tokens to obtain a synthetic media item). Specifically, FIG. 12 shows the data 1220 (e.g., a visual prompt) being processed by an encoder 1212 (e.g., to generate embeddings and then to further generate tokens).

Furthermore, FIG. 12 shows the generative AI digital visual system 102 utilizing a forward diffusion process 1215 to add noise to the data 1205. Moreover, FIG. 12 shows the generative AI digital visual system 102 utilizing a denoising process 1225 to remove noise from noised data 1220. For instance, FIG. 12 shows the generative AI digital visual system 102 utilizing a decoder 1229 to generate media 1230. Further, in one or more embodiments, the generative AI digital visual system 102 adds noise to data in a progressive manner (e.g., over a number of timesteps corresponding to a number of transformer blocks). In doing so, the generative AI digital visual system 102 trains a diffusion transformer model to create generative content from destroyed data (e.g., the noised data 1220).

As just mentioned, FIG. 12 shows the data 1220 being processed by the encoder 1212. As is discussed in some detail above, the diffusion transformer model operates by processing token-level data. To do so, the generative AI digital visual system 102 uses the encoder 1212 to generate embedding(s) of the data 1220 and further transforms the data 1220 into tokens. For instance, the transformation process is referred to as tokenization or patchification. Specifically, tokenization/patchification starts with an image with dimensions of H×W×C, where H is the height, W is the width, and C is the number of color channels (3 for RGB images).

In one or more embodiments, the generative AI digital visual system 102 uses patchification to divide the image into patches (e.g., partition the image into non-overlapping patches of size P×P) where each patch contains P×P×C pixel values. Further, for an image of size H×W, the total number of patches will be (H/P)×(W/P). Furthermore, the generative AI digital visual system 102 flattens each image patch into a one-dimensional vector to create a sequence of patch embeddings analogous to text tokens (e.g., in context of natural language processing).

Furthermore, the generative AI digital visual system 102 generates an image token by performing a linear projection to transform an image patch into a fixed-dimensional embedding (d-dimensional vector). In particular, the generated image tokens act as input tokens for the diffusion transformer model.
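The patchification and linear projection described above can be sketched as follows. The random projection weights stand in for learned parameters, and the function names and the 8×8 RGB example image are assumptions chosen for illustration.

```python
import numpy as np

def patchify(image, patch):
    """Split an H x W x C image into non-overlapping, flattened P x P patches.

    Returns an array of shape ((H/P) * (W/P), P * P * C).
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def project(patches, weight, bias):
    """Linear projection of each flattened patch to a d-dimensional token."""
    return patches @ weight + bias

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))            # H = W = 8, C = 3
patches = patchify(image, patch=4)       # (8/4) * (8/4) = 4 patches of 48 values
weight = rng.normal(size=(4 * 4 * 3, 16))  # stand-in for learned weights, d = 16
bias = np.zeros(16)
tokens = project(patches, weight, bias)
print(patches.shape, tokens.shape)
```

The patches are emitted in row-major order over the patch grid, so the first token corresponds to the top-left P×P region of the image.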

FIG. 13 shows an example of a method 1300 for media generation according to aspects of the present disclosure. In some examples, method 1300 describes an operation of the diffusion transformer model such as an application of the diffusion model 1200 described with reference to FIG. 12. In some examples, these operations are performed by a system including a processor executing a set of codes to control functional elements of an apparatus such as the generative AI digital visual system 102 described above.

Additionally, or alternatively, steps of the method 1300 may be performed using special-purpose hardware. Generally, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein are composed of various substeps or are performed in conjunction with other operations.

At operation 1305, a user provides a text and/or visual prompt describing content to be included in a generated media item. For example, a user may provide the prompt “a person playing with a cat”. In some examples, guidance can be provided in a form other than text, such as via an image (e.g., a visual prompt), a sketch, an audio input, or a layout.

At operation 1310, the system converts the text prompt (or other prompt guidance) into tokens or other multi-dimensional representation compatible with a single stream diffusion transformer model. For example, text may be converted into a vector or a series of vectors using a transformer model, or a multi-modal encoder. In some cases, the encoder for the generation of tokens is trained independently of the diffusion model (e.g., via a trained dual-VAE model, which is described in DUAL-VAE FOR MORE EFFICIENT AND EFFECTIVE DIFFUSION MODEL TRAINING, and is incorporated by reference above).

At operation 1315, a noise map is initialized that includes random noise. The noise map may be in a pixel space or a latent space. By initializing a media item with random noise, different variations of a media item including the content described by the prompt can be generated. At operation 1320, the system generates a media item based on the noise map, tokens from the prompt (e.g., text prompt and/or visual prompt), and additional spatial-temporal positional encodings.

FIG. 14 shows a diffusion process 1400 according to aspects of the present disclosure. Specifically, FIG. 14 provides additional details of operating principles for a diffusion model. Accordingly, FIG. 14 provides context and details for the principles borrowed from diffusion models to help operate diffusion transformer models. In some examples, diffusion process 1400 describes an operation of the diffusion transformer model, such as the denoising process of a diffusion model 1200 described with reference to FIG. 12.

As described above with reference to FIG. 12, using a diffusion transformer model can involve a process for initializing noise (e.g., generating noised tokens in a latent space) and a denoising process 1410 for denoising the noised tokens to obtain denoised tokens. The denoising process 1410 can be represented as $p(x_{t-1} \mid x_t)$. In some cases, a neural network is trained to perform the denoising process 1410 (i.e., to successively remove the noise).

In an example forward process for a latent diffusion model, the model maps an observed variable $x_0$ (an embedding in a latent space) to intermediate variables $x_1, \ldots, x_T$ using a Markov chain. The Markov chain gradually adds Gaussian noise to the data (e.g., the embedding, such as a visual signal) to obtain the approximate posterior $q(x_{1:T} \mid x_0)$ as the latent variables are passed through a neural network such as a diffusion transformer model, where $x_1, \ldots, x_T$ have the same dimensionality as $x_0$.
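To make the forward Markov chain concrete, the sketch below adds Gaussian noise step by step and keeps every intermediate variable. The variance-preserving update with a per-step variance `beta` follows the usual DDPM convention, which is an assumption here because the disclosure does not give the transition explicitly; the linear schedule values are likewise illustrative.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Run the forward chain x_0 -> x_1 -> ... -> x_T.

    Assumed transition (DDPM convention):
        x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, I)
    Returns the list [x_0, x_1, ..., x_T]; every entry has the shape of x_0.
    """
    xs = [x0]
    x = x0
    for beta in betas:
        eps = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
        xs.append(x)
    return xs

rng = np.random.default_rng(0)
x0 = rng.normal(size=(6, 8))           # e.g., 6 latent tokens of dimension 8
betas = np.linspace(1e-4, 0.02, 1000)  # illustrative linear schedule
chain = forward_diffusion(x0, betas, rng)
print(len(chain), chain[-1].shape)
```

Each element of the returned chain has the same dimensionality as `x0`, matching the statement above about the intermediate variables.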

The neural network may be trained to perform the denoising process 1410. During the denoising process, the model begins with noisy data $x_T$, such as a noisy token, and denoises the data to obtain $p(x_{t-1} \mid x_t)$. At each step $t-1$, the denoising process 1410 takes $x_t$, such as a first intermediate denoised token, spatial-temporal positional encodings, and tokens (e.g., representing a prompt). Here, $t$ represents a transformer block in a sequence of transformer blocks associated with different noise levels. The denoising process 1410 outputs $x_{t-1}$, such as a second intermediate denoised token, iteratively until $x_T$ reverts back to $x_0$, a completely denoised token. The denoising process can be represented as:

$$p(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$

Moreover, for noised tokens generated by adding noise to data, the joint probability of a sequence of samples in the Markov chain can be written as a product of conditionals and the marginal probability:

$$p(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p(x_{t-1} \mid x_t)$$

where $p(x_T) = \mathcal{N}(x_T; 0, I)$ is the pure noise distribution, as the reverse process takes the outcome of the forward process, a sample of pure noise, as input, and

$$\prod_{t=1}^{T} p(x_{t-1} \mid x_t)$$

represents a sequence of Gaussian transitions corresponding to a sequence of addition of Gaussian noise to the sample.
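A chain of Gaussian transitions of this kind can be sampled one step at a time. The sketch below assumes a hypothetical `mean_fn` standing in for the trained network's predicted mean and a fixed list of per-step standard deviations; neither is specified by the disclosure, and the toy mean function used in the example simply halves its input.

```python
import numpy as np

def reverse_process(x_T, mean_fn, sigmas, rng):
    """Sample x_{T-1}, ..., x_0 from Gaussian transitions, starting at x_T.

    mean_fn(x_t, t) is a hypothetical stand-in for the network's predicted
    mean; sigmas[t - 1] is the assumed standard deviation of step t.
    No noise is added on the final step (t = 1).
    """
    x = x_T
    num_steps = len(sigmas)
    for t in range(num_steps, 0, -1):
        mean = mean_fn(x, t)
        eps = rng.normal(size=x.shape) if t > 1 else np.zeros_like(x)
        x = mean + sigmas[t - 1] * eps
    return x

rng = np.random.default_rng(0)
x_T = rng.normal(size=(6, 8))  # pure noise sample

def toy_mean(x, t):
    # Toy mean for illustration only: shrink toward zero.
    return 0.5 * x

x_0 = reverse_process(x_T, toy_mean, sigmas=[0.1, 0.1, 0.1], rng=rng)
print(x_0.shape)
```

Setting every entry of `sigmas` to zero makes the procedure deterministic, which is a convenient way to check the loop.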

At inference time, observed data $x_0$ in a pixel space can be mapped into a latent space as input, and generated data $\tilde{x}$ is mapped back into the pixel space from the latent space as output (e.g., using a decoder of a trained dual-VAE model). In some examples, $x_0$ represents an original clean token, latent variables $x_1, \ldots, x_T$ represent noisy tokens, and $\tilde{x}$ represents the generated item with high quality.

FIG. 15 is a flow diagram depicting an algorithm as a step-by-step procedure 1500 in an example implementation of operations performable for training a machine-learning model. In one or more embodiments, the procedure 1500 describes an operation of the training component described for configuring a diffusion transformer model. The procedure 1500 provides one or more examples of generating training data, use of the training data to train a machine-learning model, and use of the trained machine-learning model to perform a task.

To begin in this example, a machine-learning system collects training data (block 1502) that is to be used as a basis to train a machine-learning model, i.e., which defines what is being modeled. The training data is collectable by the machine-learning system from a variety of sources. Examples of training data sources include public datasets, service provider system platforms that expose application programming interfaces (e.g., social media platforms), user data collection systems (e.g., digital surveys and online crowdsourcing systems), and so forth. Training data collection may also include data augmentation and synthetic data generation techniques to expand and diversify available training data, balancing techniques to balance a number of positive and negative examples, and so forth.

The machine-learning system is also configurable to identify features that are relevant (block 1504) to a type of task for which the machine-learning model is to be trained. Task examples include classification, natural language processing, generative artificial intelligence, recommendation engines, reinforcement learning, clustering, and so forth. To do so, the machine-learning system collects the training data based on the identified features and/or filters the training data based on the identified features after collection. The training data is then utilized to train a machine-learning model.

In order to train the machine-learning model in the illustrated example, the machine-learning model is first initialized (block 1506). Initialization of the machine-learning model includes selecting a model architecture (block 1508) to be trained. Examples of model architectures include neural networks, diffusion transformer models, transformer models, diffusion models, convolutional neural networks (CNNs), long short-term memory (LSTM) neural networks, generative adversarial networks (GANs), decision trees, support vector machines, linear regression, logistic regression, Bayesian networks, random forest learning, dimensionality reduction algorithms, boosting algorithms, deep learning neural networks, etc.

A loss function is also selected (block 1510). The loss function is utilized to measure a difference between an output of the machine-learning model (i.e., predictions) and target values (e.g., as expressed by the training data) to be used to train the machine-learning model. Additionally, an optimization algorithm 1512 is selected that is to be used in conjunction with the loss function to optimize parameters of the machine-learning model during training, examples of which include gradient descent, stochastic gradient descent (SGD), and so forth.

Initialization of the machine-learning model further includes setting initial values (block 1516) of the machine-learning model (block 1514), examples of which include initializing weights and biases of nodes to improve efficiency in training and computational resources consumption as part of training. Hyperparameters are also set that are used to control training of the machine learning model, examples of which include regularization parameters, model parameters (e.g., a number of layers in a neural network), learning rate, batch sizes selected from the training data, and so on. The hyperparameters are set using a variety of techniques, including use of a randomization technique, through use of heuristics learned from other training scenarios, and so forth.

The machine-learning model is then trained using the training data (block 1518) by the machine-learning system. A machine-learning model refers to a computer representation that can be tuned (e.g., trained and retrained) based on inputs of the training data to approximate unknown functions. In particular, the term machine-learning model can include a model that utilizes algorithms (e.g., using the model architectures described above) to learn from, and make predictions on, known data by analyzing training data to learn and relearn to generate outputs that reflect patterns and attributes expressed by the training data.

Examples of training types include supervised learning that employs labeled data, unsupervised learning that involves finding underlying structures or patterns within the training data, reinforcement learning based on optimization functions (e.g., rewards and/or penalties), use of nodes as part of “deep learning,” and so forth. The machine-learning model, for instance, is configurable as including a plurality of nodes that collectively form a plurality of layers. The layers, for instance, are configurable to include an input layer, an output layer, and one or more hidden layers. Calculations are performed by the nodes within the layers via the hidden states through a system of weighted connections that are “learned” during training, e.g., through use of the selected loss function and backpropagation to optimize performance of the machine-learning model to perform an associated task.

As part of training the machine-learning model, a determination is made as to whether a stopping criterion is met (decision block 1520), i.e., which is used to validate the machine-learning model. The stopping criterion is usable to reduce overfitting of the machine-learning model, reduce computational resource consumption, and promote an ability of the machine-learning model to address previously unseen data, i.e., that is not included specifically as an example in the training data. Examples of a stopping criterion include but are not limited to a predefined number of epochs, validation loss stabilization, achievement of a performance improvement threshold, whether a threshold level of accuracy has been met, or based on performance metrics such as precision and recall. If the stopping criterion has not been met (“no” from decision block 1520), the procedure 1500 continues training of the machine-learning model using the training data (block 1518) in this example.
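The train-then-check loop described above can be sketched with a deliberately small model. The example below uses linear regression trained by gradient descent on a mean-squared-error loss and stops on either of two of the listed criteria: a predefined number of epochs or validation loss stabilization. The model, the hyperparameter values, and the function name are assumptions chosen to keep the sketch short; they are not the training setup of the disclosure.

```python
import numpy as np

def train(x_train, y_train, x_val, y_val, lr=0.1, max_epochs=500,
          patience=5, tol=1e-6):
    """Gradient descent with a stopping criterion checked every epoch.

    Stops when validation loss has not improved by more than `tol` for
    `patience` consecutive epochs, or after `max_epochs` epochs.
    """
    w = np.zeros(x_train.shape[1])          # initial parameter values
    best_val, stale = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        # Gradient of the mean-squared-error loss with respect to w.
        grad = 2.0 * x_train.T @ (x_train @ w - y_train) / len(y_train)
        w -= lr * grad
        val_loss = float(np.mean((x_val @ w - y_val) ** 2))
        if val_loss < best_val - tol:
            best_val, stale = val_loss, 0
        else:
            stale += 1
        if stale >= patience:               # stopping criterion met
            break
    return w, epoch, best_val

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
x_train = rng.normal(size=(200, 2))
x_val = rng.normal(size=(50, 2))
w, epochs_run, best_val = train(x_train, x_train @ true_w, x_val, x_val @ true_w)
print(epochs_run, best_val)
```

On this noise-free toy problem the validation loss flattens long before the epoch limit, so training ends through the stabilization branch.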

If the stopping criterion is met (“yes” from decision block 1520), the trained machine-learning model is then utilized to generate an output based on subsequent data (block 1522). The trained machine-learning model, for instance, is trained to perform a task as described above and therefore once trained is configured to perform that task based on subsequent data received as an input and processed by the machine-learning model.
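The train/check/continue loop described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `train_until_stopped`, its parameters (`patience`, `min_delta`), and the two callables are hypothetical, and only two of the listed stopping criteria (a predefined number of epochs and validation loss stabilization) are shown.

```python
from typing import Callable, List


def train_until_stopped(
    train_one_epoch: Callable[[], None],
    validation_loss: Callable[[], float],
    max_epochs: int = 100,
    patience: int = 3,
    min_delta: float = 1e-4,
) -> List[float]:
    """Train until a stopping criterion is met; return the validation-loss history."""
    history: List[float] = []
    best = float("inf")
    stale_epochs = 0
    for _ in range(max_epochs):          # criterion 1: predefined number of epochs
        train_one_epoch()                # continue training on the training data
        loss = validation_loss()         # evaluate on held-out data
        history.append(loss)
        if best - loss > min_delta:      # meaningful improvement over the best so far
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:     # criterion 2: validation loss has stabilized
            break
    return history
```

Passing the training step and the validation measurement as callables keeps the stopping logic independent of any particular model, so the same loop could wrap other criteria such as an accuracy threshold.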

Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., a memory) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.

Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.

Non-transitory computer-readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (“SSDs”) (e.g., based on RAM), Flash memory, phase-change memory (“PCM”), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.

Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.

Computer-executable instructions comprise, for example, instructions and data which, when executed by a processor, cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In one or more embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction and then scaled accordingly.

A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.

FIG. 16 shows an example of a method 1600 for training a diffusion model according to aspects of the present disclosure. In some embodiments, the method 1600 describes an operation of a training component for configuring a diffusion transformer model as described with reference to FIG. 18. The method 1600 represents an example for training a reverse diffusion process as described above with reference to FIG. 14. In some examples, these operations are performed by a system including a processor executing a set of codes to control functional elements of an apparatus, such as the guided diffusion model described in FIG. 12.

Additionally, or alternatively, certain processes of method 1600 may be performed using special-purpose hardware. Generally, these operations are performed according to the methods and processes described in accordance with aspects of the present disclosure. In some cases, the operations described herein are composed of various substeps or are performed in conjunction with other operations.

At operation 1605, the user initializes an untrained model. Initialization can include defining the architecture of the model and establishing initial values for the model parameters. In some cases, the initialization can include defining hyper-parameters such as the number of layers, the resolution and channels of each layer block, the location of skip connections, and the like.

At operation 1610, the system adds noise to a media item using a forward diffusion process in N stages. In some cases, the forward diffusion process is a fixed process where Gaussian noise is successively added to the media item. In latent diffusion models (e.g., the token space), the Gaussian noise may be successively added to features in a latent space.
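A fixed N-stage forward process of this kind can be sketched as below. The sketch assumes the common variance-preserving update and a linear noise schedule, neither of which is specified in the source; the names `linear_beta_schedule`, `forward_diffusion`, and `signal_fraction` and the schedule endpoints are hypothetical, and a flat list of floats stands in for the media item or its latent features.

```python
import math
import random
from typing import List


def linear_beta_schedule(num_stages: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Hypothetical noise schedule: variances rising linearly over the N stages."""
    if num_stages == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_stages - 1)
    return [beta_start + i * step for i in range(num_stages)]


def forward_diffusion(x0: List[float], betas: List[float], rng: random.Random) -> List[List[float]]:
    """Return [x_0, x_1, ..., x_N], adding Gaussian noise at every stage."""
    trajectory = [list(x0)]
    for beta in betas:
        prev = trajectory[-1]
        # x_n = sqrt(1 - beta_n) * x_{n-1} + sqrt(beta_n) * eps, with eps ~ N(0, 1)
        nxt = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0) for v in prev]
        trajectory.append(nxt)
    return trajectory


def signal_fraction(betas: List[float]) -> float:
    """Cumulative product of (1 - beta_n): the share of x_0's variance that survives."""
    return math.prod(1.0 - b for b in betas)
```

With enough stages the surviving signal fraction approaches zero, which is why the final stage can be treated as approximately pure Gaussian noise.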

At operation 1615, at each stage n, starting with stage N, the system uses a reverse diffusion process to predict the output or features at stage n−1. For example, the reverse diffusion process can predict the noise that was added by the forward diffusion process, and the predicted noise can be removed from the noised input to obtain the predicted output. In some cases, an original media item is predicted at each stage of the training process.
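Removing predicted noise to obtain a predicted original can be illustrated as follows. This is a sketch under the assumption of the standard closed-form noising relation, which the source does not state; `add_noise`, `predict_original`, and `signal_frac` are hypothetical names, and the noise prediction itself (the model's job) is simply passed in.

```python
import math
from typing import List


def add_noise(x0: List[float], noise: List[float], signal_frac: float) -> List[float]:
    """Closed-form noising: x_n = sqrt(a) * x_0 + sqrt(1 - a) * eps, with a in (0, 1]."""
    return [math.sqrt(signal_frac) * x + math.sqrt(1.0 - signal_frac) * e
            for x, e in zip(x0, noise)]


def predict_original(x_n: List[float], predicted_noise: List[float], signal_frac: float) -> List[float]:
    """Remove the predicted noise from the noised input to estimate x_0."""
    return [(x - math.sqrt(1.0 - signal_frac) * e) / math.sqrt(signal_frac)
            for x, e in zip(x_n, predicted_noise)]
```

If the predicted noise equals the noise that was actually added, `predict_original` recovers the original values up to floating-point error; an imperfect prediction yields a correspondingly imperfect estimate.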

At operation 1620, the system compares the predicted output (or features) at stage n−1 to an actual media item (or features), such as the output at stage n−1 or the original input. For example, given observed data x, the diffusion model may be trained to minimize the variational upper bound of the negative log-likelihood $-\log p_\theta(x)$ of the training data.
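For reference, the variational upper bound mentioned here is commonly written as below. The source does not give the formula, so this is the standard form from the diffusion literature, stated as an assumption: $x_0 = x$ is the observed data, $x_{1:N}$ are the noised versions across the N stages, $q$ is the fixed forward process, and $p_\theta$ is the learned reverse process.

```latex
-\log p_\theta(x_0)
  \;\le\;
  \mathbb{E}_{q(x_{1:N} \mid x_0)}
  \left[ -\log \frac{p_\theta(x_{0:N})}{q(x_{1:N} \mid x_0)} \right]
```

Minimizing the right-hand side over the parameters $\theta$ lowers an upper bound on the negative log-likelihood on the left.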

At operation 1625, the system updates parameters of the model based on the comparison. For example, parameters of a U-Net may be updated using gradient descent. Time-dependent parameters of the Gaussian transitions can also be learned. However, in some embodiments, for the diffusion transformer model, the system updates parameters of each transformer block using a mean square error denoising loss.
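A mean square error denoising loss and a gradient-descent update can be sketched with a deliberately tiny stand-in model. Everything here is an assumption made for illustration: the two-parameter denoiser `eps_hat = w * x_noisy + b` replaces the transformer blocks, and the names `mse`, `gradient_step`, `signal_frac`, and the learning rate are hypothetical.

```python
import math
import random
from typing import List, Tuple


def mse(pred: List[float], target: List[float]) -> float:
    """Mean square error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def gradient_step(w: float, b: float, x_noisy: List[float], noise: List[float],
                  lr: float) -> Tuple[float, float]:
    """One gradient-descent update of a toy denoiser eps_hat = w * x_noisy + b."""
    n = len(x_noisy)
    residual = [w * x + b - e for x, e in zip(x_noisy, noise)]
    grad_w = sum(2.0 * r * x for r, x in zip(residual, x_noisy)) / n  # dLoss/dw
    grad_b = sum(2.0 * r for r in residual) / n                       # dLoss/db
    return w - lr * grad_w, b - lr * grad_b


rng = random.Random(0)
signal_frac = 0.5  # hypothetical share of the clean signal kept at this stage
clean = [rng.uniform(-1.0, 1.0) for _ in range(256)]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
x_noisy = [math.sqrt(signal_frac) * c + math.sqrt(1.0 - signal_frac) * e
           for c, e in zip(clean, noise)]

w, b = 0.0, 0.0
loss_before = mse([w * x + b for x in x_noisy], noise)
for _ in range(200):
    w, b = gradient_step(w, b, x_noisy, noise, lr=0.1)
loss_after = mse([w * x + b for x in x_noisy], noise)
```

The loss compares predicted noise with the noise that was actually added, so repeated updates move the parameters toward values that explain more of that noise; a real diffusion transformer would apply the same kind of update to the weights of every block via backpropagation.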

FIG. 17 shows an example of a computing device 1700 according to aspects of the present disclosure. The computing device 1700 may be an example of the generative AI digital media system apparatus (e.g., an apparatus for interacting with the generative AI digital visual system 102, which is described above). In one aspect, computing device 1700 includes processor(s) 1705, memory subsystem 1710, communication interface 1715, I/O interface 1720, user interface component(s) 1725, and channel 1730.

In one or more embodiments, computing device 1700 is an example of, or includes aspects of, the generative AI digital visual system 102 described above. In one or more embodiments, computing device 1700 includes one or more processors 1705 that can execute instructions stored in memory subsystem 1710 to perform media generation.

According to some aspects, computing device 1700 includes one or more processors 1705. In some cases, a processor is an intelligent hardware device (e.g., a general-purpose processing component, a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or a combination thereof). In some cases, a processor is configured to operate a memory array using a memory controller. In other cases, a memory controller is integrated into a processor. In some cases, a processor is configured to execute computer-readable instructions stored in a memory to perform various functions. In one or more embodiments, a processor includes special purpose components for modem processing, baseband processing, digital signal processing, or transmission processing.

According to some aspects, memory subsystem 1710 includes one or more memory devices. Examples of a memory device include random access memory (RAM), read-only memory (ROM), solid state memory, and a hard disk drive. In some examples, memory is used to store computer-readable, computer-executable software including instructions that, when executed, cause a processor to perform various functions described herein. In some cases, the memory contains, among other things, a basic input/output system (BIOS) which controls basic hardware or software operation such as the interaction with peripheral components or devices. In some cases, a memory controller operates memory cells. For example, the memory controller can include a row decoder, column decoder, or both. In some cases, memory cells within a memory store information in the form of a logical state.

According to some aspects, communication interface 1715 operates at a boundary between communicating entities (such as computing device 1700, one or more user devices, a cloud, and one or more databases) and channel 1730 and can record and process communications. In some cases, communication interface 1715 is provided to enable a processing system coupled to a transceiver (e.g., a transmitter and/or a receiver). In some examples, the transceiver is configured to transmit (or send) and receive signals for a communications device via an antenna.

According to some aspects, I/O interface 1720 is controlled by an I/O controller to manage input and output signals for computing device 1700. In some cases, I/O interface 1720 manages peripherals not integrated into computing device 1700. In some cases, I/O interface 1720 represents a physical connection or port to an external peripheral. In some cases, the I/O controller uses an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or other known operating system. In some cases, the I/O controller represents or interacts with a modem, a keyboard, a mouse, a touchscreen, or a similar device. In some cases, the I/O controller is implemented as a component of a processor. In some cases, a user interacts with a device via I/O interface 1720 or via hardware components controlled by the I/O controller.

According to some aspects, user interface component(s) 1725 enable a user to interact with computing device 1700. In some cases, user interface component(s) 1725 include an audio device, such as an external speaker system, an external display device such as a display screen, an input device (e.g., a remote-control device interfaced with a user interface directly or through the I/O controller), or a combination thereof. In some cases, user interface component(s) 1725 include a GUI.

FIG. 18 shows an example of a generative AI digital media system apparatus 1800 according to aspects of the present disclosure. The generative AI digital media system apparatus 1800 may include an example of, or aspects of, the diffusion model described with reference to FIG. 12. In one or more embodiments, generative AI digital media system apparatus 1800 includes processor unit 1805, memory unit 1810, diffusion transformer model 1815, I/O module 1820, and training component 1825. Training component 1825 updates parameters of the diffusion transformer model 1815 stored in memory unit 1810. In some examples, the training component 1825 is located outside the generative AI digital media system apparatus 1800.

Processor unit 1805 includes one or more processors. A processor is an intelligent hardware device, such as a general-purpose processing component, a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof.

In some cases, processor unit 1805 is configured to operate a memory array using a memory controller. In other cases, a memory controller is integrated into processor unit 1805. In some cases, processor unit 1805 is configured to execute computer-readable instructions stored in memory unit 1810 to perform various functions. In some aspects, processor unit 1805 includes special purpose components for modem processing, baseband processing, digital signal processing, or transmission processing. According to some aspects, processor unit 1805 comprises one or more processors described with reference to FIG. 17.

Memory unit 1810 includes one or more memory devices. Examples of a memory device include random access memory (RAM), read-only memory (ROM), solid state memory, and a hard disk drive. In some examples, memory is used to store computer-readable, computer-executable software including instructions that, when executed, cause at least one processor of processor unit 1805 to perform various functions described herein.

In some cases, memory unit 1810 includes a basic input/output system (BIOS) that controls basic hardware or software operations, such as an interaction with peripheral components or devices. In some cases, memory unit 1810 includes a memory controller that operates memory cells of memory unit 1810. For example, the memory controller may include a row decoder, column decoder, or both. In some cases, memory cells within memory unit 1810 store information in the form of a logical state. According to some aspects, memory unit 1810 is an example of the memory subsystem 1710 described with reference to FIG. 17.

According to some aspects, generative AI digital media system apparatus 1800 uses one or more processors of processor unit 1805 to execute instructions stored in memory unit 1810 to perform functions described herein. For example, the generative AI digital media system apparatus may perform the operations described in the aspects below.

The memory unit 1810 may include a diffusion transformer model 1815 trained to remove noise from noised tokens according to spatial-temporal positional encodings. For example, after training, the diffusion transformer model 1815 may perform inferencing operations as described with reference to FIGS. 12-13 to remove noise from noised tokens and generate media such as video and/or images.

In one or more embodiments, the diffusion transformer model 1815 is an artificial neural network (ANN). An ANN can be a hardware component or a software component that includes connected nodes (i.e., artificial neurons) that loosely correspond to the neurons in a human brain. Each connection, or edge, transmits a signal from one node to another (like the physical synapses in a brain). When a node receives a signal, it processes the signal and then transmits the processed signal to other connected nodes. Specifically, each transformer block of the diffusion transformer model 1815 can represent the connected nodes.

ANNs have numerous parameters, including weights and biases associated with each neuron in the network, which control the degree of connection between neurons and influence the neural network's ability to capture complex patterns in data. These parameters, also known as model parameters or model weights, are variables that determine the behavior and characteristics of a machine learning model. Accordingly, the multi-layer perceptrons within each transformer block of the diffusion transformer model represent various aspects of an ANN.

In some cases, the signals between nodes comprise real numbers, and the output of each node is computed by a function of its inputs. For example, nodes may determine their output using other mathematical algorithms, such as selecting the max from the inputs as the output, or any other suitable algorithm for activating the node. Each node and edge are associated with one or more node weights that determine how the signal is processed and transmitted. In some cases, nodes have a threshold below which a signal is not transmitted at all. In some examples, the nodes are aggregated into layers.
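The node behaviors described above (a function of weighted inputs, a transmission threshold, and max selection) can be sketched as follows. The function names `weighted_node` and `max_node` and the choice to emit 0.0 for a suppressed signal are assumptions for illustration; the source does not prescribe a particular activation.

```python
from typing import Sequence


def weighted_node(inputs: Sequence[float], weights: Sequence[float],
                  bias: float = 0.0, threshold: float = 0.0) -> float:
    """Weighted sum of real-valued inputs; signals below the threshold are not transmitted."""
    signal = sum(x * w for x, w in zip(inputs, weights)) + bias
    return signal if signal >= threshold else 0.0


def max_node(inputs: Sequence[float]) -> float:
    """Alternative activation: select the maximum input as the output."""
    return max(inputs)
```

For example, `weighted_node([1.0, 2.0], [0.5, 0.25], bias=0.5)` transmits a signal because the weighted sum clears the default threshold, whereas `weighted_node([1.0, 2.0], [0.5, -1.0])` yields a negative sum and is suppressed.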

The parameters of the diffusion transformer model 1815 can be organized into layers. Different layers perform different transformations on their inputs. The initial layer is known as the input layer and the last layer is known as the output layer. In some cases, signals traverse certain layers multiple times. A hidden (or intermediate) layer includes hidden nodes and is located between an input layer and an output layer. Hidden layers perform nonlinear transformations of inputs entered into the network. Each hidden layer is trained to produce a defined output that contributes to a joint output of the output layer of the ANN. Hidden representations are machine-readable data representations of an input that are learned from hidden layers of the ANN and are produced by the output layer. As the ANN's understanding of the input improves during training, the hidden representation is progressively differentiated from earlier iterations.

Training component 1825 may train the diffusion transformer model 1815. For example, parameters of the diffusion transformer model 1815 can be learned or estimated from training data and then used to make predictions or perform tasks based on learned patterns and relationships in the data. In some examples, the parameters are adjusted during the training process to minimize a loss function or maximize a performance metric. The goal of the training process may be to find optimal values for the parameters that allow the machine learning model to make accurate predictions or perform well on the given task.

Accordingly, the node weights can be adjusted to improve the accuracy of the output (i.e., by minimizing a loss which corresponds in some way to the difference between the current result and the target result). The weight of an edge increases or decreases the strength of the signal transmitted between nodes. For example, during the training process, an algorithm adjusts machine learning parameters to minimize an error or loss between predicted outputs and actual targets according to optimization techniques like gradient descent, stochastic gradient descent, or other optimization algorithms. Once the machine learning parameters are learned from the training data, the diffusion transformer model 1815 can be used to make predictions on new, unseen data (i.e., during inference).

I/O module 1820 receives inputs from and transmits outputs of the generative AI digital media system apparatus 1800 to other devices or users. For example, I/O module 1820 receives inputs for the diffusion transformer model 1815 and transmits outputs of the diffusion transformer model 1815. According to some aspects, I/O module 1820 is an example of the I/O interface 1720 described with reference to FIG. 17.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06T G06T5/70 G06T5/60 G06T2207/10016 G06T2207/20084

Patent Metadata

Filing Date

January 23, 2025

Publication Date

March 12, 2026

Inventors

Tobias Hinz

Lior Shapira

Lakshya Lnu

Kevin Duarte

Ali Aminian
