US-12568260-B2

Systems and methods for generating transitions between videos

Published: March 3, 2026

Assignee: not available in the USPTO data

Inventors: not available in the USPTO data

Technical Abstract

A method for generating a video transition sequence between a first and a second video is provided. The method first identifies and embeds starting and ending images for the video transition sequence into a latent vector space. The method then interpolates to generate an interpolation trajectory between the starting and ending image embeddings within the latent vector space. The method then reconstructs the interpolation trajectory from the latent vector space to a final video transition sequence in the original image vector space. This video transition sequence is then displayed so as to smoothly transition between two videos.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A method for generating a video transition sequence between a first video and a second video, the method comprising:

. The method of, wherein identifying the starting image and the ending image is based on at least one of a similarity between the starting image and the ending image, a proximity of the ending image to a first image of the second video, or a proximity of the starting image to a last image of the first video.

. The method of, wherein the one or more neural networks comprise a generative adversarial network (GAN) that comprises a discriminator network and a generator network.

. The method of, wherein the starting embedding and the ending embedding are generated using the generator network of the GAN.

. The method of, wherein the video transition sequence between the first video and the second video is obtained using the generator network of the GAN.

. The method of, wherein the generator network and the discriminator network of the GAN are pre-trained using videos associated with a video-streaming platform, and wherein the generator network is pre-trained using labels generated by the discriminator network of the GAN, a loss function applied to the labels, and an optimization algorithm.

. The method of, wherein generating the trajectory of points in the latent vector space comprises using at least one of a linear interpolation, a spherical linear interpolation, or a polynomial interpolation.

. The method of, further comprising using an interpolation polynomial, wherein one or more coefficients of the interpolation polynomial are randomly generated.

. The method of, wherein the video transition sequence between the first video and the second video has a frame rate of 30 frames per second or more.

. The method of, wherein the video transition sequence between the first video and the second video has a duration that is determined by a loading time of the second video, wherein the second video begins to load once the first video has ended.

. The method of, wherein the second video comprises at least one of a video suggested to a user or a video on a playlist for the user, wherein the video transition sequence between the first video and the second video is obtained while the first video is being displayed to the user and prior to receiving a user request for the second video, and wherein displaying the obtained video transition sequence is responsive to receiving the user request for the second video.

. The method of, further comprising:

. A system for generating a video transition sequence, comprising:

. The system of, wherein identifying the starting image and the ending image is based on at least one of a similarity between the starting image and the ending image, a proximity of the ending image to a first image of the second video, or a proximity of the starting image to a last image of the first video.

. The system of, wherein the one or more neural networks comprise a generative adversarial network (GAN) that comprises a discriminator network and a generator network.

. The system of, wherein the starting embedding and the ending embedding are generated using the generator network of the GAN.

. The system of, wherein the video transition sequence between the first video and the second video is obtained using the generator network of the GAN.

. The system of, wherein the generator network and the discriminator network of the GAN are pre-trained using videos associated with a video-streaming platform, and wherein the generator network is pre-trained using labels generated by the discriminator network of the GAN, a loss function applied to the labels, and an optimization algorithm.

. The system of, wherein to generate the trajectory of points in the latent vector space, the processing device is to use at least one of a linear interpolation, a spherical linear interpolation, or a polynomial interpolation.

. A non-transitory computer readable storage medium comprising instructions that, when executed by a processing device, causes the processing device to perform operations comprising:

Detailed Description

Complete technical specification and implementation details from the patent document.

The instant specification generally relates to systems and methods for generating transitions between two videos.

Today, a user can access an unprecedented amount of video data online. Many of these videos are much shorter than a full-length movie and are associated with dedicated video-streaming platforms (e.g., YouTube, Twitch, Instagram, etc.). A user may enter such a video-streaming platform and self-select a sequence of several videos to watch in series (i.e., back-to-back), with few or limited breaks between consecutive videos. A user can select a video, watch the video, and (usually at, or near, the end of the video) select the next video to play in the sequence. The selected video can begin to play immediately once the previous video has ended. In such a way, a user can watch a sequence of videos of any length or duration.

Although a video-streaming platform may recommend the next video for a user to watch, that does not necessarily mean a viewer will choose that video, or that consecutive videos will be related. As an example of a watched sequence of videos, a viewer can navigate from an educational lecture to a music video to a cooking tutorial in a short amount of time and with only a few clicks. The ability to traverse a wide variety of video types, combined with a user's ability to control the sequence of that traversal, can provide a unique and diverse viewing experience.

The below summary is a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended neither to identify key or critical elements of the disclosure, nor delineate any scope of the particular embodiments of the disclosure or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.

According to one aspect of the present disclosure, a method for generating a video transition sequence between a first video and a second video is provided. The method includes identifying, for the video transition sequence, a starting image and an ending image, wherein the starting image is associated with the first video and the ending image is associated with the second video. The method further includes generating, using one or more neural networks, a starting embedding of the starting image in a latent vector space. The method further includes generating, using the one or more neural networks, an ending embedding of the ending image in the latent vector space. The method further includes interpolating between the starting embedding and the ending embedding to generate an embedded transition sequence in the latent vector space. The method further includes obtaining, using the one or more neural networks, the video transition sequence between the first video and the second video using the embedded transition sequence in the latent vector space. The method further includes displaying, on a graphical user interface, the obtained video transition sequence between the first video and the second video.

In some aspects, identifying the starting image and the ending image is based on at least one of: a similarity between the starting image and the ending image, a proximity of the ending image to a first image of the second video, or a proximity of the starting image to a last image of the first video.

In some aspects, the one or more neural networks comprise a generator network of a generative adversarial network (GAN) that comprises a discriminator network and the generator network.

In some aspects, the starting embedding and the ending embedding are generated using the generator network of the GAN.

In some aspects, the video transition sequence between the first video and the second video is obtained using the generator network of the GAN.

In some aspects, the generator network and the discriminator network of the GAN are pre-trained using videos associated with a video-streaming platform, and wherein the generator network is pre-trained using labels generated by the discriminator network of the GAN, a loss function applied to the labels, and an optimization algorithm.

In some aspects, interpolating between the starting image embedding and the ending image embedding comprises using at least one of a linear interpolation, a spherical linear interpolation, or a polynomial interpolation.

In some aspects, interpolating between the starting image embedding and the ending image embedding comprises use of an interpolation polynomial, wherein one or more coefficients of the interpolation polynomial are randomly generated.
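The three interpolation families named above can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the function names are hypothetical, embeddings are assumed to be non-zero lists of floats, and the quadratic form used for the "polynomial with random coefficients" case is only one possible reading, chosen so that the path still starts and ends exactly at the two embeddings.

```python
import math
import random


def lerp(z0, z1, t):
    """Linear interpolation between two latent vectors for t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]


def slerp(z0, z1, t):
    """Spherical linear interpolation; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to lerp to avoid dividing by ~0.
        return lerp(z0, z1, t)
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


def random_quadratic_path(z0, z1, scale=0.1, rng=None):
    """Return a function t -> z(t) following a quadratic polynomial.

    Assumed form: z(t) = (1 - t) z0 + t z1 + t (1 - t) c, where the
    coefficient vector c is drawn at random. The t (1 - t) factor vanishes
    at t = 0 and t = 1, so the endpoints are preserved.
    """
    rng = rng or random.Random()
    bend = [rng.gauss(0.0, scale) for _ in z0]

    def path(t):
        return [
            (1 - t) * a + t * b + t * (1 - t) * c
            for a, b, c in zip(z0, z1, bend)
        ]

    return path
```

Randomizing the bend coefficient gives a different trajectory each time for the same pair of embeddings, which is one plausible motivation for random polynomial coefficients.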

In some aspects, the embedded transition sequence in the latent vector space comprises a plurality of embedded transition vectors in the latent vector space, wherein the video transition sequence comprises a plurality of transition sequence frames, and wherein obtaining the video transition sequence comprises reconstructing each vector of the plurality of embedded transition vectors in the latent vector space into a frame of the plurality of transition sequence frames.

In some aspects, the video transition sequence between the first video and the second video has a frame rate of 30 frames per second or more.

In some aspects, the video transition sequence between the first video and the second video has a duration that is determined by a loading time of the second video, wherein the second video begins to load once the first video has ended.
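As a rough illustration of how a loading time and a frame rate could jointly fix the length of the transition, the sketch below converts a loading-time estimate into a frame count and a set of interpolation parameters. The function names, the two-frame minimum, and the assumption that a loading-time estimate in seconds is available are all hypothetical.

```python
import math


def transition_frame_count(loading_time_s, fps=30.0, min_frames=2):
    """Number of frames needed to cover loading_time_s seconds at fps.

    At least min_frames are returned so the sequence always contains
    both a starting and an ending frame.
    """
    return max(min_frames, math.ceil(loading_time_s * fps))


def interpolation_steps(num_frames):
    """Evenly spaced parameters t in [0, 1], one per transition frame."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [i / (num_frames - 1) for i in range(num_frames)]
```

For example, a 1.5 second load at 30 frames per second would call for 45 transition frames.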

In some aspects, the second video comprises at least one of a video suggested to a user or a video on a playlist for the user, wherein the video transition sequence between the first video and the second video is obtained while the first video is being displayed to the user and prior to receiving a user request for the second video, and wherein displaying the obtained video transition sequence is responsive to receiving the user request for the second video.

In some aspects, the method further includes obtaining, prior to receiving the user request for the second video, an additional video transition sequence between the first video and a third video.

According to one aspect of the present disclosure, a system for generating a video transition sequence is provided. The system includes a memory device and a processing device communicatively coupled to the memory device. In some aspects, the processing device is to identify, for the video transition sequence, a starting image and an ending image. In some aspects, the starting image is associated with a first video and the ending image is associated with a second video. The processing device is further to generate, using one or more neural networks, a starting embedding of the starting image in a latent vector space. The processing device is further to generate, using the one or more neural networks, an ending embedding of the ending image in the latent vector space. The processing device is further to interpolate between the starting embedding and the ending embedding to generate an embedded transition sequence in the latent vector space. The processing device is further to obtain, using the one or more neural networks, the video transition sequence between the first video and the second video using the embedded transition sequence in the latent vector space. The processing device is further to display, on a graphical user interface, the obtained video transition sequence between the first video and the second video.

In some aspects, identifying the starting image and the ending image is based on at least one of a similarity between the starting image and the ending image, a proximity of the ending image to a first image of the second video, or a proximity of the starting image to a last image of the first video.

In some aspects, the one or more neural networks comprise a generator network of a generative adversarial network (GAN) that comprises a discriminator network and the generator network.

In some aspects, the starting embedding and the ending embedding are generated using the generator network of the GAN.

In some aspects, the video transition sequence between the first video and the second video is obtained using the generator network of the GAN.

According to one aspect of the present disclosure, a non-transitory computer readable storage medium is provided. In some aspects, the computer readable storage medium includes instructions that, when executed by a processing device, cause the processing device to perform operations including identifying, for a video transition sequence, a starting image and an ending image. In some aspects, the starting image is associated with a first video and the ending image is associated with a second video. The operations further include generating, using one or more neural networks, a starting embedding of the starting image in a latent vector space. The operations further include generating, using the one or more neural networks, an ending embedding of the ending image in the latent vector space. The operations further include interpolating between the starting embedding and the ending embedding to generate an embedded transition sequence in the latent vector space. The operations further include obtaining, using the one or more neural networks, the video transition sequence between the first video and the second video using the embedded transition sequence in the latent vector space. The operations further include displaying, on a graphical user interface, the obtained video transition sequence between the first video and the second video.

Existing video-streaming platforms share some challenges when delivering video content to a viewer. One such challenge is that flexibility in video selection, along with the wide range of accessible videos, can lead to a visual discontinuity in the sequence of videos. Such discontinuity may be in the video content, or in a video's sentiment, tone, color, language, or some other characteristic. For example, the video sequence selected by a viewer might transition from a video with dark, suspenseful, and thrilling qualities to one that is bright, light-hearted, and humorous, whether or not such a video has been recommended by the platform. Subsequently, a user may select a next video that is sharp, fast-paced, and action-packed, and so on. In such an example, transitions between pieces of video content may be abrupt. In some cases, such abruptness can become jarring and can negatively affect the user experience. Accordingly, it is advantageous to improve continuity and coherency in this type of online video user experience.

In some cases, such a goal may be accomplished with smoother transitions between videos. Gradually introducing elements of the next video, while gradually phasing out elements of the current video, can provide a gentler and more natural experience for the user. Smoother transitions between consecutive and diverse video content could provide a more seamless and intuitive viewing experience, aiding in a user's cognitive and emotional adjustment between two videos.

Aspects and implementations of the present disclosure address these and other challenges of modern video-streaming platforms by providing systems and techniques for autonomous generation of seamless transitions in this type of online video consumption experience. According to embodiments of the present disclosure, a computer program can identify a starting point for a video transition within a first video, identify an ending point for the video transition in an upcoming, second video, and generate a smooth video transition sequence between the starting and ending points, based on the combined elements of the first and second video content.

In some embodiments, both the first video and the second video can first be analyzed to identify root still images for starting and ending points of the video transition sequence to be generated, e.g., a still image near the end of the first video may be identified as a starting point and a still image near the first image in the second, upcoming video may be identified as an ending point.
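One way to picture this root-image selection is as a search over a few frames near the end of the first video and a few frames near the start of the second, scoring each pair by similarity and by closeness to the cut. The sketch below is only an assumed scoring scheme: the per-frame feature vectors, the cosine similarity measure, the window size, and the distance penalty are illustrative choices, not details taken from the disclosure.

```python
import math
from itertools import product


def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_root_frames(first_video, second_video, window=3, distance_penalty=0.05):
    """Return (start_index, end_index) of the chosen root still images.

    first_video and second_video are lists of per-frame feature vectors.
    Candidates are the last `window` frames of the first video and the
    first `window` frames of the second. Each pair is scored by similarity,
    minus a penalty for how far the frames sit from the cut point.
    """
    n = len(first_video)
    tail = range(max(0, n - window), n)
    head = range(min(window, len(second_video)))

    def score(pair):
        i, j = pair
        similarity = cosine_similarity(first_video[i], second_video[j])
        distance_from_cut = (n - 1 - i) + j
        return similarity - distance_penalty * distance_from_cut

    return max(product(tail, head), key=score)
```

With a larger penalty the selection collapses to the last frame of the first video and the first frame of the second; with a smaller one, visually similar frames slightly further from the cut can win.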

Next, to generate the video transition sequence between the starting and ending still images, a pre-trained generator of a generative adversarial network (GAN) (including a generator (G) network and a discriminator (D) network) can be used to generate latent vectors (e.g., image embeddings) representing both still images in a latent vector space. More specifically, the generator may be applied in a GAN inversion subprocess to the starting and ending still images to embed the image data of both images into starting and ending latent space vector representations.
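The disclosure does not spell out how the GAN inversion is carried out. One common approach, shown here purely as an assumption, is optimization-based inversion: hold the generator fixed and adjust a latent vector by gradient descent until the generated image matches the target. To keep the sketch self-contained, the generator is replaced by a hypothetical linear map (image = W z); a real generator would be a deep network and the gradient would come from automatic differentiation.

```python
def generate(weights, z):
    """Toy stand-in for a generator: image = W @ z (one row per pixel)."""
    return [sum(w * v for w, v in zip(row, z)) for row in weights]


def invert(weights, image, steps=500, lr=0.1):
    """Find z such that generate(weights, z) approximates image.

    Minimizes 0.5 * ||W z - image||^2 by plain gradient descent,
    starting from the zero vector. The gradient is W^T (W z - image).
    """
    latent_dim = len(weights[0])
    z = [0.0] * latent_dim
    for _ in range(steps):
        residual = [g - x for g, x in zip(generate(weights, z), image)]
        grad = [
            sum(weights[r][c] * residual[r] for r in range(len(weights)))
            for c in range(latent_dim)
        ]
        z = [v - lr * g for v, g in zip(z, grad)]
    return z
```

Applying the same routine to the starting and ending still images would yield the two latent vectors between which the transition is interpolated.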

In some embodiments, after generating the starting and ending latent space vector representations, an interpolation trajectory between the two latent vector representations may be determined. In some implementations, the computer program may discretize the interpolation trajectory and, in a series of increments, generate transitional latent space vector representations (e.g., interpolation points). Such transitional latent space vectors can then be reconstructed (e.g., mapped back) to the original image vector space through the generator of the GAN to generate transitional still images corresponding to the video transition sequence between the first video and the second video. In some implementations, the generated transitional still images have the same dimensions and resolution as the first video and the second video.
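The discretize-then-reconstruct step can be sketched as two small functions: one that samples evenly spaced points on a straight-line trajectory between the two latent vectors, and one that maps every sampled point back to an image through a generator. The names below are hypothetical, the trajectory is assumed to be linear for simplicity, and `toy_generator` is a made-up stand-in that returns a 2x2 grayscale "image" so the example runs without a trained network.

```python
from typing import Callable, List

Vector = List[float]
Image = List[List[float]]


def latent_trajectory(z_start: Vector, z_end: Vector, num_frames: int) -> List[Vector]:
    """Sample num_frames evenly spaced latent vectors from z_start to z_end."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    steps = [i / (num_frames - 1) for i in range(num_frames)]
    return [
        [(1 - t) * a + t * b for a, b in zip(z_start, z_end)]
        for t in steps
    ]


def reconstruct_frames(
    trajectory: List[Vector], generator: Callable[[Vector], Image]
) -> List[Image]:
    """Map each latent vector back to image space, one frame per vector."""
    return [generator(z) for z in trajectory]


def toy_generator(z: Vector) -> Image:
    """Hypothetical stand-in for a trained generator (2-D latent, 2x2 image)."""
    return [[z[0], z[1]], [z[0] + z[1], z[0] - z[1]]]
```

Because every frame comes from the same generator, all transitional images share one shape, which mirrors the statement that the generated stills match the dimensions of the source videos.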

A video transition sequence generated between any two videos can be displayed to the viewer using a suitable graphical user interface to implement a seamless transition between any two videos.

The benefits of the disclosed process include producing a video transition sequence that fits seamlessly between the two videos, while the transition between the two videos is pleasing to the human eye. Generating and discretizing an interpolation trajectory within the latent vector space, as opposed to interpolating directly between still images, has the benefit of morphing visual features rather than merely interpolating pixels. This results in softer and smoother pixel-level changes in what is rendered to the user (viewer).
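The contrast between the two approaches can be written out explicitly. The notation here is assumed for illustration and does not appear in the source: x_0 and x_1 denote the starting and ending still images, z_0 and z_1 their latent embeddings, G the generator, and t the position along the transition. A linear trajectory is used for simplicity.

```latex
% Pixel-space cross-fade: every pixel is a weighted average of the two images
x_t^{\text{pixel}} = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]

% Latent-space transition: interpolate the embeddings, then reconstruct with G
z_t = (1 - t)\,z_0 + t\,z_1, \qquad x_t^{\text{latent}} = G(z_t)
```

In the first case intermediate frames are superimposed ghost images of both stills; in the second, each intermediate frame is itself an output of the generator, which is why visual features appear to morph rather than fade.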

In such a way, smoother transitions can be displayed between videos in a video sequence. Such transitions can aid in a user's cognitive and emotional adjustment between videos, increase continuity and coherency, and generally create a more seamless and pleasant viewing experience for a user.

illustrates an example system architecture capable of supporting a video processing module that generates a video transition sequence, in accordance with one embodiment of the present disclosure.

The system architecture (also referred to as “system” herein) includes an input data store, a user interface (UI), and a server. The server can include a video processing module software program including root image identification, latent vector generation, interpolation, and reconstruction.

In some embodiments, the input data store, the UI, and the server may be connected via a network (not explicitly shown). In certain embodiments, the network may include a public network (e.g., the Internet), a private network (e.g., a local area network (LAN) or wide area network (WAN)), a wired network (e.g., Ethernet network), a wireless network (e.g., an 802.11 network or a Wi-Fi network), a cellular network (e.g., a Long Term Evolution (LTE) network), routers, hubs, switches, server computers, and/or a combination thereof.

In other embodiments, the video processing module, the input data store, and/or the UI may all be a part of one computing device or one server, and transmit data internally (e.g., over a suitable bus/interconnect) without the use of a network. The computing device may include cloud-based computers, data processing servers, personal computers (PCs), laptops, mobile phones, smart phones, tablet computers, netbook computers, network-connected televisions, rack mount servers, etc.

In some embodiments, video data for processing by the video processing module may be stored by the data store. Input video data can be transmitted from the input data store to the server; an output video transition sequence can be transmitted from the server to the UI.

In some embodiments, a video transition sequence output by the video processing module may be returned to the same memory that originated the video data, e.g., the input data store may be implemented as part of a common memory.

In some embodiments, the input data store is a persistent storage that is capable of storing data as well as data structures to tag, organize, and index the data. A data item may include audio data and/or video stream data, in accordance with embodiments described herein. The input data store may be hosted by one or more storage devices, such as main memory, magnetic or optical storage-based disks, tapes or hard drives, network-attached storage (NAS), storage area network (SAN), and so forth. In some embodiments, the input data store may be a network-attached file server, while in other embodiments, the data store may be some other type of persistent storage, such as an object-oriented database, a relational database, and so forth, that may be hosted on one or more different machines coupled to the system architecture via a network. In some embodiments, the data store(s) may store portions of video streams.

In some embodiments, any one of the associated servers, including the server, may temporarily accumulate and store data until it is transferred to the UI for display, or to the data store for permanent storage.

Further to the descriptions above, a user may be provided with controls allowing the user to make an election as to both if and when systems, programs, or features described herein may enable collection of user information (e.g., information about a user's social network, social actions, or activities, profession, a user's preferences, or a user's current location), and if the user is sent content or communications from a server. In addition, certain data can be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user's identity can be treated so that no personally identifiable information can be determined for the user, or a user's geographic location can be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a specific location of a user cannot be determined. Thus, the user can have control over what information is collected about the user, how that information is used, and what information is provided to the user.

In some embodiments, the UI may be presented to a user as a way of displaying the video transition sequence. In some embodiments, the UI may be a part of a personal client device, such as a laptop, phone, or tablet, etc. In some embodiments, the UI may be a part of any client device capable of displaying a video transition sequence to a user.

In some embodiments, the video processing module may be a computer program executed by one or more processor(s) of the server. In other embodiments, the video processing module may be divided across multiple servers and executed by multiple processors. Other system architectures that differ, in one or more aspects, from the one shown are within the scope of the instant disclosure and may perform a function that is similar to the function performed by the system architecture and the video processing module.

In some embodiments, the server may deploy one or more processor(s), one or more memory devices, and a video processing module for applying root image identification, latent vector generation, interpolation, and reconstruction to the input video data.

In some embodiments, the video processing module may process multiple video pairs and generate multiple corresponding transition sequences in parallel. In a non-limiting example, a first video may be currently displayed on a user interface (e.g., being viewed by a user/viewer), and there may exist a number N of potential upcoming videos. For example, the viewer of the current video may be able to select from the N next videos to watch, which may include videos that are already placed on the viewer's playlist by the viewer or (as suggestions) by a media streaming service. In such an instance, the system and the video processing module may process and generate multiple (e.g., 2, 3, ..., N) potential transition sequences for the potential upcoming videos, in advance of the viewer selecting a new video. In such a way, the media streaming platform can reduce latency by precomputing one or more video transition sequences in anticipation of the viewer's decision to select a specific video.
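The precomputation idea can be sketched with a small cache that starts building a transition for every candidate next video while the current video plays, and hands back the finished sequence once the viewer picks one. Everything here is an illustrative assumption: the `TransitionCache` class, the string video identifiers, and the `build_transition` callable (which stands in for the whole identify, embed, interpolate, and reconstruct pipeline) are hypothetical names, and a thread pool is just one way to run the work in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class TransitionCache:
    """Precompute transitions for candidate next videos in the background."""

    def __init__(self, build_transition, max_workers=4):
        # build_transition(current_id, next_id) -> transition sequence
        self._build = build_transition
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def precompute(self, current_id, candidate_ids):
        """Start one background job per candidate not already scheduled."""
        for candidate_id in candidate_ids:
            key = (current_id, candidate_id)
            if key not in self._futures:
                self._futures[key] = self._pool.submit(
                    self._build, current_id, candidate_id
                )

    def get(self, current_id, selected_id):
        """Return the transition for the selected video.

        Blocks only if the job is still running; a video that was never
        precomputed is built on demand.
        """
        key = (current_id, selected_id)
        if key not in self._futures:
            self._futures[key] = self._pool.submit(
                self._build, current_id, selected_id
            )
        return self._futures[key].result()

    def close(self):
        """Wait for outstanding jobs and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A caller would invoke `precompute` when playback of the current video starts and `get` when the viewer's request for the next video arrives, so that the transition is usually ready by then.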

In some embodiments, the video transition sequence may be displayed to a viewer of the online videos.

A general overview of certain operations of video processing module, explained with respect to, will now be provided. A more detailed description of these processes will be provided below with respect to.

illustrates an example flow of input video data through the video processing module of, the proposed methods in accordance with one embodiment of the present disclosure.

In some embodiments, the video processing module processes data according to the dataflow. In some embodiments, the video processing module and dataflow operate by first receiving video data, e.g., input video data. The input video data can include two distinct videos between which the video processing module will create a transition sequence.
