US-10904476

Techniques for up-sampling digital media content

Published: January 26, 2021

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

Techniques for automated up-sampling of media files are provided. In some examples, a title associated with a media file, a metadata file associated with the title, and the media file may be received. The media file may be partitioned into one or more scene files, each scene file including a plurality of frame images in a sequence. One or more up-sampled scene files may be generated, each corresponding to a scene file of the one or more scene files. An up-sampled media file may be generated by combining at least a subset of the one or more up-sampled scene files. Generating one or more up-sampled scene files may include identifying one or more characters in a frame image of the plurality of frame images, based at least in part on implementation of a facial recognition algorithm including deep learning features in a neural network.

Patent Claims

17 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A non-transitory computer-readable medium storing computer-executable instructions that, when executed by a processor of a computer system, cause the computer system to at least: receive a title of a media file; determine, based at least in part on the title, that an up-sampled media file is unavailable in a media database, the media database storing the media file; receive a metadata file associated with the title from a metadata database different from the media database; receive the media file from the media database; partition the media file into one or more scene files, each scene file comprising a plurality of frame images in a sequence; generate one or more up-sampled scene files, each corresponding to a scene file of the one or more scene files, the one or more up-sampled scene files generated by: identifying one or more sub-regions of a frame image of the plurality of frame images using the metadata file; generating one or more up-sampled sub-regions at least in part by up-sampling the one or more sub-regions of the frame image using a first Generative Adversarial Network (GAN); defining a background region, the background region comprising a portion of the frame image excluding the one or more sub-regions; generating an up-sampled background region at least in part by up-sampling the background region of the frame image using a second GAN different from the first GAN; and generating an up-sampled frame image at least in part by combining the up-sampled background region with the one or more up-sampled sub-regions; and generate the up-sampled media file by combining at least a subset of the one or more up-sampled scene files.
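The per-frame step in claim 1 (separate models for the sub-regions and the background, then recombination) can be illustrated with a minimal Python sketch. It is not the patented implementation: the two GANs are replaced by plain callables, `nearest_upsample` is a nearest-neighbor stand-in, and the names `Box`, `upsample_frame`, `region_model`, and `background_model` are hypothetical. As a simplification, the background model sees the full frame and its output inside each box is overwritten.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, height, width) in low-res pixels
Upsampler = Callable[[Image, int], Image]


def nearest_upsample(img: Image, scale: int) -> Image:
    """Stand-in up-sampler (NOT a GAN): repeat each pixel `scale` times per axis."""
    return [[px for px in row for _ in range(scale)]
            for row in img for _ in range(scale)]


def upsample_frame(frame: Image, boxes: List[Box], scale: int,
                   region_model: Upsampler,
                   background_model: Upsampler) -> Image:
    """Up-sample sub-regions and background with different models, then combine."""
    # Simplification: up-sample the whole frame with the background model;
    # pixels inside the boxes are replaced below.
    out = background_model(frame, scale)
    for top, left, height, width in boxes:
        # Crop the low-resolution sub-region and up-sample it separately.
        crop = [row[left:left + width] for row in frame[top:top + height]]
        up = region_model(crop, scale)
        # Paste the up-sampled sub-region at its scaled position.
        for r, row in enumerate(up):
            start = left * scale
            out[top * scale + r][start:start + len(row)] = row
    return out
```

In a real pipeline the boxes would come from the metadata-driven identification step, and each callable would wrap a trained network.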

2. The computer-readable medium of claim 1, wherein partitioning the media file into one or more scene files comprises: generating an audio file and a video file from the media file; determining one or more scene transitions in the audio file based at least in part on audio volume; determining one or more scenes in the video file, each corresponding to a scene transition of the one or more scene transitions in the audio file; and generating one or more scene files, each corresponding to a scene of the one or more scenes in the video file.
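One way to read "scene transitions based on audio volume" is to measure volume per window and flag the points where the audio drops from loud to quiet. The Python sketch below assumes exactly that; the windowed RMS measure, the fixed threshold, and the function names are illustrative assumptions, not details given in the claim.

```python
import math
from typing import List


def window_rms(samples: List[float], window: int) -> List[float]:
    """RMS volume of each consecutive, non-overlapping window of samples."""
    return [math.sqrt(sum(s * s for s in samples[i:i + window]) / window)
            for i in range(0, len(samples) - window + 1, window)]


def quiet_transitions(samples: List[float], window: int,
                      threshold: float) -> List[int]:
    """Sample offsets where a quiet window directly follows a louder one.

    Assumption: a drop below `threshold` marks a candidate scene transition.
    """
    rms = window_rms(samples, window)
    return [i * window for i in range(1, len(rms))
            if rms[i] < threshold <= rms[i - 1]]
```

Dividing a returned offset by the audio sample rate gives the transition time in seconds, which can then be mapped onto the video track.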

3. The non-transitory computer-readable medium of claim 1, wherein the metadata file comprises a plurality of metadata image files.

4. The non-transitory computer-readable medium of claim 3, wherein generating one or more up-sampled scene files further comprises identifying one or more characters in the frame image of the plurality of frame images, based at least in part on implementation of a facial recognition algorithm including deep learning features in a neural network and the plurality of metadata image files.

5. The non-transitory computer-readable medium of claim 4, wherein the one or more sub-regions are contained within a foreground region identified at least in part by recognizing one or more faces in the frame image, the one or more faces corresponding to the one or more characters.

6. The non-transitory computer-readable medium of claim 1, wherein generating an up-sampled background region comprises: up-sampling the background region at least in part by using the second GAN, the second GAN being trained at least in part using one or more pairs of images at different pixel-resolutions, the second GAN receiving a set of frame images in a continuity window corresponding to a first number of frame images preceding the frame image in the sequence and a second number of frame images following the frame image in the sequence, the second GAN minimizing transient pixel up-sampling artifacts in the continuity window at least in part by applying an attention layer to the continuity window.
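The continuity window described here is a span of neighboring frames around the frame being up-sampled. A small Python sketch of selecting that span follows; it assumes the window includes the current frame and is clipped at the start and end of the sequence, neither of which the claim specifies. The function name is hypothetical.

```python
from typing import List


def continuity_window(index: int, before: int, after: int,
                      total: int) -> List[int]:
    """Frame indices from `before` frames earlier to `after` frames later.

    Assumptions: the current frame is included, and the window is clipped
    to the valid range [0, total).
    """
    start = max(0, index - before)
    stop = min(total, index + after + 1)
    return list(range(start, stop))
```

The frames at these indices would be stacked and passed to the network, whose attention layer can then weigh each neighbor when reconstructing the current frame.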

7. The non-transitory computer-readable medium of claim 5, wherein generating one or more up-sampled sub-regions comprises generating an up-sampled foreground region using the foreground region at least in part by using the first GAN, the first GAN being trained at least in part using one or more pairs of images at different pixel-resolutions, the first GAN receiving a set of frame images in a continuity window corresponding to a first number of frame images preceding the frame image in the sequence and a second number of frame images following the frame image in the sequence, the first GAN minimizing pixel artifacts in the foreground region at least in part by applying an attention layer to the continuity window.

8. A system, comprising: a memory configured to store computer-executable instructions; and one or more processors in communication with the memory, and configured to execute the computer-executable instructions to at least: receive a title of a media file; determine, based at least in part on the title, that an up-sampled media file is unavailable in a media database, the media database storing the media file; receive a metadata file associated with the title from a metadata database different from the media database; receive the media file from the media database; partition the media file into one or more scene files, each scene file comprising a plurality of frame images in a sequence; generate one or more up-sampled scene files, each corresponding to a scene file of the one or more scene files, the one or more up-sampled scene files generated by: identifying one or more sub-regions of a frame image of the plurality of frame images using the metadata file; generating one or more up-sampled sub-regions at least in part by up-sampling the one or more sub-regions of the frame image using a first Generative Adversarial Network (GAN); defining a background region, the background region comprising a portion of the frame image excluding the one or more sub-regions; generating an up-sampled background region at least in part by up-sampling the background region of the frame image using a second GAN different from the first GAN; and generating an up-sampled frame image at least in part by combining the up-sampled background region with the one or more up-sampled sub-regions; and generate the up-sampled media file by combining at least a subset of the one or more up-sampled scene files.

9. The system of claim 8, wherein partitioning the media file into one or more scene files comprises: generating an audio file and a video file from the media file; determining one or more scene transitions in the audio file at least in part by identifying one or more quiet segments and one or more loud segments; and generating one or more scene files at least in part by partitioning the video file according to the one or more scene transitions in the audio file.
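Once transition times have been found in the audio track, the video frames can be cut at the corresponding positions. The Python sketch below shows one plausible mapping, assuming a constant frame rate and transitions expressed in seconds; `split_frames` and its parameters are hypothetical names.

```python
from typing import List, Sequence


def split_frames(frames: Sequence, transition_seconds: List[float],
                 fps: float) -> List[list]:
    """Partition a frame sequence into scenes at audio-derived transition times.

    Assumption: constant frame rate, so a time t maps to frame round(t * fps).
    """
    cuts = sorted({round(t * fps) for t in transition_seconds})
    # Ignore cuts that fall outside the sequence or at its very start.
    cuts = [c for c in cuts if 0 < c < len(frames)]
    bounds = [0] + cuts + [len(frames)]
    return [list(frames[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Each returned list plays the role of one scene file: a run of consecutive frame images in sequence.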

10. The system of claim 8, wherein the metadata file comprises a plurality of metadata image files, such that the system processes the plurality of metadata image files to identify one or more characters in the one or more scene files.

11. The system of claim 10, wherein the system identifies one or more characters at least in part by implementing a facial recognition algorithm including deep learning features in a neural network, to recognize characters in the one or more scene files at least in part by matching a facial region detected in a scene file of the one or more scene files with one or more of the plurality of metadata image files.
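Matching a detected facial region against metadata images is commonly done by comparing feature vectors. The Python sketch below assumes that some neural network (not shown) has already turned the detected face and each metadata image into an embedding, and that cosine similarity with a fixed cutoff decides the match. The claim does not name a similarity measure, so all of this is illustrative, as are the function names.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_character(face: List[float], references: Dict[str, List[float]],
                    min_similarity: float = 0.8) -> Optional[str]:
    """Return the best-matching character name, or None if below the cutoff.

    `references` maps a character name to the (assumed) embedding of one of
    that character's metadata images.
    """
    best_name, best_score = None, min_similarity
    for name, ref in references.items():
        score = cosine(face, ref)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```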

12. The system of claim 11, wherein the one or more sub-regions are contained within a foreground region identified at least in part by recognizing one or more faces in the frame image, the one or more faces corresponding to the one or more characters.

13. The system of claim 12, wherein generating one or more up-sampled scene files comprises: performing spatio-temporal up-sampling on the background region; and performing face-respecting spatio-temporal up-sampling on the foreground region; wherein face-respecting super-resolution up-sampling on the foreground region is based at least in part on using a generative adversarial network trained at least in part using one or more pairs of images at different pixel-resolutions, wherein the generative adversarial network receives a set of frame images in a continuity window corresponding to a first number of frame images preceding the frame image in the sequence and a second number of frame images following the frame image in the sequence, and wherein the generative adversarial network minimizes pixel artifacts in the foreground region at least in part by applying an attention layer to the continuity window.

14. The system of claim 13, wherein generating one or more up-sampled scene files further comprises implementing a discriminator model in the generative adversarial network, trained to minimize a loss function with respect to both the spatio-temporal up-sampling of the background region and the face-respecting spatio-temporal up-sampling of the foreground region.

15. The system of claim 13, wherein generating one or more up-sampled scene files further comprises generating an up-sampled frame image by at least: applying a first weighting factor to an up-sampled foreground region; applying a second weighting factor to an up-sampled background region; and combining the up-sampled foreground region and the up-sampled background region.
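The weighted combination in claim 15 can be read in more than one way. The Python sketch below takes one interpretation: a soft mask gives each pixel's foreground membership in [0, 1], and the two weighting factors scale the foreground and background contributions before a normalized blend. The mask, the normalization, and the name `blend_regions` are assumptions for illustration only.

```python
from typing import List

Image = List[List[float]]


def blend_regions(fg: Image, bg: Image, mask: Image,
                  w_fg: float, w_bg: float) -> Image:
    """Per-pixel weighted average of up-sampled foreground and background.

    mask[r][c] is the assumed foreground membership of a pixel (1.0 = fully
    foreground). Falls back to the background pixel if both weights vanish.
    """
    out = []
    for fg_row, bg_row, m_row in zip(fg, bg, mask):
        row = []
        for f, b, m in zip(fg_row, bg_row, m_row):
            a = w_fg * m           # effective foreground weight
            c = w_bg * (1.0 - m)   # effective background weight
            total = a + c
            row.append((a * f + c * b) / total if total > 0 else b)
        out.append(row)
    return out
```

With a hard 0/1 mask this reduces to pasting the foreground over the background; the weights only matter where the mask is fractional, for example along a feathered face boundary.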

16. The system of claim 8, wherein the first GAN is trained at least in part using one or more pairs of images at different pixel-resolutions.
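The claim does not say how the image pairs at different pixel-resolutions are obtained. A common approach in super-resolution work is to down-sample a high-resolution image to create its low-resolution partner. The Python sketch below does this with simple average pooling, as an assumption rather than the patent's stated method; the function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]


def average_pool(img: Image, factor: int) -> Image:
    """Down-sample by averaging each non-overlapping factor x factor block."""
    height, width = len(img) // factor, len(img[0]) // factor
    return [[sum(img[r * factor + dr][c * factor + dc]
                 for dr in range(factor) for dc in range(factor)) / factor ** 2
             for c in range(width)]
            for r in range(height)]


def make_training_pair(high_res: Image, factor: int) -> Tuple[Image, Image]:
    """Return (low_res, high_res): the network input and its target."""
    return average_pool(high_res, factor), high_res
```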

17. The system of claim 8, wherein the first GAN minimizes pixel artifacts in the one or more up-sampled sub-regions at least in part by applying an attention layer to a continuity window, the continuity window corresponding to a first number of frame images preceding the frame image in the sequence and a second number of frame images following the frame image in the sequence.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06T G06V H04N

Patent Metadata

Filing Date: December 12, 2019

Publication Date: January 26, 2021
