Provided is a method of processing a video. The method includes obtaining first feature data from a first frame, obtaining one or more second feature data from one or more second frames, obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information, obtaining fifth feature data, based on the first feature data and the one or more fourth feature data, and generating a third frame based on the fifth feature data.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method of processing a video, the method comprising: obtaining first feature data from a first frame; obtaining one or more second feature data from one or more second frames; obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs comprises the first frame and a corresponding second frame among the one or more second frames; obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs comprises the first feature data and corresponding second feature data among the one or more second feature data; obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information; obtaining fifth feature data, based on the first feature data and the one or more fourth feature data; and generating a third frame based on the fifth feature data.
2. The method of claim 1, wherein the obtaining of the one or more third feature data by performing the first feature processing respectively on the one or more feature pairs based on the one or more pieces of bi-directional motion information comprises: warping the one or more second feature data based on the one or more pieces of bi-directional motion information; converting the first feature data and the warped one or more second feature data into patch embeddings, wherein the patch embeddings comprise first patch embeddings into which the first feature data is converted and second patch embeddings into which the warped one or more second feature data are converted, and the first patch embeddings and the second patch embeddings each comprise a plurality of patches of a predefined size; performing attention on the patch embeddings; and obtaining the one or more third feature data based on a result of the attention.
3. The method of claim 2, wherein the performing of the attention on the patch embeddings comprises: obtaining a query based on the first patch embeddings; obtaining a key and a value based on the second patch embeddings; calculating a weight based on the query and the key; and applying the weight to the value.
4. The method of claim 3, wherein the performing of the attention on the patch embeddings comprises: performing first attention on a patch-by-patch basis on a plurality of patches included in the patch embeddings; and performing second attention on a pixel-by-pixel basis on the plurality of patches included in the patch embeddings.
5. The method of claim 1, wherein the obtaining of the one or more fourth feature data by performing the second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information comprises: obtaining one or more transformation parameters based on the one or more pieces of bi-directional motion information; and obtaining the one or more fourth feature data through a predefined transformation operation on the one or more third feature data based on the one or more transformation parameters.
6. The method of claim 5, wherein the one or more transformation parameters comprise a scale factor and a bias, and the predefined transformation operation comprises multiplying the one or more third feature data by the scale factor and adding the bias to a result of the multiplying.
7. The method of claim 1, further comprising: obtaining sixth feature data by performing third feature processing on the first feature data; and obtaining the fifth feature data, based on the one or more fourth feature data and the sixth feature data.
8. A non-transitory computer-readable recording medium storing one or more instructions which, when executed by at least one processor, cause an electronic device to perform operations comprising: obtaining first feature data from a first frame; obtaining one or more second feature data from one or more second frames; obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs comprises the first frame and a corresponding second frame among the one or more second frames; obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs comprises the first feature data and corresponding second feature data among the one or more second feature data; obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information; obtaining fifth feature data, based on the first feature data and the one or more fourth feature data; and generating a third frame based on the fifth feature data.
9. An electronic device comprising: at least one processor; and memory storing one or more instructions, wherein the one or more instructions, when executed by the at least one processor, cause the electronic device to perform operations comprising: obtaining first feature data from a first frame, obtaining one or more second feature data from one or more second frames, obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs comprises the first frame and a corresponding second frame among the one or more second frames, obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs comprises the first feature data and corresponding second feature data among the one or more second feature data, obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information, obtaining fifth feature data, based on the first feature data and the one or more fourth feature data, and generating a third frame based on the fifth feature data.
10. The electronic device of claim 9, wherein the obtaining of the one or more third feature data by performing the first feature processing respectively on the one or more feature pairs based on the one or more pieces of bi-directional motion information comprises: warping the one or more second feature data based on the one or more pieces of bi-directional motion information; converting the first feature data and the warped one or more second feature data into patch embeddings, wherein the patch embeddings comprise first patch embeddings into which the first feature data is converted and second patch embeddings into which the warped one or more second feature data are converted, and the first patch embeddings and the second patch embeddings each comprise a plurality of patches of a predefined size; performing attention on the patch embeddings; and obtaining the one or more third feature data based on a result of the attention.
11. The electronic device of claim 10, wherein the performing of the attention on the patch embeddings comprises: obtaining a query based on the first patch embeddings; obtaining a key and a value based on the second patch embeddings; calculating a weight based on the query and the key; and applying the weight to the value.
12. The electronic device of claim 11, wherein the performing of the attention on the patch embeddings comprises: performing first attention on a patch-by-patch basis on a plurality of patches included in the patch embeddings; and performing second attention on a pixel-by-pixel basis on the plurality of patches included in the patch embeddings.
13. The electronic device of claim 9, wherein the obtaining of the one or more fourth feature data by performing the second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information comprises: obtaining one or more transformation parameters based on the one or more pieces of bi-directional motion information; and obtaining the one or more fourth feature data through a predefined transformation operation on the one or more third feature data based on the one or more transformation parameters.
14. The electronic device of claim 13, wherein the one or more transformation parameters comprise a scale factor and a bias, and the predefined transformation operation comprises multiplying the one or more third feature data by the scale factor and adding the bias to a result of the multiplying.
15. The electronic device of claim 9, wherein the operations further comprise: obtaining sixth feature data by performing third feature processing on the first feature data; and obtaining the fifth feature data, based on the one or more fourth feature data and the sixth feature data.
Complete technical specification and implementation details from the patent document.
This application is a continuation of International Application No. PCT/KR2024/004154 designating the United States, filed on Apr. 1, 2024, in the Korean Intellectual Property Receiving Office and claiming priority to Korean Patent Application No. 10-2023-0045038, filed on Apr. 5, 2023, in the Korean Intellectual Property Office, and Korean Patent Application No. 10-2024-0041986, filed on Mar. 27, 2024, in the Korean Intellectual Property Office, the disclosures of each of which are incorporated by reference herein in their entireties.
The present disclosure relates to a method and apparatus for processing a video by using an artificial neural network.
As data traffic has increased exponentially with the development of computer technology, artificial intelligence (AI) technology has become an important trend driving future innovations. Because AI technology simulates human thinking, it is infinitely applicable to virtually all industries. Representative examples of AI technology include pattern recognition, machine learning, expert systems, artificial neural networks, natural language processing, etc.
Artificial neural networks model the characteristics of human biological nerve cells by using mathematical expressions, and use algorithms that mimic human learning abilities. Through these algorithms, the artificial neural networks are able to generate mapping between input data and output data, and the ability to generate such mapping may be referred to as the learning capability of an artificial neural network. Furthermore, neural networks have a generalization ability to generate, based on training results, correct output data with respect to input data that was not used during training.
An artificial neural network may be used for video processing. In particular, the artificial neural network may be used to remove noise or artifacts from a video or to increase the resolution of the video. Each frame that constitutes a video may contain information that appears repeatedly (e.g., objects, lines, or edges that are identical or similar in size, shape, and/or structure), and such repetitive information may be put to good use when processing the video. In addition, because adjacent frames in a video contain information about changes over time, common repetitive information may appear in the adjacent frames. Therefore, there is a need for a method capable of effectively and efficiently utilizing common repetitive information in adjacent frames during video processing.
According to an aspect of the disclosure, there is provided a method of processing a video, the method including: obtaining first feature data from a first frame; obtaining one or more second feature data from one or more second frames; obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs includes the first frame and a corresponding second frame among the one or more second frames; obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs includes the first feature data and corresponding second feature data among the one or more second feature data; obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information; obtaining fifth feature data, based on the first feature data and the one or more fourth feature data; and generating a third frame based on the fifth feature data.
The obtaining of the one or more third feature data by performing the first feature processing respectively on the one or more feature pairs based on the one or more pieces of bi-directional motion information may include: warping the one or more second feature data based on the one or more pieces of bi-directional motion information; converting the first feature data and the warped one or more second feature data into patch embeddings, wherein the patch embeddings may include first patch embeddings into which the first feature data is converted and second patch embeddings into which the warped one or more second feature data are converted, and the first patch embeddings and the second patch embeddings each may include a plurality of patches of a predefined size; performing attention on the patch embeddings; and obtaining the one or more third feature data based on a result of the attention.
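As a non-limiting illustration, the conversion of feature data into a plurality of patches of a predefined size may be sketched as follows. The sketch assumes a single-channel two-dimensional feature map stored as nested lists and non-overlapping patches, neither of which is specified above; the function name to_patches is hypothetical, and the warping step and any learned embedding projection are omitted.

```python
def to_patches(feature, patch_size):
    """Split a 2D feature map (list of rows) into flattened, non-overlapping patches.

    Assumption: the height and width are multiples of patch_size.
    """
    height, width = len(feature), len(feature[0])
    if height % patch_size or width % patch_size:
        raise ValueError("feature size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window in row-major order.
            patch = [feature[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


# A 4x4 feature map with values 0..15, split into four 2x2 patches.
feature = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = to_patches(feature, 2)
```

In this sketch, the same function would be applied to the first feature data and to each warped second feature data so that both sets of patches share the same predefined size.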
The performing of the attention on the patch embeddings may include: obtaining a query based on the first patch embeddings; obtaining a key and a value based on the second patch embeddings; calculating a weight based on the query and the key; and applying the weight to the value.
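As a non-limiting illustration, the query, key, value, and weight operations described above may be sketched as follows. The sketch assumes scaled dot-product scoring followed by a softmax as the way the weight is calculated, which is a common choice but is not stated above; it also uses the patch embeddings directly as the query, key, and value instead of applying learned projections. The names softmax and cross_attention are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """queries come from the first patch embeddings; keys and values
    come from the second patch embeddings (assumed to be plain lists)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Weight based on the query and the key (assumed: scaled dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Apply the weight to the value: weighted sum over all values.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Two identical keys receive equal weight, so the result is the mean value.
attended = cross_attention([[1.0, 0.0]],
                           [[1.0, 0.0], [1.0, 0.0]],
                           [[2.0, 0.0], [4.0, 0.0]])
```

Because the query is derived from the first frame and the key and value from a second frame, each output row is a mixture of second-frame information selected by its similarity to the first frame.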
The performing of the attention on the patch embeddings may include: performing first attention on a patch-by-patch basis on a plurality of patches included in the patch embeddings; and performing second attention on a pixel-by-pixel basis on the plurality of patches included in the patch embeddings.
The obtaining of the one or more fourth feature data by performing the second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information may include: obtaining one or more transformation parameters based on the one or more pieces of bi-directional motion information; and obtaining the one or more fourth feature data through a predefined transformation operation on the one or more third feature data based on the one or more transformation parameters.
The one or more transformation parameters may include a scale factor and a bias, and the predefined transformation operation may include multiplying the one or more third feature data by the scale factor and adding the bias to a result of the multiplying.
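As a non-limiting illustration, the multiply-then-add transformation described above may be sketched as follows. The sketch assumes element-wise operation on flat lists with a scale factor and a bias of the same length as the feature data; the parameter values are hard-coded placeholders, whereas in the disclosure they are obtained based on the bi-directional motion information. The function name modulate is hypothetical.

```python
def modulate(third_feature, scale, bias):
    """Multiply each feature element by its scale factor, then add its bias."""
    if not (len(third_feature) == len(scale) == len(bias)):
        raise ValueError("feature, scale, and bias must have equal length")
    return [f * s + b for f, s, b in zip(third_feature, scale, bias)]


# Placeholder parameters; in practice these would be derived from motion.
fourth_feature = modulate([1.0, 2.0, 3.0],
                          scale=[2.0, 0.5, 1.0],
                          bias=[0.0, 1.0, -1.0])
```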
The method may further include: obtaining sixth feature data by performing third feature processing on the first feature data; and obtaining the fifth feature data, based on the one or more fourth feature data and the sixth feature data.
According to an aspect of the disclosure, there is provided a non-transitory computer-readable recording medium storing one or more instructions which, when executed by at least one processor, cause an electronic device to perform operations including: obtaining first feature data from a first frame; obtaining one or more second feature data from one or more second frames; obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs includes the first frame and a corresponding second frame among the one or more second frames; obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs includes the first feature data and corresponding second feature data among the one or more second feature data; obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information; obtaining fifth feature data, based on the first feature data and the one or more fourth feature data; and generating a third frame based on the fifth feature data.
According to an aspect of the disclosure, there is provided an electronic device including: at least one processor; and memory storing one or more instructions, wherein the one or more instructions, when executed by the at least one processor, cause the electronic device to perform operations including: obtaining first feature data from a first frame, obtaining one or more second feature data from one or more second frames, obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs includes the first frame and a corresponding second frame among the one or more second frames, obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs includes the first feature data and corresponding second feature data among the one or more second feature data, obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information, obtaining fifth feature data, based on the first feature data and the one or more fourth feature data, and generating a third frame based on the fifth feature data.
The obtaining of the one or more third feature data by performing the first feature processing respectively on the one or more feature pairs based on the one or more pieces of bi-directional motion information may include: warping the one or more second feature data based on the one or more pieces of bi-directional motion information; converting the first feature data and the warped one or more second feature data into patch embeddings, wherein the patch embeddings may include first patch embeddings into which the first feature data is converted and second patch embeddings into which the warped one or more second feature data are converted, and the first patch embeddings and the second patch embeddings each may include a plurality of patches of a predefined size; performing attention on the patch embeddings; and obtaining the one or more third feature data based on a result of the attention.
The performing of the attention on the patch embeddings may include: obtaining a query based on the first patch embeddings; obtaining a key and a value based on the second patch embeddings; calculating a weight based on the query and the key; and applying the weight to the value.
The performing of the attention on the patch embeddings may include: performing first attention on a patch-by-patch basis on a plurality of patches included in the patch embeddings; and performing second attention on a pixel-by-pixel basis on the plurality of patches included in the patch embeddings.
The obtaining of the one or more fourth feature data by performing the second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information may include: obtaining one or more transformation parameters based on the one or more pieces of bi-directional motion information; and obtaining the one or more fourth feature data through a predefined transformation operation on the one or more third feature data based on the one or more transformation parameters.
The one or more transformation parameters may include a scale factor and a bias, and the predefined transformation operation may include multiplying the one or more third feature data by the scale factor and adding the bias to a result of the multiplying.
The operations may further include: obtaining sixth feature data by performing third feature processing on the first feature data; and obtaining the fifth feature data, based on the one or more fourth feature data and the sixth feature data.
Throughout the present disclosure, the expression “at least one of a, b or c” indicates only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or variations thereof.
The terms used in the present disclosure are selected from general terms currently widely used in the art by taking into account functions described in an embodiment, but may vary according to an intention of a technician engaged in the art, precedent cases, advent of new technologies, etc. Furthermore, some particular terms may be arbitrarily selected by the applicant, and in this case, the meaning of the selected terms will be described in detail in the relevant description of the disclosure. Thus, the terms used herein should be defined not by simple appellations thereof but based on the meaning of the terms together with the overall description of the present disclosure.
Although the terms, such as “first”, “second”, etc., may be used herein to describe various elements or components, these elements or components should not be limited by the terms. The terms are only used to distinguish one element or component from another element or component. For example, as used herein, a first element or component may be termed a second element or component without departing from the scope of an embodiment, and similarly, a second element or component may be termed a first element or component.
Furthermore, when a component is referred to as being “connected” or “coupled” to another component, it should be understood that the component may be directly connected or coupled to the other component, but may also be connected or coupled to the other component via another intervening component therebetween. On the other hand, when a component is referred to as being “directly connected” or “directly coupled” to another component, it should be understood that there is no other intervening component therebetween.
Unless the context clearly indicates otherwise, the singular forms “a,” “an,” and “the” are to be understood to include a plurality of referents. Thus, for example, reference to “a component surface” may also include reference to one or more of such surfaces.
Singular expressions used herein are intended to include plural expressions as well unless the context clearly indicates otherwise. Terms used herein, including technical or scientific terms, are intended to have the same meaning as commonly understood by one of ordinary skill in the art described herein.
It will be further understood that the terms “comprises” and/or “includes” when used in the present disclosure, specify the presence of stated features, numbers, steps, operations, elements, components, or combinations thereof described herein, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
Furthermore, in the present disclosure, for an element referred to as a ‘unit,’ a ‘module,’ or the like, two or more elements may be combined into a single element, or a single element may be divided into two or more elements according to subdivided functions. Furthermore, each element to be described below may further perform, in addition to its main functions, some or all of functions performed by another element, and some of the main functions of each element may be performed entirely by another element.
All functions or operations described herein may be processed by a single processor or a combination of processors. The processor or combination of processors is circuitry that performs processing, and may include circuitry such as an application processor (AP), a communication processor (CP), a graphics processing unit (GPU), a neural processing unit (NPU), a microprocessor unit (MPU), a system on chip (SoC), an integrated chip (IC), and the like.
In the present disclosure, functions related to artificial intelligence (AI) are performed via a processor and a memory. The processor may consist of one or a plurality of processors. In this case, the one or plurality of processors may be a general-purpose processor such as a central processing unit (CPU), an AP, a digital signal processor (DSP), etc., a dedicated graphics processor such as a GPU, a vision processing unit (VPU), etc., or a dedicated AI processor such as an NPU. The one or plurality of processors may process input data according to predefined operation rules or an AI model stored in the memory. Alternatively, when the one or plurality of processors are a dedicated AI processor, the dedicated AI processor may be designed with a hardware structure specialized for processing a particular AI model.
The predefined operation rules or AI model are created via a training process. In this case, the creation via the training process means that the predefined operation rules or AI model set to perform desired characteristics (or purposes) are created by training a base AI model based on a large number of training data via a learning algorithm. The training process may be performed on an apparatus itself on which AI is performed according to the present disclosure, or via a separate server and/or system. Examples of a learning algorithm may include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
An AI model may consist of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values and may perform neural network computations via calculations between a result of computations in a previous layer and the plurality of weight values. The plurality of weight values assigned to each of the plurality of neural network layers may be optimized by a result of training the AI model. For example, the plurality of weight values may be updated to reduce or minimize a loss or cost value obtained in the AI model during a training process. An artificial neural network may include a deep neural network (DNN), and may be, for example, but is not limited to, a convolutional neural network (CNN), a DNN, a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent DNN (BRDNN), or a deep Q-network (DQN).
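As a non-limiting illustration, updating a weight value to reduce a loss value during training may be sketched as follows. The sketch assumes a model with a single weight, a mean squared error loss, and plain gradient descent; none of these choices is stated above, and the names mean_squared_error and train_weight are hypothetical.

```python
def mean_squared_error(weight, inputs, targets):
    """Loss of the one-weight model y = weight * x."""
    return sum((weight * x - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)


def train_weight(inputs, targets, weight=0.0, learning_rate=0.1, steps=100):
    """Repeatedly move the weight against the gradient of the loss."""
    count = len(inputs)
    for _ in range(steps):
        # Derivative of the mean squared error with respect to the weight.
        gradient = sum(2 * (weight * x - y) * x
                       for x, y in zip(inputs, targets)) / count
        weight -= learning_rate * gradient
    return weight


inputs, targets = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
initial_loss = mean_squared_error(0.0, inputs, targets)
trained_weight = train_weight(inputs, targets)
final_loss = mean_squared_error(trained_weight, inputs, targets)
```

In this toy example the targets are exactly twice the inputs, so the trained weight approaches 2 and the loss approaches 0.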
In the present disclosure, a machine-readable storage medium may be provided in the form of a non-transitory storage medium. In this regard, the term ‘non-transitory storage medium’ only means that the storage medium does not include a signal (e.g., an electromagnetic wave) and is a tangible device, and the term does not differentiate between where data is semi-permanently stored in the storage medium and where the data is temporarily stored in the storage medium. For example, the ‘non-transitory storage medium’ may include a buffer for temporarily storing data.
In the present disclosure, it should be understood that blocks in each flowchart and combinations of flowcharts may be performed by one or more computer programs including computer-executable instructions. The one or more computer programs may be all stored in a single memory, or may be partitioned and stored in a number of different memories.
According to an embodiment, methods according to the present disclosure may be included in a computer program product when provided. The computer program product may be traded, as a product, between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc ROM (CD-ROM)) or distributed (e.g., downloaded or uploaded) on-line via an application store or directly between two user devices (e.g., smartphones). For online distribution, at least a part of the computer program product (e.g., a downloadable app) may be at least transiently stored or temporarily generated in a machine-readable storage medium such as a memory of a server of a manufacturer, a server of an application store, or a relay server.
An embodiment of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings so that the one or more embodiments may be easily implemented by one of ordinary skill in the art of the present disclosure. However, the present disclosure may be implemented in different forms and should not be construed as being limited to an embodiment set forth herein.
FIG. 1 illustrates a video processing network 100 according to an embodiment of the present disclosure.
Referring to FIG. 1, to process a video, the video processing network 100 may take a first frame 10 and one or more second frames 20 as input, and output a third frame 30. Processing of the video may include, for example, but is not limited to, frame interpolation that generates and inserts a new frame between existing frames, denoising that removes noise such as blur, or super-resolution that converts a low-resolution (e.g., 1920×1080) video to a high-resolution (e.g., 3840×2160) video.
In an embodiment, the first frame 10 is a target frame to be processed. For example, the first frame 10 may be an image containing noise or artifacts, a low-resolution image, or a low-quality image.
In an embodiment, the one or more second frames 20 may be a reference frame used to process the first frame 10. For example, a reference frame may be referred to as an adjacent frame, a surrounding frame, a nearby frame, a neighboring frame, or a close frame. The one or more second frames 20 may include one or more frames that are consecutive to the first frame 10, but do not necessarily refer only to frames that are consecutive to the first frame 10. For example, the one or more second frames 20 may be one or more frames included in the same scene as the first frame 10. In this case, whether the first frame 10 and the one or more second frames 20 are included in the same scene may be identified based on meta information of the video.
In the present disclosure, for convenience of description, as illustrated in FIG. 1, an example is described in which the first frame 10 is a t-th frame I_t, and the one or more second frames 20 are a (t−2)-th frame I_{t−2}, a (t−1)-th frame I_{t−1}, a (t+1)-th frame I_{t+1}, and a (t+2)-th frame I_{t+2}, but the number of the second frames 20 and the order of the first frame 10 and the one or more second frames 20 are not limited thereto. For example, the video processing network 100 may utilize the (t−1)-th frame and the (t+2)-th frame as the second frames 20.
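As a non-limiting illustration, selecting the second frames relative to a t-th first frame may be sketched as follows. The default offsets mirror the t−2, t−1, t+1, and t+2 example above; dropping indices that fall outside the video is an assumption, because boundary handling is not described here. The function name select_reference_indices is hypothetical.

```python
def select_reference_indices(t, num_frames, offsets=(-2, -1, 1, 2)):
    """Return indices of the second (reference) frames for the first frame at index t.

    Assumption: offsets that point before the first or after the last
    frame of the video are simply skipped.
    """
    return [t + offset for offset in offsets if 0 <= t + offset < num_frames]


# Four neighbors in the middle of a 10-frame video.
middle = select_reference_indices(5, 10)
# Only the later neighbors exist at the very first frame.
start = select_reference_indices(0, 10)
# A non-symmetric choice, such as the (t-1)-th and (t+2)-th frames.
custom = select_reference_indices(5, 10, offsets=(-1, 2))
```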
In an embodiment, the third frame 30 is a frame generated as a result of the first frame 10 being processed by the video processing network 100. For example, the third frame 30 may be an image obtained by removing noise or artifacts from the first frame 10, an image with a higher resolution than the first frame 10, or an image with a higher quality than the first frame 10.
In an embodiment, the video processing network 100 may include a feature extraction network 110, a motion estimation module 120, a feature processing network 130, and a video reconstruction network 140. However, the components of the video processing network 100 are not limited thereto, and the video processing network 100 may not include some of the components illustrated in FIG. 1, and may further include other components in addition to those illustrated in FIG. 1.
In an embodiment, the feature extraction network 110 may extract (e.g., obtain) feature data from the input frames. For example, the feature extraction network 110 may extract first feature data x_t from the first frame 10, and one or more second feature data x_{t−2}, x_{t−1}, x_{t+1}, and x_{t+2} from the one or more second frames 20. The feature extraction network 110 may include one or more convolutional neural networks (CNNs).
In an embodiment, the feature extraction network 110 may be a single network that extracts the first feature data and the one or more second feature data respectively from the first frame 10 and the one or more second frames 20.
In an embodiment, the feature extraction network 110 may include a plurality of feature extraction subnetworks corresponding to the first frame 10 and the one or more second frames 20. For example, in the example of FIG. 1, the feature extraction network 110 may include a first feature extraction subnetwork for extracting the first feature data x_t from the first frame I_t, a second feature extraction subnetwork for extracting the second feature data x_{t−2} from the second frame I_{t−2}, a third feature extraction subnetwork for extracting the second feature data x_{t−1} from the second frame I_{t−1}, a fourth feature extraction subnetwork for extracting the second feature data x_{t+1} from the second frame I_{t+1}, and a fifth feature extraction subnetwork for extracting the second feature data x_{t+2} from the second frame I_{t+2}. In this case, each of the plurality of feature extraction subnetworks may include one or more CNNs.
In an embodiment, the motion estimation module 120 may perform bi-directional motion estimation on an input frame pair. In this case, the frame pair may include the first frame 10 and a corresponding second frame 20 among the one or more second frames 20. For example, in the example of FIG. 1, the motion estimation module 120 may perform bi-directional motion estimation on each of a first frame pair including a t-th frame and a t−2-th frame, a second frame pair including a t-th frame and a t−1-th frame, a third frame pair including a t-th frame and a t+1-th frame, and a fourth frame pair including a t-th frame and a t+2-th frame. In the present disclosure, a result of the bi-directional motion estimation may be referred to as bi-directional motion information.
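For illustration only, the pairing of the first frame with each second frame may be sketched as follows. The function name build_frame_pairs, the use of integer frame indices, and the rule of skipping indices outside the video are assumptions made for this sketch and are not part of the disclosure.

```python
def build_frame_pairs(t, offsets, num_frames):
    """Return (first_index, second_index) pairs for the target frame t."""
    pairs = []
    for i in offsets:
        s = t + i
        # Skip the target frame itself and indices that fall outside the video.
        if i != 0 and 0 <= s < num_frames:
            pairs.append((t, s))
    return pairs


# Example of FIG. 1: the t-th frame paired with frames t-2, t-1, t+1, and t+2.
pairs = build_frame_pairs(t=2, offsets=(-2, -1, 1, 2), num_frames=5)
```

Each resulting pair would then be passed to a motion estimator that returns motion information in both directions for that pair.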
In an embodiment, bi-directional motion information may include first motion information ƒ_{t→(t+i)} from the first frame 10 to the second frame 20, and second motion information ƒ_{(t+i)→t} from the second frame 20 to the first frame 10. In the example of FIG. 1, i is −2, −1, +1, and +2. For example, the first motion information and the second motion information may include at least one of a vector, coordinates, or a transformation matrix representing a change in position of an object between the first frame 10 and the second frame 20.
In an embodiment, the motion estimation module 120 may be implemented as an artificial neural network model (e.g., a model that predicts an optical flow by using a CNN, a transformer-based model, or the like), or it may be implemented as an algorithm that does not use an artificial neural network (e.g., the Lucas-Kanade method, the Horn-Schunck algorithm, polynomial fitting, the Kalman filter, etc.).
In an embodiment, the feature processing network 130 may process input feature data based on bi-directional motion information. The feature data input to the feature processing network 130 may include the first feature data and the one or more second feature data.
In an embodiment, the feature processing network 130 may obtain third feature data by performing feature processing, including attention, on a feature pair, based on bi-directional motion information. Here, a feature pair may include the first feature data and corresponding second feature data among the one or more second feature data. For example, in the example of FIG. 1, the feature processing network 130 may perform feature processing, including attention, on a first feature pair including the first feature data x_t and the second feature data x_{t−2}, a second feature pair including the first feature data x_t and the second feature data x_{t−1}, a third feature pair including the first feature data x_t and the second feature data x_{t+1}, and a fourth feature pair including the first feature data x_t and the second feature data x_{t+2}.
In an embodiment, attention may include operations for obtaining projected feature data, referred to as query, key, and value, based on input feature data, calculating a weight corresponding to a correlation between the query and the key, and applying the weight to the value.
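As a non-limiting illustration, the query, key, and value operations described above may be sketched in plain Python. The sketch assumes that the query is projected from the first feature data and that the key and value are projected from the second feature data, that each row of a matrix is one token, and that the correlation is computed as a matrix product followed by a row-wise softmax. The helper names and the identity projection matrices are hypothetical.

```python
import math


def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)  # subtract the row maximum for numerical stability
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def attention(first, second, w_q, w_k, w_v):
    q = matmul(first, w_q)   # query from the first feature data
    k = matmul(second, w_k)  # key from the second feature data
    v = matmul(second, w_v)  # value from the second feature data
    weights = softmax_rows(matmul(q, transpose(k)))  # correlation of query and key
    return matmul(weights, v), weights               # weights applied to the value


identity = [[1.0, 0.0], [0.0, 1.0]]
first = [[1.0, 0.0], [0.0, 1.0]]
second = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(first, second, identity, identity, identity)
```

In this example the first query token correlates equally with the first and third key tokens, so those two tokens receive the same weight, which is larger than the weight of the second key token.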
In an embodiment, the feature processing network 130 may obtain fourth feature data by performing feature processing, including feature transformation, on third feature data, based on bi-directional motion information. In this case, the feature transformation may be understood as a process of modulating the corresponding third feature data based on the consistency of the bi-directional motion information between two frames. For example, the feature transformation may include obtaining a scale factor and a bias based on the bi-directional motion information, and transforming the third feature data based on the obtained scale factor and bias.
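For illustration only, the scale-and-bias modulation may be sketched as follows. The disclosure does not state how the scale factor and the bias are derived at this point, so the sketch substitutes a hand-written rule: a forward-backward consistency error is computed per position, the scale decays exponentially with that error, and the bias is proportional to it. The function names, the parameters alpha and beta, and the simplification that the backward flow is already sampled at the matching position are all assumptions.

```python
import math


def consistency_error(flow_fwd, flow_bwd):
    # For consistent motion the forward and backward vectors cancel out.
    # Simplification: flow_bwd is assumed to be already aligned with flow_fwd.
    return [math.hypot(fx + bx, fy + by)
            for (fx, fy), (bx, by) in zip(flow_fwd, flow_bwd)]


def scale_and_bias(errors, alpha=1.0, beta=0.0):
    # Hypothetical rule: less consistent positions receive a smaller scale.
    scale = [math.exp(-alpha * e) for e in errors]
    bias = [beta * e for e in errors]
    return scale, bias


def transform(features, scale, bias):
    # Modulate each element of the third feature data.
    return [s * x + b for x, s, b in zip(features, scale, bias)]


flow_fwd = [(1.0, 0.0), (2.0, 0.0)]
flow_bwd = [(-1.0, 0.0), (0.0, 0.0)]  # second position is inconsistent
errors = consistency_error(flow_fwd, flow_bwd)
scale, bias = scale_and_bias(errors)
fourth = transform([3.0, 3.0], scale, bias)
```

In a learned implementation the scale factor and the bias would more likely be predicted by trainable layers; the fixed rule here only shows the direction of the modulation.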
Examples of the structure and operation of the feature processing network 130 are described below with reference to FIGS. 2 to 10.
In an embodiment, the video reconstruction network 140 may generate the third frame 30 based on the input feature data. For example, the video reconstruction network 140 may convert feature data in a feature domain into an image in an image domain. The feature data input to the video reconstruction network 140 may be the fifth feature data x̂_t obtained by the feature processing network 130.
The video reconstruction network 140 may be implemented in various ways depending on the application. For example, for super-resolution, the video reconstruction network 140 may include one or more up-convolution layers and one or more pixel shuffle layers. For example, in the case of denoising, the video reconstruction network 140 may include a network based on a multi-layer perceptron (MLP).
FIG. 2 illustrates the feature processing network 130 according to an embodiment of the present disclosure.
Referring to FIG. 2, in an embodiment, the feature processing network 130 may include one or more multi-frame matching modules 211, 212, 213, 214, and 220, one or more feature transformation modules 231, 232, 233, and 234, a concatenation layer 240, a convolution layer 250, and a summation layer 260. Although FIG. 2 shows that the feature processing network 130 includes the five multi-frame matching modules 211, 212, 213, 214, and 220 and the four feature transformation modules 231, 232, 233, and 234, this is merely an example, and the number of multi-frame matching modules and the number of feature transformation modules are not limited thereto.
In an embodiment, the one or more multi-frame matching modules 211, 212, 213, 214, and 220 may include first multi-frame matching modules 211, 212, 213, and 214 and a second multi-frame matching module 220. As illustrated in FIG. 2, the first multi-frame matching modules 211, 212, 213, and 214 may perform feature processing respectively on the feature pairs described above, and the second multi-frame matching module 220 may perform feature processing on the first feature data. Therefore, in an embodiment, the feature processing operations of the second multi-frame matching module 220 may be the same as those of each of the first multi-frame matching modules 211, 212, 213, and 214, except for the operation related to a feature pair (i.e., the operation of a feature warping module 310 of FIG. 3). However, the present disclosure is not limited thereto, and for example, the second multi-frame matching module 220 may be implemented in the same manner as the first multi-frame matching modules 211, 212, 213, and 214 to perform feature processing on a feature pair including two first feature data. Therefore, in the following description, the first multi-frame matching modules 211, 212, 213, and 214 and the second multi-frame matching module 220 are described without distinguishing between them.
In an embodiment, each of the multi-frame matching modules 211, 212, 213, 214, and 220 may obtain third feature data by performing feature processing, including attention, on a feature pair, based on bi-directional motion information. Here, the feature pair may include the first feature data and corresponding second feature data among one or more second feature data. For example, a feature pair may be a concatenated matrix of the first feature data and the second feature data.
For example, the multi-frame matching module 211 may obtain third feature data x̂_{t←(t−2)} by performing feature processing on a first feature pair including the first feature data x_t and second feature data x_{t−2} based on bi-directional motion information ƒ_{(t−2)→t} and ƒ_{t→(t−2)} between a t-th frame and a t−2-th frame.
For example, the multi-frame matching module 212 may obtain third feature data x̂_{t←(t−1)} by performing feature processing on a second feature pair including the first feature data x_t and second feature data x_{t−1} based on bi-directional motion information ƒ_{(t−1)→t} and ƒ_{t→(t−1)} between the t-th frame and a t−1-th frame.
For example, the multi-frame matching module 213 may obtain third feature data x̂_{t←(t+1)} by performing feature processing on a third feature pair including the first feature data x_t and second feature data x_{t+1} based on bi-directional motion information ƒ_{(t+1)→t} and ƒ_{t→(t+1)} between the t-th frame and a t+1-th frame.
For example, the multi-frame matching module 214 may obtain third feature data x̂_{t←(t+2)} by performing feature processing on a fourth feature pair including the first feature data x_t and second feature data x_{t+2} based on bi-directional motion information ƒ_{(t+2)→t} and ƒ_{t→(t+2)} between the t-th frame and a t+2-th frame.
In addition, as described above, the multi-frame matching module 220 may obtain third feature data x̂_{t←t} by performing feature processing on the first feature data x_t, or obtain the third feature data x̂_{t←t} by performing feature processing on a feature pair including two first feature data x_t.
Examples of a detailed structure and operation of the multi-frame matching modules 211, 212, 213, 214, and 220 are described below with reference to FIGS. 3 to 8B.
In an embodiment, each of the feature transformation modules 231, 232, 233, and 234 may obtain fourth feature data by applying feature transformation to third feature data based on bi-directional motion information. Examples of the structure and operation of the feature transformation modules 231, 232, 233, and 234 are described below with reference to FIGS. 10 and 11.
In an embodiment, the concatenation layer 240 may concatenate a plurality of input feature data in a channel direction. For example, the feature data input to the concatenation layer 240 may include feature data (e.g., x̂_{t←t}) output from the second multi-frame matching module 220 and feature data output from the feature transformation modules 231, 232, 233, and 234.
In an embodiment, the convolution layer 250 may perform a convolution operation between input feature data and a kernel included in the convolution layer 250. For example, the feature data input to the convolution layer 250 may be feature data output from the concatenation layer 240. Although FIG. 2 shows that the feature processing network 130 includes one convolution layer 250, the number of convolution layers 250 is not limited thereto, and the feature processing network 130 may include two or more convolution layers.
In an embodiment, the summation layer 260 may obtain fifth feature data x̂_t by performing an element-wise summation operation between output data from the convolution layer 250 and the first feature data. The fifth feature data may be input to the video reconstruction network (140 of FIG. 1).
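As a non-limiting illustration, the concatenation, convolution, and summation steps may be sketched as follows. The sketch assumes that a feature map is stored as a list of channels, each channel being a flat list of pixel values, and that the convolution uses a 1×1 kernel so that it reduces to a per-pixel mix across channels. The kernel values and function names are hypothetical.

```python
def concat_channels(feature_maps):
    # Concatenate feature maps in the channel direction.
    out = []
    for fm in feature_maps:
        out.extend(fm)
    return out


def conv1x1(fm, kernel):
    # kernel[o][c] is the weight from input channel c to output channel o.
    n_pix = len(fm[0])
    return [[sum(kernel[o][c] * fm[c][p] for c in range(len(fm)))
             for p in range(n_pix)]
            for o in range(len(kernel))]


def add(a, b):
    # Element-wise summation of two feature maps of the same shape.
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


x_t = [[1.0, 2.0]]                       # first feature data: 1 channel, 2 pixels
third_tt = [[0.5, 0.5]]                  # output of the second matching module
fourth = [[[1.0, 1.0]], [[2.0, 0.0]]]    # outputs of two transformation modules
stacked = concat_channels([third_tt] + fourth)
fifth = add(conv1x1(stacked, [[1.0, 1.0, 1.0]]), x_t)
```

The final addition acts as a residual connection, so the convolution only has to produce a correction to the first feature data.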
FIG. 3 illustrates a multi-frame matching module 300 according to an embodiment of the present disclosure.
The multi-frame matching module 300 illustrated in FIG. 3 may be any one of the multi-frame matching modules 211, 212, 213, 214, and 220 illustrated in FIG. 2.
Referring to FIG. 3, in an embodiment, the multi-frame matching module 300 may obtain third feature data by performing feature processing on a feature pair including first feature data x_t and second feature data x_{t+i} based on bi-directional motion information ƒ_{(t+i)→t} and ƒ_{t→(t+i)} between the t-th frame and the t+i-th frame. In the examples of FIGS. 1 to 3, i is −2, −1, +1, and +2.
In an embodiment, the multi-frame matching module 300 may include the feature warping module 310, a patch embedding module 320, one or more feature matching modules 330, a normalization layer 340, and a patch un-embedding module 350.
In an embodiment, the feature warping module 310 may warp input second feature data based on bi-directional motion information. Here, the warping may include, but is not limited to, similarity transformation, Euclidean transformation, affine transformation, projective transformation, etc. For example, the feature warping module 310 may warp the second feature data so that a position of an element included in the second feature data corresponds to a position of an element in the first feature data, based on coordinates representing a change in a position of an object between the first frame 10 and the second frame 20, which are included in the bi-directional motion information.
Because the first feature data and the second feature data are respectively extracted from different frames, the position of the same object in the first feature data and the second feature data may vary due to movement of the object or movement of a camera. The feature warping module 310 may correct such positional transformation of the object. Because the feature warping module 310 warps the second feature data, subsequent operations (e.g., attention) may be robust to changes between frames, such as movement of the object.
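For illustration only, the alignment performed by feature warping may be sketched with nearest-neighbor sampling. The sketch assumes a single-channel feature map stored as a list of rows, a per-position motion vector (dx, dy) pointing from the first frame to the second frame, and zero filling for positions whose source falls outside the map. The disclosure also allows other warps (e.g., affine or projective), which this sketch does not cover.

```python
def warp(second, flow):
    """Align `second` to the first frame's grid using per-position motion."""
    h, w = len(second), len(second[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Position in the second feature data that corresponds to (x, y).
            sx, sy = x + round(dx), y + round(dy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = second[sy][sx]
    return out


# An object at column 0 in the first frame has moved to column 1 in the second.
second = [[0.0, 5.0, 0.0],
          [0.0, 0.0, 0.0]]
flow = [[(1.0, 0.0)] * 3 for _ in range(2)]
warped = warp(second, flow)
```

After warping, the object's response sits at column 0 again, so a later attention step compares features at matching positions.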
In an embodiment, the operation of the feature warping module 310 may be performed in an attention module (530 of FIG. 5) rather than in the multi-frame matching module 300. Because feature warping is intended to ensure that an attention operation is robust to frame transformation, the feature warping may be performed at any stage, for example, before extracting keys and values from the second feature data.
Moreover, as described above, the second multi-frame matching module (220 of FIG. 2) may be configured not to include the feature warping module 310.
In an embodiment, the patch embedding module 320 may convert input feature data into patch embeddings. For example, the patch embedding module 320 may split the input feature data into a plurality of patches of a predefined size, and obtain patch embeddings by applying a linear transformation to the patches. The feature data input to the patch embedding module 320 may include the first feature data and the warped second feature data. In an embodiment, the patch embeddings may include first patch embeddings obtained from the first feature data and second patch embeddings obtained from the second feature data.
In an embodiment, the one or more feature matching modules 330 may perform feature processing on the input feature data. For example, the data input to the one or more feature matching modules 330 may be patch embeddings obtained by the patch embedding module 320. An example of the structure and operation of the feature matching module 330 is described below with reference to FIG. 4.
In an embodiment, the normalization layer 340 may normalize the input data. For example, the data input to the normalization layer 340 may be normalized so that the sum of the data is 1. However, the normalization method performed by the normalization layer 340 is not limited thereto. The data input to the normalization layer 340 may be data obtained as a result of the feature processing by the one or more feature matching modules 330.
In an embodiment, the patch un-embedding module 350 may obtain third feature data x̂_{t←(t+i)} by un-embedding the input data. The operation of the patch un-embedding module 350 may be understood as the inverse of the operation of the patch embedding module 320. The data input to the patch un-embedding module 350 may be data obtained as a result of the normalization operation of the normalization layer 340.
FIG. 4 illustrates a feature matching module 400 according to an embodiment of the present disclosure.
The feature matching module 400 illustrated in FIG. 4 may be any one of the one or more feature matching modules 330 illustrated in FIG. 3.
Referring to FIG. 4, in an embodiment, the feature matching module 400 may include one or more transformer layers 410, a convolution layer 420, and a summation layer 430.
In an embodiment, the one or more transformer layers 410 may each perform feature processing on first input data F_{in,1} and second input data F_{in,2}. The first input data F_{in,1} input to the one or more transformer layers 410 may be data obtained as a result of the first feature data x_t being processed by the feature warping module 310 and the patch embedding module 320. The second input data F_{in,2} input to the one or more transformer layers 410 may be data obtained as a result of the second feature data x_{t+i} being processed by the feature warping module 310 and the patch embedding module 320. An example of the structure and operation of the one or more transformer layers 410 is described below with reference to FIG. 5.
In an embodiment, the convolution layer 420 may perform a convolution operation between input feature data and a kernel included in the convolution layer 420. For example, the feature data input to the convolution layer 420 may be feature data output from the one or more transformer layers 410. Although FIG. 4 shows that the feature matching module 400 includes one convolution layer 420, the number of convolution layers 420 is not limited thereto, and the feature matching module 400 may include two or more convolution layers.
In an embodiment, the summation layer 430 may obtain first output data F_{out,1} by performing an element-wise summation operation between output data from the convolution layer 420 and the first input data F_{in,1}. The first output data may be input as first input data F_{in,1} for a next feature matching module.
In an embodiment, the feature matching module 400 may output the first output data F_{out,1} and second output data F_{out,2}. The first output data F_{out,1} may be data obtained by the summation layer 430, and the second output data F_{out,2} may be the same as the second input data F_{in,2}.
FIG. 5 illustrates a transformer layer 500 according to an embodiment of the present disclosure.
The transformer layer 500 illustrated in FIG. 5 may be any one of the one or more transformer layers 410 illustrated in FIG. 4.
Referring to FIG. 5, in an embodiment, the transformer layer 500 may include a first normalization layer 510, a patch splitting module 520, an attention module 530, a patch merging module 540, a first summation layer 550, a second normalization layer 560, an MLP 570, and a second summation layer 580.
In an embodiment, the first normalization layer 510 may normalize first input data F_{in,3} and second input data F_{in,4} input to the transformer layer 500. The first input data F_{in,3} input to the transformer layer 500 may be the first input data F_{in,1} input to the one or more transformer layers 410. The second input data F_{in,4} input to the transformer layer 500 may be the second input data F_{in,2} input to the one or more transformer layers 410. The normalized first input data and second input data may be input to the patch splitting module 520.
For example, the first normalization layer 510 may normalize the first input data so that a sum of the first input data input to the transformer layer 500 is 1. The first normalization layer 510 may normalize the second input data so that a sum of the second input data input to the transformer layer 500 is 1. However, the normalization method performed by the first normalization layer 510 is not limited thereto.
In an embodiment, the patch splitting module 520 may split each of the first input data and the second input data input thereto into a plurality of patches of a predefined size. The first input data input to the patch splitting module 520 may be the first input data F_{in,3} normalized by the first normalization layer 510. The second input data input to the patch splitting module 520 may be the second input data F_{in,4} normalized by the first normalization layer 510. In an embodiment, the size of a patch may be determined by considering hardware performance, memory size, etc. For example, the larger the patch, the greater the computational cost. In an embodiment, the shape of a patch may be square (i.e., M×M), but is not limited thereto, and may also be rectangular (i.e., M×N).
In an embodiment, the attention module 530 may perform attention on first input data and second input data that are input thereto. The first input data input to the attention module 530 may be patches obtained by normalizing and then splitting the first input data F_{in,3} input to the transformer layer 500. The second input data input to the attention module 530 may be patches obtained by normalizing and then splitting the second input data F_{in,4} input to the transformer layer 500.
In an embodiment, the attention module 530 may be one of a first attention module (600 of FIG. 6) or a second attention module (700 of FIG. 7). The first attention module 600 may perform attention on each of the patches obtained via splitting by the patch splitting module 520. The second attention module 700 may perform attention on each of the pixels included in each patch. Examples of the structures and operations of the first attention module 600 and the second attention module 700 are described below with reference to FIGS. 6 and 7, respectively.
In an embodiment, the patch merging module 540 may merge first input data input thereto and merge second input data. The operation of the patch merging module 540 may be understood as the inverse of the operation of the patch splitting module 520. The first input data input to the patch merging module 540 may be a result of the attention performed by the attention module 530. The second input data input to the patch merging module 540 may be the second input data input to the attention module 530.
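As a non-limiting illustration, patch splitting and its inverse, patch merging, may be sketched as follows. The sketch assumes a single-channel feature map stored as a list of rows, non-overlapping M×N patches taken in row-major order, and a map whose height and width are exact multiples of the patch size. The function names are hypothetical.

```python
def split_patches(fm, m, n):
    """Split an H x W map into non-overlapping m x n patches (row-major order)."""
    h, w = len(fm), len(fm[0])
    assert h % m == 0 and w % n == 0, "map size must be a multiple of patch size"
    patches = []
    for top in range(0, h, m):
        for left in range(0, w, n):
            patches.append([row[left:left + n] for row in fm[top:top + m]])
    return patches


def merge_patches(patches, h, w):
    """Inverse of split_patches: place each patch back at its original position."""
    m, n = len(patches[0]), len(patches[0][0])
    per_row = w // n
    out = [[None] * w for _ in range(h)]
    for idx, patch in enumerate(patches):
        top, left = (idx // per_row) * m, (idx % per_row) * n
        for dy in range(m):
            out[top + dy][left:left + n] = patch[dy]
    return out


fm = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 map, values 0..15
patches = split_patches(fm, 2, 2)
restored = merge_patches(patches, 4, 4)
```

Merging the patches returned by the split reproduces the original map, which is the sense in which the two operations are inverses.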
In an embodiment, the first summation layer 550 may perform an element-wise summation operation between first output data from the patch merging module 540 and first input data input to the transformer layer 500. Output data from the first summation layer 550 may be input to the second normalization layer 560.
In an embodiment, the second normalization layer 560 may normalize the output data from the first summation layer 550. For example, the second normalization layer 560 may normalize the output data from the first summation layer 550 so that a sum of the output data from the first summation layer 550 is 1. However, the normalization method performed by the second normalization layer 560 is not limited thereto.
In an embodiment, the MLP 570 may perform feature processing on the data normalized by the second normalization layer 560. The MLP 570 may include one or more fully connected layers and one or more activation functions. For example, the MLP 570 may include a first linear layer, a Gaussian Error Linear Unit (GELU) function, and a second linear layer, wherein the first linear layer and the second linear layer may each perform a multiplication operation between input data input thereto and a weight matrix. The types of activation functions included in the MLP 570 are not limited to those described above, and various activation functions such as Sigmoid, Rectified Linear Unit (ReLU), Tanh, Leaky ReLU, Parametric ReLU (PReLU), and Exponential Linear Unit (ELU) may be used.
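For illustration only, an MLP made of a first linear layer, a GELU function, and a second linear layer may be sketched as follows. The sketch uses the exact GELU definition based on the error function, omits bias terms because only a weight-matrix multiplication is described above, and uses hypothetical weight values and function names.

```python
import math


def gelu(x):
    # Exact GELU: x multiplied by the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight):
    # weight[o][i] maps input element i to output element o (no bias term).
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def mlp(vec, w1, w2):
    hidden = [gelu(h) for h in linear(vec, w1)]  # first linear layer + GELU
    return linear(hidden, w2)                    # second linear layer


w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 2 inputs -> 3 hidden units
w2 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]     # 3 hidden units -> 2 outputs
out = mlp([1.0, -1.0], w1, w2)
```

The second output equals gelu(1) − gelu(−1), which is exactly 1 for the error-function form of GELU and serves as a quick sanity check of the activation.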
In an embodiment, the second summation layer 580 may perform an element-wise summation operation between the output data from the first summation layer 550 and output data from the MLP 570. The output data from the second summation layer 580 may be first output data F_{out,3} from the transformer layer 500.
In an embodiment, the transformer layer 500 may output second output data F_{out,4}. The second output data from the transformer layer 500 may be second output data from the patch merging module 540. Because the second input data input to the transformer layer 500 is normalized by the first normalization layer 510 and then processed by the patch splitting module 520 and the patch merging module 540, which are inverse operations of each other, the second output data from the transformer layer 500 may be the same as the normalized second input data input to the transformer layer 500.
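As a non-limiting illustration, the data flow through the transformer layer may be sketched with the sub-blocks passed in as callables. The sketch assumes flat lists as data, folds patch splitting, attention, and patch merging into a single `attend` callable, and uses trivial stand-in functions for the example; none of these stand-ins represent the actual sub-blocks.

```python
def transformer_layer(f_in3, f_in4, norm1, attend, norm2, mlp):
    n3, n4 = norm1(f_in3), norm1(f_in4)            # first normalization layer
    attended = attend(n3, n4)                      # split -> attention -> merge
    s1 = [a + b for a, b in zip(attended, f_in3)]  # first summation (residual)
    m = mlp(norm2(s1))                             # second normalization + MLP
    f_out3 = [a + b for a, b in zip(s1, m)]        # second summation (residual)
    f_out4 = n4                                    # second path: normalized only
    return f_out3, f_out4


def sum_to_one(v):
    # Stand-in normalization: scale so that the elements sum to 1.
    total = sum(v)
    return [x / total for x in v]


f_out3, f_out4 = transformer_layer(
    [1.0, 3.0], [2.0, 2.0],
    norm1=sum_to_one,
    attend=lambda a, b: [(x + y) / 2 for x, y in zip(a, b)],  # stand-in attention
    norm2=sum_to_one,
    mlp=lambda v: [2.0 * x for x in v],                       # stand-in MLP
)
```

Only the first input passes through the residual connections; the second output is the second input after normalization, which matches the description above.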
FIG. 6 illustrates a first attention module 600 according to an embodiment of the present disclosure.
Referring to FIG. 6, the first attention module 600 may perform attention on a patch-by-patch basis on a plurality of patches 602 into which normalized feature data 601 is split by the patch splitting module 520. For example, first input data F_{in,5} input to the first attention module 600 may be patches obtained by splitting the first feature data x_t, and second input data F_{in,6} input to the first attention module 600 may be patches obtained by splitting the second feature data x_{t+i}. In the present disclosure, the first attention module 600 may be referred to as an inter-patch attention module.
Because the patch 602 includes a plurality of pixels, the patch 602 may include structural information such as lines, corners, patterns, etc., which are difficult to identify in individual pixels. When attention is performed on a patch-by-patch basis, a correlation between patches including similar structures may be calculated as a large value, and thus, structural information as well as color information (e.g., pixel intensity) of each pixel may be utilized.
In an embodiment, the first attention module 600 may include a feature warping module 610, a first linear layer 620, a second linear layer 630, a third linear layer 640, a transpose function 650, a first multiplication layer 660, a softmax function 670, a second multiplication layer 680, and a fourth linear layer 690.
In an embodiment, the feature warping module 610 may warp the second input data F_{in,6} input to the first attention module 600, based on bi-directional motion information. Here, the warping may include, but is not limited to, similarity transformation, Euclidean transformation, affine transformation, projective transformation, etc. The feature warping module 610 may perform an operation similar to that of the feature warping module 310 of FIG. 3.
As described above, the feature warping may be performed at any stage before extracting keys and values from the second feature data, so when the multi-frame matching module 300 includes the feature warping module 310, the first attention module 600 may not include the feature warping module 610.
In an embodiment, the first linear layer 620 may obtain a query Q corresponding to the first input data by performing a multiplication operation between the first input data input to the first attention module 600 and a weight matrix included in the first linear layer 620. In an embodiment, the first linear layer 620 may include a 1×1 convolution layer.
In an embodiment, the second linear layer 630 may obtain a key K corresponding to the second input data by performing a multiplication operation between the second input data input to the first attention module 600 (or the second input data warped by the feature warping module 610) and a weight matrix included in the second linear layer 630. In an embodiment, the second linear layer 630 may include a 1×1 convolution layer.
In an embodiment, the third linear layer 640 may obtain a value V corresponding to the second input data by performing a multiplication operation between the second input data input to the first attention module 600 (or the second input data warped by the feature warping module 610) and a weight matrix included in the third linear layer 640. In an embodiment, the third linear layer 640 may include a 1×1 convolution layer.
In an embodiment, the transpose function 650 may transpose the key to generate a transposed key K^T.
In an embodiment, the first multiplication layer 660 may perform an element-wise multiplication operation between the query and the transposed key.
In an embodiment, the second multiplication layer 680 may perform an element-wise multiplication operation between the value and an output of the first multiplication layer 660 to which the softmax function 670 is applied. The output of the first multiplication layer 660 to which the softmax function 670 is applied may be understood as a weight representing a correlation between the query and the key, and the operation of the second multiplication layer 680 may be understood as weighted summation of the weight and the value.
In an embodiment, the fourth linear layer 690 may perform a multiplication operation between an output of the second multiplication layer 680 and a weight matrix included in the fourth linear layer 690. In an embodiment, the fourth linear layer 690 may include a 1×1 convolution layer.
FIG. 7 illustrates a second attention module 700 according to an embodiment of the present disclosure.
Referring to FIG. 7, the second attention module 700 may perform attention on a pixel-by-pixel basis within each of a plurality of patches 702 into which normalized feature data 701 is split by the patch splitting module 520. For example, first input data F_{in,7} input to the second attention module 700 may be patches obtained by splitting the first feature data x_t, and second input data F_{in,8} input to the second attention module 700 may be patches obtained by splitting the second feature data x_{t+i}. In the present disclosure, the second attention module 700 may be referred to as an intra-patch attention module.
In an embodiment, the second attention module 700 may include a feature warping module 710, a first linear layer 720, a second linear layer 730, a third linear layer 740, a transpose function 750, a first multiplication layer 760, a softmax function 770, a second multiplication layer 780, and a fourth linear layer 790.
In an embodiment, the feature warping module 710 may warp the second input data F_{in,8} input to the second attention module 700, based on bi-directional motion information. Here, the warping may include, but is not limited to, similarity transformation, Euclidean transformation, affine transformation, projective transformation, etc. The feature warping module 710 may perform an operation similar to that of the feature warping module 310 of FIG. 3.
As described above, the feature warping may be performed at any stage before extracting keys and values from the second feature data, so when the multi-frame matching module 300 includes the feature warping module 310, the second attention module 700 may not include the feature warping module 710.
In an embodiment, the first linear layer 720 may obtain a query Q corresponding to the first input data by performing a multiplication operation between the first input data input to the second attention module 700 and a weight matrix included in the first linear layer 720. In an embodiment, the first linear layer 720 may include a 1×1 convolution layer.
In an embodiment, the second linear layer 730 may obtain a key K corresponding to the second input data by performing a multiplication operation between the second input data input to the second attention module 700 (or the second input data warped by the feature warping module 710) and a weight matrix included in the second linear layer 730. In an embodiment, the second linear layer 730 may include a 1×1 convolution layer.
In an embodiment, the third linear layer 740 may obtain a value V corresponding to the second input data by performing a multiplication operation between the second input data input to the second attention module 700 (or the second input data warped by the feature warping module 710) and a weight matrix included in the third linear layer 740. In an embodiment, the third linear layer 740 may include a 1×1 convolution layer.
In an embodiment, the transpose function 750 may transpose the key to generate a transposed key K^T.
In an embodiment, the first multiplication layer 760 may perform an element-wise multiplication operation between the query and the transposed key.
In an embodiment, the second multiplication layer 780 may perform an element-wise multiplication operation between the value and an output of the first multiplication layer 760 to which the softmax function 770 is applied. The output of the first multiplication layer 760 to which the softmax function 770 is applied may be understood as a weight representing a correlation between the query and the key, and the operation of the second multiplication layer 780 may be understood as weighted summation of the weight and the value.
In an embodiment, the fourth linear layer 790 may perform a multiplication operation between an output of the second multiplication layer 780 and a weight matrix included in the fourth linear layer 790. In an embodiment, the fourth linear layer 790 may include a 1×1 convolution layer.
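For illustration only, the flow from query, key, and value through the softmax function to the fourth linear layer may be sketched as follows. The sketch treats each 1×1 convolution as a per-token matrix product over channels and uses matrix products for the two multiplication layers, as in conventional dot-product attention; that reading, the absence of a scaling factor, and the names `cross_attention`, `w_q`, `w_k`, `w_v`, and `w_o` are assumptions of the example rather than details stated in the disclosure.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f_query, f_ref, w_q, w_k, w_v, w_o):
    """Attention between two sets of tokens.

    f_query: (N, C) tokens from the first input data.
    f_ref:   (N, C) tokens from the (optionally warped) second input data.
    w_q, w_k, w_v, w_o: (C, C) weight matrices (hypothetical names).
    """
    q = f_query @ w_q            # first linear layer: query
    k = f_ref @ w_k              # second linear layer: key
    v = f_ref @ w_v              # third linear layer: value
    scores = q @ k.T             # first multiplication layer with transposed key
    weights = softmax(scores)    # correlation between query and key
    out = weights @ v            # second multiplication layer: weighted sum
    return out @ w_o, weights    # fourth linear layer
```

For intra-patch attention, the N tokens would be the pixels of a single patch, and the function would be applied to each patch in turn.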
FIGS. 8A and 8B illustrate examples of arrangements of the first attention modules 600 and the second attention modules 700, according to an embodiment of the present disclosure.
FIGS. 8A and 8B are each a simplified diagram illustrating first transformer layers 810 including the first attention modules 600 and second transformer layers 820 including the second attention modules 700. The first transformer layers 810 and the second transformer layers 820 may correspond to the one or more transformer layers 410 illustrated in FIG. 4. In FIGS. 8A and 8B, the four transformer layers 810 and 820 are illustrated for convenience of description, but the number of transformer layers is not limited thereto.
In an embodiment, the number of the first transformer layers 810 and the second transformer layers 820 and the order of arrangement thereof may be implemented in various ways. For example, as illustrated in FIG. 8A, the transformer layers may be arranged in the stated order of the two first transformer layers 810 and the two second transformer layers 820. Alternatively, as illustrated in FIG. 8B, the transformer layers may be arranged so that the first transformer layers 810 and the second transformer layers 820 alternate.
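The two arrangements may be illustrated with a short sketch that only builds the order of the layers. The labels `"first"` and `"second"` and the function `arrange_layers` are hypothetical names introduced for the example.

```python
def arrange_layers(num_first, num_second, alternate=False):
    """Return the order of transformer layers as a list of labels.

    "first"  stands for a layer containing a first attention module.
    "second" stands for a layer containing a second attention module.
    """
    if not alternate:
        # Sequential arrangement: all first layers, then all second layers.
        return ["first"] * num_first + ["second"] * num_second
    order = []
    # Alternating arrangement: interleave until both counts are exhausted.
    for i in range(max(num_first, num_second)):
        if i < num_first:
            order.append("first")
        if i < num_second:
            order.append("second")
    return order
```

With two layers of each kind, `arrange_layers(2, 2)` gives the sequential order and `arrange_layers(2, 2, alternate=True)` gives the alternating order.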
The first transformer layer 810 that performs patch-wise attention may update the first feature data x_t by utilizing a wider range of information than the second transformer layer 820. On the other hand, the second transformer layer 820 that performs pixel-wise attention may update the first feature data x_t by utilizing more detailed information than the first transformer layer 810. Therefore, either the first transformer layer 810 or the second transformer layer 820 may be selected depending on the range of information to be utilized to update the first feature data x_t. For example, the first transformer layer 810 or the second transformer layer 820 may be selected based on the performance of hardware (e.g., a graphics processing unit (GPU), memory, etc.) performing video processing, a designer's experience, etc.
FIG. 9 illustrates a feature transformation module 900 according to an embodiment of the present disclosure.
The feature transformation module 900 illustrated in FIG. 9 may be any one of the feature transformation modules 231, 232, 233, and 234 illustrated in FIG. 2.
Referring to FIG. 9, in an embodiment, the feature transformation module 900 may obtain fourth feature data ƒ_{out,7} by applying feature transformation to third feature data x̂_{t←(t+2)} based on bi-directional motion information ƒ_{t→(t+i)} and ƒ_{(t+i)→t}.
In an embodiment, the feature transformation module 900 may include a parameter extraction network 910 and a transformation module 920.
In an embodiment, the parameter extraction network 910 may obtain transformation parameters based on the bi-directional motion information ƒ_{t→(t+i)} and ƒ_{(t+i)→t}. The transformation parameters are parameters associated with a transformation operation of the transformation module 920, and may include one or more parameters. The parameter extraction network 910 may include one or more artificial neural networks.
In an embodiment, the transformation module 920 may transform the third feature data through a predefined transformation operation based on a transformation parameter. For example, the predefined transformation operation may include, but is not limited to, similarity transformation, Euclidean transformation, affine transformation, projective transformation, etc. For example, the predefined transformation operation may include an operation of multiplying the transformation parameter by the third feature data, or an operation of adding the transformation parameter to the third feature data.
The operation of the feature transformation module 900 described above may be understood as a process of determining the usability of information extracted from a reference frame (e.g., a second frame) by utilizing bi-directional motion information. The bi-directional motion information may be interpreted as a result of matching two frames (e.g., a first frame and a second frame) by taking into account a surrounding area for each pixel. For example, when there is a difference in information between the two frames, such as when an object that exists in the first frame does not exist in the second frame, bi-directional motion information may be calculated inconsistently. The inconsistent bi-directional motion information may imply that the information that differs between the two frames is either very important information or very unimportant information for video processing. The feature transformation module 900 may generate weights (e.g., transformation parameters) to be applied to feature data (e.g., second feature data) extracted from the reference frame via training, and the usability of information extracted from the reference frame may be determined according to the generated weights.
An example of implementation of the feature transformation module 900 is described with reference to FIG. 10.
FIG. 10 illustrates a feature transformation module 1000 according to an embodiment of the present disclosure.
The feature transformation module 1000 illustrated in FIG. 10 is an example of implementation of the feature transformation module 900 illustrated in FIG. 9. A parameter extraction network 1010 and a transformation module 1020 illustrated in FIG. 10 may respectively correspond to the parameter extraction network 910 and the transformation module 920 illustrated in FIG. 9.
Referring to FIG. 10, the parameter extraction network 1010 may include a condition network 1011, first and second convolution layers 1012 and 1013, and third and fourth convolution layers 1014 and 1015.
In an embodiment, the condition network 1011 may obtain intermediate features by performing feature processing on bi-directional motion information. The condition network 1011 may include one or more convolution layers and one or more activation functions. For example, the condition network 1011 may consist of alternating two-dimensional (2D) convolution layers and LeakyReLU activation functions. However, the configuration of the condition network 1011 is not limited thereto. In the present disclosure, the intermediate features may be referred to as a confidence mask.
In an embodiment, the first and second convolution layers 1012 and 1013 may obtain a scale factor by performing a convolution operation on the confidence mask.
In an embodiment, the third and fourth convolution layers 1014 and 1015 may obtain a bias by performing a convolution operation on the confidence mask.
It is described with reference to FIG. 10 that two pairs of convolution layers are used to obtain the scale factor and the bias, respectively, but the type and number of neural networks are not limited thereto.
In an embodiment, the transformation module 1020 may obtain the fourth feature data by multiplying the third feature data by the scale factor and adding the bias to a result of the multiplication.
In the present disclosure, the operation of the feature transformation module 1000 illustrated in FIG. 10 may be referred to as spatial feature transformation.
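As a minimal sketch of the spatial feature transformation, the following example derives a scale factor and a bias from the bi-directional motion information and applies them to the third feature data. Several simplifications are assumptions of the example and not details of the disclosure: the two motion fields are concatenated along the channel axis, every convolution is a 1×1 convolution, the condition network has a single layer, a LeakyReLU is placed between the two convolutions of each pair, and the names `spatial_feature_transform` and `params` are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

def spatial_feature_transform(x3, flow_fwd, flow_bwd, params):
    """x3: third feature data (C, H, W); flow_fwd, flow_bwd: (2, H, W) each.

    params maps "cond", "scale1", "scale2", "bias1", "bias2" to
    (weight, bias) tuples for the corresponding 1x1 convolutions.
    """
    motion = np.concatenate([flow_fwd, flow_bwd], axis=0)   # (4, H, W)
    # Condition network: produces the confidence mask.
    mask = leaky_relu(conv1x1(motion, *params["cond"]))
    # First and second convolution layers: scale factor.
    scale = conv1x1(leaky_relu(conv1x1(mask, *params["scale1"])), *params["scale2"])
    # Third and fourth convolution layers: bias.
    bias = conv1x1(leaky_relu(conv1x1(mask, *params["bias1"])), *params["bias2"])
    # Transformation module: multiply by the scale factor, then add the bias.
    return x3 * scale + bias
```

Because the scale factor and the bias are computed per pixel, regions where the two motion fields disagree can be attenuated or emphasized independently of the rest of the feature map.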
FIG. 11 is a flowchart of a video processing method 1100 according to an embodiment of the present disclosure. The video processing method 1100 may be performed by an electronic device 1200 illustrated in FIG. 12.
In operation 1110, the electronic device 1200 may extract first feature data from a first frame.
In operation 1120, the electronic device 1200 may extract one or more second feature data from one or more second frames.
Operations 1110 and 1120 may correspond to operations of the feature extraction network 110.
In operation 1130, the electronic device 1200 may obtain one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs. Here, each of the one or more frame pairs may include a first frame and a corresponding second frame among the one or more second frames. Operation 1130 may correspond to an operation of the motion estimation module 120. The bi-directional motion information may include first motion information from the first frame to the second frame, and second motion information from the second frame to the first frame.
In operation 1140, the electronic device 1200 may obtain one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information. Here, each of the one or more feature pairs may include the first feature data and corresponding second feature data among the one or more second feature data. The first feature processing may include attention.
In operation 1150, the electronic device 1200 may obtain one or more fourth feature data by performing second feature processing respectively on the one or more third feature data based on the one or more pieces of bi-directional motion information. The second feature processing may include feature transformation.
In operation 1160, the electronic device 1200 may obtain fifth feature data, based on the first feature data and the one or more fourth feature data.
Operations 1140 to 1160 may correspond to operations of the feature processing network 130.
In operation 1170, the electronic device 1200 may generate a third frame based on the fifth feature data. Operation 1170 may correspond to an operation of the video restoration network 140.
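The order of operations 1110 to 1170 may be summarized by the following sketch, in which each stage is passed in as a callable so that only the data flow is shown. The function `process_video` and the parameter names `extract`, `estimate_motion`, `attend`, `transform`, `fuse`, and `restore` are hypothetical stand-ins for the corresponding networks and modules, not names used in the disclosure.

```python
def process_video(first_frame, second_frames, extract, estimate_motion,
                  attend, transform, fuse, restore):
    """Data flow of the flowchart; every stage is supplied by the caller."""
    # Operation 1110: first feature data from the first frame.
    x_t = extract(first_frame)
    # Operation 1120: second feature data from each second frame.
    x_refs = [extract(frame) for frame in second_frames]
    # Operation 1130: bi-directional motion information per frame pair.
    motions = [estimate_motion(first_frame, frame) for frame in second_frames]
    # Operation 1140: first feature processing (e.g. attention) per feature pair.
    third = [attend(x_t, x_ref, m) for x_ref, m in zip(x_refs, motions)]
    # Operation 1150: second feature processing (e.g. feature transformation).
    fourth = [transform(x3, m) for x3, m in zip(third, motions)]
    # Operation 1160: fifth feature data from the first and fourth feature data.
    fifth = fuse(x_t, fourth)
    # Operation 1170: generate the third frame.
    return restore(fifth)
```

Note that the same piece of motion information is used twice for each frame pair, once in operation 1140 and once in operation 1150.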
FIG. 12 illustrates the electronic device 1200 for processing a video, according to an embodiment of the present disclosure.
The electronic device 1200 illustrated in FIG. 12 may process a video by performing the operations of the video processing network 100 described above. The video to be processed may be a video stored in the electronic device 1200, or a video received by the electronic device 1200 from an external device (e.g., a server of an over-the-top (OTT) service provider that provides video over the Internet, etc.).
Referring to FIG. 12, in an embodiment, the electronic device 1200 may include a processor 1210 and a memory 1220. However, the components of the electronic device 1200 are not limited thereto, and the electronic device 1200 may include more components than those shown in FIG. 12. For example, the electronic device 1200 may further include a communication interface for transmitting and receiving data to and from an external device, and/or a display for displaying a video.
In an embodiment, the processor 1210 is a component that controls a series of processes to cause the electronic device 1200 to operate as described in the present disclosure, and may consist of one or a plurality of processors. The one or plurality of processors included in the processor 1210 may be circuitry, such as an SoC, an IC, etc. The one or plurality of processors included in the processor 1210 may be a general-purpose processor such as a CPU, an MPU, an AP, a DSP, etc., a dedicated graphics processor such as a GPU and a VPU, a dedicated AI processor such as an NPU, or a dedicated communication processor such as a CP. When the one or the plurality of processors included in the processor 1210 is a dedicated AI processor, the dedicated AI processor may be designed with a hardware structure specialized for processing a particular AI model.
In an embodiment, the processor 1210 may write data to the memory 1220 or read data stored in the memory 1220, and in particular, execute a program or at least one instruction stored in the memory 1220 to process data according to predefined operation rules or AI models. Accordingly, the processor 1210 may perform the operations described in the present disclosure, and the operations described in the present disclosure as being performed by the electronic device 1200 may be considered as being performed by the processor 1210 unless otherwise specifically stated.
In an embodiment, the memory 1220 is a component for storing various programs or data, and may include a storage medium, such as read-only memory (ROM), random access memory (RAM), a hard disk, compact disc ROM (CD-ROM), and a digital versatile disc (DVD), or a combination of storage media. The memory 1220 may not exist separately, but may be configured to be included in the processor 1210. The memory 1220 may consist of volatile memory, non-volatile memory, or a combination of volatile memory and non-volatile memory. The memory 1220 may store a program or at least one instruction for performing operations according to embodiments described in the present disclosure. The memory 1220 may provide stored data to the processor 1210 according to a request from the processor 1210.
According to an aspect of the present disclosure, a method of processing a video may include extracting first feature data from a first frame, extracting one or more second feature data from one or more second frames, obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs includes the first frame and a corresponding second frame among the one or more second frames, obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs includes the first feature data and corresponding second feature data among the one or more second feature data, obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information, obtaining fifth feature data, based on the first feature data and the one or more fourth feature data, and generating a third frame based on the fifth feature data.
In an embodiment, the obtaining of the one or more third feature data by performing the first feature processing respectively on the one or more feature pairs based on the one or more pieces of bi-directional motion information may include warping the one or more second feature data based on the one or more pieces of bi-directional motion information, converting the first feature data and the warped one or more second feature data into patch embeddings, wherein the patch embeddings include first patch embeddings into which the first feature data is converted and second patch embeddings into which the warped one or more second feature data are converted, and the first and second patch embeddings each include a plurality of patches of a predefined size, performing attention on the patch embeddings, and obtaining the one or more third feature data based on a result of the attention.
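The conversion of feature data into patches of a predefined size may be illustrated as follows. The sketch assumes non-overlapping square patches, a (C, H, W) layout, and a height and width that are multiples of the patch size; the function name `split_into_patches` is hypothetical, and any learned embedding or normalization step is left out.

```python
import numpy as np

def split_into_patches(feat, patch):
    """Split (C, H, W) feature data into non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch, C): one row of
    pixel tokens per patch, with patches ordered row by row.
    """
    c, h, w = feat.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be multiples of the patch size")
    # (C, nH, p, nW, p): split both spatial axes into (patch index, offset).
    x = feat.reshape(c, h // patch, patch, w // patch, patch)
    # (nH, nW, p, p, C): group the two patch indices, then the two offsets.
    x = x.transpose(1, 3, 2, 4, 0)
    return x.reshape(-1, patch * patch, c)
```

Applying the same function to the first feature data and to the warped second feature data yields patch sets with matching indices, so that a query patch and its corresponding key and value patches cover the same spatial region.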
In an embodiment, the performing of the attention on the patch embeddings may include obtaining a query based on the first patch embeddings, obtaining a key and a value based on the second patch embeddings, calculating a weight based on the query and the key, and applying the weight to the value.
In an embodiment, the performing of the attention on the patch embeddings may include performing first attention on a patch-by-patch basis on a plurality of patches included in the patch embeddings, and performing second attention on a pixel-by-pixel basis on the plurality of patches included in the patch embeddings.
In an embodiment, the obtaining of the one or more fourth feature data by performing the second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information may include obtaining one or more transformation parameters based on the one or more pieces of bi-directional motion information, and obtaining the one or more fourth feature data through a predefined transformation operation on the one or more third feature data based on the one or more transformation parameters.
In an embodiment, the one or more transformation parameters may include a scale factor and a bias.
In an embodiment, the predefined transformation operation may include multiplying the one or more third feature data by the scale factor and adding the bias to a result of the multiplication.
In an embodiment, the obtaining of the fifth feature data based on the first feature data and the one or more fourth feature data may include concatenating the one or more fourth feature data, performing a convolution operation on the concatenated one or more fourth feature data, and adding the first feature data to a result of the convolution operation.
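The concatenation, convolution, and addition described above may be sketched as follows. The sketch assumes channel-wise concatenation and a 1×1 convolution that maps the concatenated channels back to the channel count of the first feature data; the function name `fuse_features` and the parameters `weight` and `bias` are hypothetical.

```python
import numpy as np

def fuse_features(x_t, fourth_list, weight, bias):
    """Obtain fifth feature data from first and fourth feature data.

    x_t:         first feature data, shape (C, H, W).
    fourth_list: list of n fourth feature data, each of shape (C, H, W).
    weight:      (C, n * C) weights of an assumed 1x1 convolution.
    bias:        (C,) bias of that convolution.
    """
    stacked = np.concatenate(fourth_list, axis=0)             # (n * C, H, W)
    mixed = np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]
    # Residual connection: add the first feature data to the convolution result.
    return x_t + mixed
```

The addition keeps the first feature data intact in the output, so that the information aggregated from the second frames acts as a correction to it.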
In an embodiment, the method may further include obtaining sixth feature data by performing third feature processing on the first feature data, and obtaining the fifth feature data, based on the one or more fourth feature data and the sixth feature data.
In an embodiment, the bi-directional motion information may include first motion information from the first frame to the corresponding second frame, and second motion information from the corresponding second frame to the first frame.
According to an aspect of the present disclosure, a computer-readable recording medium stores one or more instructions which, when executed by a computer, cause the computer to perform a method that may include extracting first feature data from a first frame, extracting one or more second feature data from one or more second frames, obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs includes the first frame and a corresponding second frame among the one or more second frames, obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs includes the first feature data and corresponding second feature data among the one or more second feature data, obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information, obtaining fifth feature data, based on the first feature data and the one or more fourth feature data, and generating a third frame based on the fifth feature data.
According to an aspect of the present disclosure, an electronic device 1200 includes at least one processor 1210, and a memory 1220 storing one or more instructions, wherein the at least one processor 1210 is configured to execute the one or more instructions to cause the electronic device 1200 to perform operations that may include extracting first feature data from a first frame, extracting one or more second feature data from one or more second frames, obtaining one or more pieces of bi-directional motion information respectively corresponding to one or more frame pairs, wherein each of the one or more frame pairs includes the first frame and a corresponding second frame among the one or more second frames, obtaining one or more third feature data by performing first feature processing respectively on one or more feature pairs based on the one or more pieces of bi-directional motion information, wherein each of the one or more feature pairs includes the first feature data and corresponding second feature data among the one or more second feature data, obtaining one or more fourth feature data by performing second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information, obtaining fifth feature data, based on the first feature data and the one or more fourth feature data, and generating a third frame based on the fifth feature data.
In an embodiment, the obtaining of the one or more third feature data by performing the first feature processing respectively on the one or more feature pairs based on the one or more pieces of bi-directional motion information may include warping the one or more second feature data based on the one or more pieces of bi-directional motion information, converting the first feature data and the warped one or more second feature data into patch embeddings, wherein the patch embeddings include first patch embeddings into which the first feature data is converted and second patch embeddings into which the warped one or more second feature data are converted, and the first and second patch embeddings each include a plurality of patches of a predefined size, performing attention on the patch embeddings, and obtaining the one or more third feature data based on a result of the attention.
In an embodiment, the performing of the attention on the patch embeddings may include obtaining a query based on the first patch embeddings, obtaining a key and a value based on the second patch embeddings, calculating a weight based on the query and the key, and applying the weight to the value.
In an embodiment, the performing of the attention on the patch embeddings may include performing first attention on a patch-by-patch basis on a plurality of patches included in the patch embeddings, and performing second attention on a pixel-by-pixel basis on the plurality of patches included in the patch embeddings.
In an embodiment, the obtaining of the one or more fourth feature data by performing the second feature processing on the one or more third feature data based on the one or more pieces of bi-directional motion information may include obtaining one or more transformation parameters based on the one or more pieces of bi-directional motion information, and obtaining the one or more fourth feature data through a predefined transformation operation on the one or more third feature data based on the one or more transformation parameters.
In an embodiment, the one or more transformation parameters may include a scale factor and a bias.
In an embodiment, the predefined transformation operation may include multiplying the one or more third feature data by the scale factor and adding the bias to a result of the multiplication.
In an embodiment, the obtaining of the fifth feature data based on the first feature data and the one or more fourth feature data may include concatenating the one or more fourth feature data, performing a convolution operation on the concatenated one or more fourth feature data, and adding the first feature data to a result of the convolution operation.
In an embodiment, the operations may further include obtaining sixth feature data by performing third feature processing on the first feature data, and obtaining the fifth feature data, based on the one or more fourth feature data and the sixth feature data.
In an embodiment, the bi-directional motion information may include first motion information from the first frame to the corresponding second frame, and second motion information from the corresponding second frame to the first frame.
October 3, 2025
January 29, 2026