Machine learning method that learns to convert 2D video to 3D video from a set of training examples. Uses machine learning to perform any or all of the 2D to 3D conversion steps of identifying and locating objects, masking objects, modeling object depth, generating stereoscopic image pairs, and filling gaps created by pixel displacement for depth effects. Training examples comprise inputs and outputs for the conversion steps. The machine learning system generates transformation functions that generate the outputs from the inputs; these functions may then be used on new 2D videos to automate or semi-automate the conversion process. Operator input may be used to augment the results of the machine learning system. Illustrative representations for conversion data in the training examples include object tags to identify objects and locate their features, Bézier curves to mask object regions, and point clouds or geometric shapes to model object depth.
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A machine learning method of converting 2D video to 3D video, comprising: obtaining a training set comprising a plurality of conversions, each conversion comprising a 2D scene comprising one or more 2D frames; a corresponding 3D conversion dataset that describes conversion of said 2D scene to 3D, comprising inputs and outputs for 2D to 3D conversion steps, said 2D to 3D conversion steps comprising obtaining said one or more 2D frames; locating and identifying an object in one or more object frames within said one or more 2D frames, each object frame containing an image of at least a portion of said object; generating an object mask for said object in said one or more object frames, said object mask identifying one or more masked pixels representing said object in said one or more object frames; generating an object depth model that assigns a pixel depth to one or more of said one or more masked pixels; generating a stereoscopic image pair for each of said one or more object frames based on said object depth model, said stereoscopic image pair comprising a left image and a right image; and, generating one or more gap filling pixel values for one or more missing pixels in said left image or in said right image; training a machine learning system on said training set; obtaining a 2D video; applying said machine learning system to said 2D video to automatically perform one or more of said 2D to 3D conversion steps on said 2D video; and, accepting input from an operator to modify or complete one or more of said 2D to 3D conversion steps on said 2D video.
A machine learning method converts 2D video to 3D video. First, the method trains a machine learning system using a training set. This training set contains 2D video scenes paired with corresponding 3D conversion data. The conversion data includes input and output data for steps like object identification and localization within frames, object mask generation (identifying object pixels), object depth modeling (assigning depth to pixels), stereoscopic image pair generation (creating left/right images), and gap filling (generating pixel values for missing pixels). Once trained, the system processes a new 2D video by automatically performing one or more of these 2D to 3D conversion steps. A human operator can then modify or complete any of these steps.
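The stereoscopic pair and gap steps in this claim can be illustrated on a single row of pixels. The following is a minimal sketch, not the patented implementation. It assumes that depth is already expressed as a non-negative integer disparity per pixel, and that nearer pixels shift right in the left image and left in the right image. The names `shift_row` and `stereo_pair` are hypothetical.

```python
from typing import List, Optional, Tuple

Row = List[Optional[int]]  # None marks a missing pixel (a gap)


def shift_row(pixels: List[int], disparity: List[int], direction: int) -> Row:
    """Displace each pixel horizontally by direction * disparity."""
    width = len(pixels)
    shifted: Row = [None] * width
    winner = [-1] * width  # largest disparity written to each target so far
    for x in range(width):
        target = x + direction * disparity[x]
        # Assumption: when two pixels land on the same target, the one
        # with the larger disparity (the nearer one) stays visible.
        if 0 <= target < width and disparity[x] > winner[target]:
            shifted[target] = pixels[x]
            winner[target] = disparity[x]
    return shifted


def stereo_pair(pixels: List[int], disparity: List[int]) -> Tuple[Row, Row]:
    """Return (left, right) rows; gaps are left as None for later filling."""
    return shift_row(pixels, disparity, +1), shift_row(pixels, disparity, -1)


left, right = stereo_pair([10, 20, 30, 40, 50], [0, 0, 1, 0, 0])
print(left)   # [10, 20, None, 30, 50]
print(right)  # [10, 30, None, 40, 50]
```

The `None` entries are the missing pixels that the gap filling step of the claim would then have to supply.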
2. The method of claim 1 wherein said machine learning system performs said generating an object mask for said object in said one or more object frames; and, said corresponding 3D conversion dataset comprises a masking input comprising an identity of said object; and, a location of one or more feature points of said object in said one or more object frames; and, a masking output comprising a path comprising one or more segments, each segment comprising a curve defined by one or more control points, wherein said path is a boundary of said object mask.
In the machine learning method of converting 2D video to 3D video, the machine learning system generates the object mask. The corresponding 3D conversion dataset includes a masking input, consisting of the object's identity and the locations of its feature points in the object frames, and a masking output: a path of one or more segments, each a curve defined by control points, that forms the boundary of the object mask.
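To show how a path of curve segments can act as a mask boundary, here is a minimal sketch that turns such a path into a pixel mask. The claim only requires curves defined by control points; the choice of cubic Bézier segments, the even-odd fill rule, and the names `cubic_bezier`, `flatten_path`, `inside` and `rasterize_mask` are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point, Point, Point]  # four cubic Bezier control points


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    w0, w1, w2, w3 = u * u * u, 3 * u * u * t, 3 * u * t * t, t * t * t
    return (w0 * p0[0] + w1 * p1[0] + w2 * p2[0] + w3 * p3[0],
            w0 * p0[1] + w1 * p1[1] + w2 * p2[1] + w3 * p3[1])


def flatten_path(segments: List[Segment], steps: int = 16) -> List[Point]:
    """Approximate the closed path by a polygon with `steps` points per segment."""
    polygon: List[Point] = []
    for seg in segments:
        for i in range(steps):
            polygon.append(cubic_bezier(*seg, i / steps))
    return polygon


def inside(polygon: List[Point], x: float, y: float) -> bool:
    """Even-odd ray casting test for a point against a closed polygon."""
    hit = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                hit = not hit
    return hit


def rasterize_mask(segments: List[Segment], width: int, height: int) -> List[List[bool]]:
    """Mark every pixel whose centre lies inside the path."""
    polygon = flatten_path(segments)
    return [[inside(polygon, col + 0.5, row + 0.5) for col in range(width)]
            for row in range(height)]
```

Calling `rasterize_mask(path, width, height)` on a closed path yields a grid of booleans in which `True` marks a masked pixel.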
3. The method of claim 1 wherein said machine learning system performs said generating an object depth model; and, said corresponding 3D conversion dataset comprises an object depth model input comprising said object mask; and, an object depth model output comprising one or more regions within said object mask; and, a planar or curved 3D surface associated with each of said one or more regions.
In the machine learning method of converting 2D video to 3D video, the machine learning system generates an object depth model, where the corresponding 3D conversion dataset includes an object mask as input. The depth model output comprises one or more regions within the object mask, each region associated with a planar or curved 3D surface.
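A region-based depth model of this kind can be sketched as follows. The example handles only planar surfaces and assumes each plane is stored as coefficients `(a, b, c)` with depth `a*x + b*y + c` in pixel coordinates; that encoding and the name `region_depth` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Plane = Tuple[float, float, float]  # depth = a * x + b * y + c


def region_depth(region_map: List[List[Optional[int]]],
                 planes: Dict[int, Plane]) -> List[List[Optional[float]]]:
    """Assign a depth to every masked pixel from the plane of its region.

    region_map holds a region id for masked pixels and None elsewhere.
    """
    depth: List[List[Optional[float]]] = []
    for y, row in enumerate(region_map):
        depth_row: List[Optional[float]] = []
        for x, region_id in enumerate(row):
            if region_id is None:
                depth_row.append(None)  # pixel is outside the object mask
            else:
                a, b, c = planes[region_id]
                depth_row.append(a * x + b * y + c)
        depth.append(depth_row)
    return depth


regions = [[None, 0, 0],
           [1, 1, None]]
planes = {0: (0.0, 0.0, 5.0),   # flat region at constant depth 5
          1: (1.0, 0.0, 2.0)}   # region whose depth grows with x
print(region_depth(regions, planes))
# [[None, 5.0, 5.0], [2.0, 3.0, None]]
```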
4. The method of claim 1 wherein said machine learning system performs said generating an object depth model; and, said corresponding 3D conversion dataset comprises an object depth model input comprising said object mask; and, an object depth model output comprising a point cloud of 3D points, each of said 3D points associated with a pixel within said object mask.
In the machine learning method of converting 2D video to 3D video, the machine learning system generates an object depth model, where the corresponding 3D conversion dataset includes the object mask as input. The depth model output is a point cloud of 3D points, each associated with a pixel within the object mask.
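The point cloud form of the depth model can be sketched in a similar way. This minimal example assumes each 3D point records the pixel it belongs to and that the z coordinate is the depth; `CloudPoint` and `depth_from_cloud` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CloudPoint:
    pixel: Tuple[int, int]            # (column, row) of the associated pixel
    xyz: Tuple[float, float, float]   # 3D position; z is taken as depth


def depth_from_cloud(cloud: List[CloudPoint],
                     mask: List[List[bool]]) -> List[List[Optional[float]]]:
    """Build a per-pixel depth map from points tied to masked pixels."""
    depth: List[List[Optional[float]]] = [[None] * len(row) for row in mask]
    for point in cloud:
        col, row = point.pixel
        if mask[row][col]:            # ignore points that fall outside the mask
            depth[row][col] = point.xyz[2]
    return depth


mask = [[False, True], [True, True]]
cloud = [CloudPoint((1, 0), (0.2, -0.1, 3.5)),
         CloudPoint((0, 1), (-0.2, 0.1, 3.0)),
         CloudPoint((0, 0), (0.0, 0.0, 9.0))]  # not in the mask, so skipped
print(depth_from_cloud(cloud, mask))
# [[None, 3.5], [3.0, None]]
```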
5. The method of claim 1 wherein said machine learning system performs said generating one or more gap filling pixel values; said generating one or more gap filling pixel values comprises generating a clean plate frame from one or more of said one or more 2D frames; and, copying pixel values from said clean plate frame to said one or more missing pixels; and, said corresponding 3D conversion dataset comprises a clean plate input comprising one or more of said one or more 2D frames; and, a clean plate output comprising said clean plate frame associated with said one or more 2D frames.
In the machine learning method of converting 2D video to 3D video, the machine learning system generates gap-filling pixel values by creating a clean plate frame from existing 2D frames and copying pixel values from it to fill missing pixels. The corresponding 3D conversion dataset includes the original 2D frames as input, and the clean plate frame as the output, associated with the 2D frames.
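The clean plate idea can be sketched for the simplest case. The example assumes a static camera, so that background hidden by the object in one frame is visible at the same pixel in another frame, and it copies clean plate values at identical coordinates. Both simplifications, and the names `build_clean_plate` and `fill_gaps`, are assumptions and not details taken from the claim.

```python
from typing import List, Optional

Frame = List[List[Optional[int]]]


def build_clean_plate(frames: List[Frame], masks: List[List[List[bool]]]) -> Frame:
    """For each pixel, take the value from the first frame where the object
    mask does not cover it. Pixels covered in every frame stay None."""
    height, width = len(frames[0]), len(frames[0][0])
    plate: Frame = [[None] * width for _ in range(height)]
    for frame, mask in zip(frames, masks):
        for r in range(height):
            for c in range(width):
                if plate[r][c] is None and not mask[r][c]:
                    plate[r][c] = frame[r][c]
    return plate


def fill_gaps(image: Frame, plate: Frame) -> Frame:
    """Copy clean plate values into missing (None) pixels of a stereo image."""
    return [[plate[r][c] if value is None else value
             for c, value in enumerate(row)]
            for r, row in enumerate(image)]


frames = [[[1, 9, 3]], [[1, 2, 9]]]               # 9 is the foreground object
masks = [[[False, True, False]], [[False, False, True]]]
plate = build_clean_plate(frames, masks)
print(plate)                                      # [[1, 2, 3]]
print(fill_gaps([[1, None, 3]], plate))           # [[1, 2, 3]]
```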
6. The method of claim 1 wherein said machine learning system performs said generating an object mask for said object in said one or more object frames; said generating an object depth model; said generating one or more gap filling pixel values; and, wherein said generating one or more gap filling pixel values comprises generating a clean plate frame from one or more of said one or more 2D frames; and, copying pixel values from said clean plate frame to said one or more missing pixels; and, said corresponding 3D conversion dataset comprises a masking input comprising an identity of said object; and, a location of one or more feature points of said object in said one or more object frames; a masking output comprising a path comprising one or more segments, each segment comprising a curve defined by one or more control points, wherein said path is a boundary of said object mask; an object depth model input comprising said object mask; an object depth model output comprising one or more of a region model comprising one or more regions within said object mask; a planar or curved 3D surface associated with each of said one or more regions; and, a point cloud of 3D points, each of said 3D points associated with a pixel within said object mask; a clean plate input comprising one or more of said one or more 2D frames; and, a clean plate output comprising said clean plate frame associated with said one or more 2D frames.
In the machine learning method of converting 2D video to 3D video, the machine learning system performs object mask generation, object depth modeling, and gap filling. The gap filling process involves creating a clean plate frame from existing 2D frames and copying pixel values from it to fill missing pixels. The conversion dataset includes: object identity and feature point locations as mask inputs; a path of curve segments defining the mask boundary as mask output; the object mask as depth model input; region models (regions and their 3D surfaces) and/or a point cloud as depth model outputs; the original 2D frames as clean plate input; and the resulting clean plate frame as clean plate output.
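One way to picture the conversion dataset listed in this claim is as a set of input/output records, one per conversion step. The following sketch uses Python dataclasses. Every class and field name is hypothetical, and the concrete encodings (planes as three coefficients, frames as nested lists) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
Frame = List[List[int]]


@dataclass
class MaskingExample:
    object_identity: str                 # input: what the object is
    feature_points: List[Point2D]        # input: located features
    boundary_path: List[List[Point2D]]   # output: segments of control points


@dataclass
class DepthModelExample:
    mask: List[List[bool]]               # input: the object mask
    region_planes: Dict[int, Tuple[float, float, float]] = field(default_factory=dict)
    point_cloud: List[Point3D] = field(default_factory=list)

    def has_output(self) -> bool:
        """The claim allows a region model, a point cloud, or both."""
        return bool(self.region_planes) or bool(self.point_cloud)


@dataclass
class CleanPlateExample:
    frames: List[Frame]                  # input: one or more 2D frames
    clean_plate: Frame                   # output: the clean plate frame


@dataclass
class ConversionExample:
    masking: MaskingExample
    depth: DepthModelExample
    clean_plate: CleanPlateExample
```

A training set in this sketch would simply be a list of `ConversionExample` records, each paired with the 2D scene it describes.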
7. The method of claim 1, wherein said generating an object mask for said object in said one or more object frames comprises defining a 3D space associated with said one or more object frames; obtaining a 3D object model of said object; and, defining a position and orientation of said 3D object model in said 3D space that aligns said 3D object model with said image of at least a portion of said object in said one or more object frames; and, said assigns a pixel depth to one or more of said one or more masked pixels comprises associates a point in said 3D object model in said 3D space with each masked pixel; and, assigns a depth of said point in said 3D space to said pixel depth for the associated masked pixel.
In the machine learning method of converting 2D video to 3D video, the object mask generation involves defining a 3D space, obtaining a 3D object model, and positioning/orienting it within the 3D space to align with the object's image in the 2D frame. Pixel depth assignment involves associating a point in the 3D object model with each masked pixel, and then assigning the 3D point's depth in 3D space as the pixel's depth.
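The model-based depth assignment can be sketched with a small point-sampled model and a pinhole camera. The example below assumes the model is a list of 3D points, that its orientation is a single rotation about the vertical axis (`yaw`), and that the principal point is the image centre. These simplifications and the names `place`, `project` and `depth_from_model` are assumptions made for illustration.

```python
import math
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def place(point: Point3D, yaw: float, translation: Point3D) -> Point3D:
    """Rotate a model point about the Y axis, then translate it into the scene."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return (c * x + s * z + tx, y + ty, -s * x + c * z + tz)


def project(point: Point3D, focal: float, cx: float, cy: float) -> Tuple[int, int]:
    """Pinhole projection to the nearest pixel (column, row)."""
    x, y, z = point
    return (round(focal * x / z + cx), round(focal * y / z + cy))


def depth_from_model(model: List[Point3D], yaw: float, translation: Point3D,
                     focal: float, width: int, height: int
                     ) -> List[List[Optional[float]]]:
    """Give each covered pixel the depth of the nearest model point that
    projects onto it. Pixels left as None are outside the mask."""
    depth: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for p in model:
        placed = place(p, yaw, translation)
        if placed[2] <= 0:
            continue  # behind the camera
        col, row = project(placed, focal, width / 2, height / 2)
        if 0 <= col < width and 0 <= row < height:
            if depth[row][col] is None or placed[2] < depth[row][col]:
                depth[row][col] = placed[2]
    return depth
```

In this sketch the set of pixels with a depth value doubles as the object mask, which mirrors how the claim ties mask generation to the placement of the 3D object model.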
8. The method of claim 7, wherein said obtaining a 3D object model of said object comprises obtaining 3D scanner data captured from said object; and, converting said 3D scanner data into said 3D object model.
In the machine learning method of converting 2D video to 3D video where object mask generation involves defining a 3D space and using a 3D object model, obtaining the 3D object model involves capturing 3D scanner data of the object and converting that data into the 3D object model.
9. The method of claim 8, wherein said obtaining said 3D scanner data comprises obtaining data from a time-of-flight system or a light-field system.
In the machine learning method of converting 2D video to 3D video where object mask generation involves defining a 3D space and using a 3D object model derived from scanner data, the 3D scanner data is obtained using a time-of-flight or light-field system.
10. The method of claim 8, wherein said obtaining said 3D scanner data comprises obtaining data from a triangulation system.
In the machine learning method of converting 2D video to 3D video where object mask generation involves defining a 3D space and using a 3D object model derived from scanner data, the 3D scanner data is obtained using a triangulation system.
11. The method of claim 8, wherein said converting said 3D scanner data into said 3D object model comprises retopologizing said 3D scanner data to form said 3D object model from a reduced number of polygons or parameterized surfaces.
In the machine learning method of converting 2D video to 3D video where object mask generation involves defining a 3D space and using a 3D object model derived from scanner data, converting the 3D scanner data into the 3D object model involves retopologizing the data to form the model from fewer polygons or parameterized surfaces.
12. The method of claim 7, further comprising dividing said 3D object model into object parts, wherein said object parts may have motion relative to one another; augmenting said 3D object model with one or more degrees of freedom that reflect said motion relative to one another of said object parts; and, determining values of each of said one or more degrees of freedom that align said image of said at least a portion of said object in a plurality of frames of said one or more object frames with said 3D object model modified by said values of said one or more degrees of freedom.
In the machine learning method of converting 2D video to 3D video where object mask generation involves defining a 3D space and using a 3D object model, the method further comprises dividing the 3D object model into parts capable of relative motion. The model is augmented with degrees of freedom representing this motion. The system then determines values for these degrees of freedom to align the modified 3D object model with the object's image across multiple frames.
13. The method of claim 12, wherein said determining values of each of said one or more degrees of freedom comprises selecting one or more features in each of said object parts, each having coordinates in said 3D object model; determining pixel locations of said one or more features in said one or more object frames; and, calculating a position and orientation of one of said object parts and calculating said values of each of said one or more degrees of freedom to align a projection of said coordinates in said 3D model onto a camera plane with said pixel locations in said one or more object frames.
In the machine learning method of converting 2D video to 3D video, where a 3D object model is divided into parts with degrees of freedom and values of those degrees of freedom are determined for alignment, determining the values involves selecting features within each object part, each with coordinates in the 3D model. The pixel locations of these features are determined in the object frames. Then the position and orientation of one of the object parts, together with the degree-of-freedom values, are calculated so that the projection of the features' 3D model coordinates onto a camera plane aligns with their pixel locations in the frames.
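The alignment described here amounts to minimising reprojection error. The following minimal sketch fits a single degree of freedom, a hinge angle about the vertical axis, by exhaustive search over candidate angles. It assumes the part's position (`offset`) and the focal length are already known, and it uses a grid search in place of whatever solver a real system would use. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def hinge_rotate(p: Point3D, theta: float) -> Point3D:
    """Rotate a feature about the Y axis through the hinge at the origin."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)


def project(p: Point3D, focal: float) -> Pixel:
    """Pinhole projection onto the camera plane."""
    x, y, z = p
    return (focal * x / z, focal * y / z)


def reprojection_error(theta: float, features: List[Point3D],
                       observed: List[Pixel], offset: Point3D,
                       focal: float) -> float:
    """Sum of squared pixel distances between projected and observed features."""
    total = 0.0
    for feature, (u_obs, v_obs) in zip(features, observed):
        x, y, z = hinge_rotate(feature, theta)
        u, v = project((x + offset[0], y + offset[1], z + offset[2]), focal)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total


def fit_hinge_angle(features: List[Point3D], observed: List[Pixel],
                    offset: Point3D, focal: float, steps: int = 2000) -> float:
    """Pick the angle in [-pi/2, pi/2] with the lowest reprojection error."""
    lo, hi = -math.pi / 2, math.pi / 2
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda t: reprojection_error(t, features, observed, offset, focal))
```

With more degrees of freedom the same error function would be minimised over all of them jointly, typically with a gradient-based or least-squares solver instead of a grid.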
14. The method of claim 13, wherein said determining pixel locations of said one or more features in said one or more object frames comprises selecting said pixel locations in one or more key frames; and, tracking said features across one or more non-key frames using a computer.
In the machine learning method of converting 2D video to 3D video, where a 3D object model is divided into parts with degrees of freedom and the pixel locations of features in those parts are determined, determining the pixel locations involves selecting them in one or more key frames and then tracking the features across non-key frames using a computer.
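Key-frame selection followed by automatic tracking can be sketched with simple template matching. The example assumes greyscale frames stored as nested lists, treats the first frame as the key frame, and matches by sum of absolute differences within a small search window. The matching method and the names `patch`, `sad` and `track_feature` are assumptions; the claim does not specify how tracking is done.

```python
from typing import List, Tuple

Frame = List[List[int]]


def patch(frame: Frame, row: int, col: int, radius: int) -> List[List[int]]:
    """Cut out the square neighbourhood centred on (row, col)."""
    return [frame[r][col - radius:col + radius + 1]
            for r in range(row - radius, row + radius + 1)]


def sad(a: List[List[int]], b: List[List[int]]) -> int:
    """Sum of absolute differences between two equally sized patches."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def track_feature(frames: List[Frame], key_location: Tuple[int, int],
                  radius: int = 1, search: int = 2) -> List[Tuple[int, int]]:
    """Follow a feature chosen in frames[0] (the key frame) through the
    remaining frames, searching near its previous location each time."""
    row, col = key_location
    template = patch(frames[0], row, col, radius)
    height, width = len(frames[0]), len(frames[0][0])
    locations = [key_location]
    for frame in frames[1:]:
        best = None
        for r in range(max(radius, row - search), min(height - radius, row + search + 1)):
            for c in range(max(radius, col - search), min(width - radius, col + search + 1)):
                score = sad(template, patch(frame, r, c, radius))
                if best is None or score < best[0]:
                    best = (score, r, c)
        _, row, col = best
        locations.append((row, col))
    return locations
```

Given a list of frames and the `(row, column)` location an operator picked in the key frame, `track_feature` returns one location per frame.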
December 14, 2015
March 28, 2017