Method and Apparatus for Scene Segmentation for Three-Dimensional Scene Reconstruction

Published: June 17, 2025

Assignee: not available in the USPTO data

Inventors: Yingen Xiong, Christopher A. Peri

Technical Abstract

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for obtaining scene segmentation, the method comprising: obtaining, from an image sensor, image data of a real-world scene; obtaining, from a depth sensor, sparse depth data of the real-world scene; passing the image data to a first neural network to obtain one or more object regions of interest (ROIs) and one or more feature map ROIs, wherein each object ROI comprises at least one detected object; passing the image data and the sparse depth data to a second neural network to obtain one or more dense depth map ROIs; aligning the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs; and passing the aligned one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a fully convolutional network to obtain a segmentation of the real-world scene, wherein the segmentation contains one or more pixelwise predictions of one or more objects in the real-world scene; wherein aligning the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs comprises resizing, using an image-guided filter, at least some of the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a common size.

2. The method of claim 1, wherein the first neural network comprises: a first two-dimensional convolutional layer configured to receive the image data and output encoded image data; one or more multi-scale residual blocks each comprising one or more two-dimensional convolutional blocks and one or more concatenation blocks, each multi-scale residual block configured to receive the encoded image data and output one or more scale-dependent predictions of one or more detected objects in the image data; and a second two-dimensional convolutional layer configured to receive the encoded image data and output one or more feature map pyramids, the second two-dimensional convolutional layer comprising one or more second two-dimensional convolutional blocks and one or more second concatenation blocks.

3. The method of claim 1, wherein passing the image data and the sparse depth data to the second neural network comprises: passing the sparse depth data to a plurality of encoding and decoding layers to obtain one or more sparse depth maps; and passing the image data and the one or more sparse depth maps to an image-guided super-resolution stage to obtain the one or more dense depth map ROIs.

4. The method of claim 1, wherein aligning the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs further comprises: for each of the one or more object ROIs, mapping the object ROI to a corresponding one of the one or more feature map ROIs and to a corresponding one of the one or more dense depth map ROIs.

5. The method of claim 1, wherein the segmentation of the real-world scene comprises a semantic segmentation mask.

6. The method of claim 5, further comprising: for each of the one or more object ROIs, obtaining an object classification of the at least one detected object in the object ROI; and combining the obtained object classification with the semantic segmentation mask to obtain an instance segmentation of the real-world scene.

7. The method of claim 1, wherein the method is performed using at least one processing device of a battery-powered portable device.

8. An apparatus for obtaining scene segmentation, the apparatus comprising: an image sensor; a depth sensor; at least one processing device configured to: obtain, from the image sensor, image data of a real-world scene; obtain, from the depth sensor, sparse depth data of the real-world scene; pass the image data to a first neural network to obtain one or more object regions of interest (ROIs) and one or more feature map ROIs, wherein each object ROI comprises at least one detected object; pass the image data and the sparse depth data to a second neural network to obtain one or more dense depth map ROIs; align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs; and pass the aligned one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a fully convolutional network to obtain a segmentation of the real-world scene, wherein the segmentation contains one or more pixelwise predictions of one or more objects in the real-world scene; wherein, to align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs, the at least one processing device is configured to resize, using an image guided filter, at least some of the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a common size.

9. The apparatus of claim 8, wherein the first neural network comprises: a first two-dimensional convolutional layer configured to receive the image data and output encoded image data; one or more multi-scale residual blocks each comprising one or more two-dimensional convolutional blocks and one or more concatenation blocks, each multi-scale residual block configured to receive the encoded image data and output one or more scale-dependent predictions of one or more detected objects in the image data; and a second two-dimensional convolutional layer configured to receive the encoded image data and output one or more feature map pyramids, the second two-dimensional convolutional layer comprising one or more second two-dimensional convolutional blocks and one or more second concatenation blocks.

10. The apparatus of claim 8, wherein, to pass the image data and the sparse depth data to the second neural network, the at least one processing device is configured to: pass the sparse depth data to a plurality of encoding and decoding layers to obtain one or more sparse depth maps; and pass the image data and the one or more sparse depth maps to an image-guided super-resolution stage to obtain the one or more dense depth map ROIs.

11. The apparatus of claim 8, wherein, to align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs, the at least one processing device is further configured to: for each of the one or more object ROIs, map the object ROI to a corresponding one of the one or more feature map ROIs and to a corresponding one of the one or more dense depth map ROIs.

12. The apparatus of claim 8, wherein the segmentation of the real-world scene comprises a semantic segmentation mask.

13. The apparatus of claim 12, wherein the at least one processing device is further configured to: for each of the one or more object ROIs, obtain an object classification of the at least one detected object in the object ROI; and combine the obtained object classification with the semantic segmentation mask to obtain an instance segmentation of the real-world scene.

14. The apparatus of claim 8, wherein the apparatus is a battery-powered portable device.

15. A non-transitory computer-readable medium containing instructions that, when executed by at least one processor of an apparatus comprising an image sensor and a depth sensor, cause the apparatus to: obtain, from the image sensor, image data of a real-world scene; obtain, from the depth sensor, sparse depth data of the real-world scene; pass the image data to a first neural network to obtain one or more object regions of interest (ROIs) and one or more feature map ROIs, wherein each object ROI comprises at least one detected object; pass the image data and the sparse depth data to a second neural network to obtain one or more dense depth map ROIs; align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs; and pass the aligned one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a fully convolutional network to obtain a segmentation of the real-world scene, wherein the segmentation contains one or more pixelwise predictions of one or more objects in the real-world scene; wherein the instructions that when executed cause the apparatus to align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs comprise instructions that when executed cause the apparatus to resize, using an image guided filter, at least some of the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs to a common size.

16. The non-transitory computer-readable medium of claim 15, wherein the first neural network comprises: a first two-dimensional convolutional layer configured to receive the image data and output encoded image data; one or more multi-scale residual blocks each comprising one or more two-dimensional convolutional blocks and one or more concatenation blocks, each multi-scale residual block configured to receive the encoded image data and output one or more scale-dependent predictions of one or more detected objects in the image data; and a second two-dimensional convolutional layer configured to receive the encoded image data and output one or more feature map pyramids, the second two-dimensional convolutional layer comprising one or more second two-dimensional convolutional blocks and one or more second concatenation blocks.

17. The non-transitory computer-readable medium of claim 15, wherein the instructions that when executed cause the apparatus to pass the image data and the sparse depth data to the second neural network comprise instructions that when executed cause the apparatus to: pass the sparse depth data to a plurality of encoding and decoding layers to obtain one or more sparse depth maps; and pass the image data and the one or more sparse depth maps to an image-guided super-resolution stage to obtain the one or more dense depth map ROIs.

18. The non-transitory computer-readable medium of claim 15, wherein the instructions that when executed cause the apparatus to align the one or more object ROIs, one or more feature map ROIs, and one or more dense depth map ROIs further comprise instructions that when executed cause the apparatus to: for each of the one or more object ROIs, map the object ROI to a corresponding one of the one or more feature map ROIs and to a corresponding one of the one or more dense depth map ROIs.

19. The non-transitory computer-readable medium of claim 15, wherein the segmentation of the real-world scene comprises a semantic segmentation mask.

20. The non-transitory computer-readable medium of claim 19, further containing instructions that when executed cause the apparatus to: for each of the one or more object ROIs, obtain an object classification of the at least one detected object in the object ROI; and combine the obtained object classification with the semantic segmentation mask to obtain an instance segmentation of the real-world scene.
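The alignment step recited in claims 1, 4, 11, and 18 and the instance segmentation step recited in claims 6, 13, and 20 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the image, feature map, and dense depth map are plain 2D lists that may have different resolutions, and that each detection is an axis-aligned box given in image pixel coordinates. All function names (`scale_box`, `crop`, `resize_nearest`, `align_roi`, `instance_map`) are hypothetical. Nearest-neighbor resampling is used as a simple stand-in for the image-guided filter the claims recite, and the neural networks themselves are not modeled.

```python
def scale_box(box, src_hw, dst_hw):
    """Map an (x0, y0, x1, y1) box from a src-resolution grid to a dst-resolution grid."""
    sy = dst_hw[0] / src_hw[0]
    sx = dst_hw[1] / src_hw[1]
    x0, y0, x1, y1 = box
    nx0, ny0 = int(x0 * sx), int(y0 * sy)
    # Keep at least one cell, and stay inside the destination grid.
    nx1 = min(max(nx0 + 1, int(round(x1 * sx))), dst_hw[1])
    ny1 = min(max(ny0 + 1, int(round(y1 * sy))), dst_hw[0])
    return nx0, ny0, nx1, ny1


def crop(grid, box):
    """Cut the box (end-exclusive) out of a 2D list."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in grid[y0:y1]]


def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbor resize; a stand-in for the claimed image-guided filter."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def align_roi(box, image, feature_map, depth_map, size):
    """Map one object box onto each input, crop it, and resize to size x size."""
    image_hw = (len(image), len(image[0]))
    aligned = []
    for grid in (image, feature_map, depth_map):
        grid_hw = (len(grid), len(grid[0]))
        roi = crop(grid, scale_box(box, image_hw, grid_hw))
        aligned.append(resize_nearest(roi, size, size))
    return tuple(aligned)


def instance_map(semantic_mask, detections):
    """Combine per-box class labels with a semantic mask into instance ids.

    detections is a list of (box, class_id). A pixel receives instance id
    i (1-based) when it lies inside box i and the mask carries that box's
    class. Pixels matched by no box keep id 0.
    """
    height, width = len(semantic_mask), len(semantic_mask[0])
    instances = [[0] * width for _ in range(height)]
    for inst_id, (box, class_id) in enumerate(detections, start=1):
        x0, y0, x1, y1 = box
        for y in range(y0, y1):
            for x in range(x0, x1):
                if semantic_mask[y][x] == class_id and instances[y][x] == 0:
                    instances[y][x] = inst_id
    return instances
```

In this sketch, `align_roi` returns three grids of identical size for a single detected object, which is the form a downstream segmentation network would need as stacked input. `instance_map` shows one simple way that a per-object classification can split a class-level mask into separate objects. Overlapping boxes are resolved here by first-come order, which is an arbitrary choice for illustration.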

Patent Metadata

Filing Date

Unknown

