US-20250299431-A1

Method, Apparatus, and Electronic Device for Three-Dimensional Scene Generation

Published September 25, 2025

Assignee not available in USPTO data we have

Inventors not available in USPTO data we have

Technical Abstract

Embodiments of the present application disclose a method, an apparatus, and an electronic device for three-dimensional scene generation. A specific implementation of the method includes: obtaining a target text, and generating a panoramic image described by the target text; obtaining multi-view information in a plurality of preset views, and generating a multi-view image in the plurality of views with the panoramic image; performing depth estimation on the panoramic image to determine a sparse point cloud corresponding to the panoramic image; and generating, based on the multi-view image, the multi-view information, and the sparse point cloud, a three-dimensional scene model described by the target text.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A method for three-dimensional scene generation, comprising:

. The method according to, wherein the generating a panoramic image described by the target text comprises:

. The method according to, wherein the target diffusion model is a model obtained by performing a target operation on an original diffusion model, wherein the original diffusion model is used to represent a correspondence between a text and a two-dimensional image, and the target operation comprises: freezing a parameter of the original diffusion model, and inserting a learnable module into the original diffusion model, wherein the learnable module is configured to convert the two-dimensional image into the panoramic image.

. The method according to, wherein the learnable module comprises a low-rank matrix obtained by decomposing a parameter matrix of the original diffusion model using a low-rank adaptation technology.

. The method according to, wherein the method further comprises:

. The method according to, wherein the three-dimensional scene model comprises a three-dimensional Gaussian radiance field.

. The method according to, wherein the method further comprises:

. An electronic device, comprising:

. The device according to, wherein the programs causing the one or more processors to generate a panoramic image described by the target text comprises programs causing the one or more processors to:

. The device according to, wherein the target diffusion model is a model obtained by performing a target operation on an original diffusion model, wherein the original diffusion model is used to represent a correspondence between a text and a two-dimensional image, and the target operation comprises: freezing a parameter of the original diffusion model, and inserting a learnable module into the original diffusion model, wherein the learnable module is configured to convert the two-dimensional image into the panoramic image.

. The device according to, wherein the learnable module comprises a low-rank matrix obtained by decomposing a parameter matrix of the original diffusion model using a low-rank adaptation technology.

. The method according to, wherein the programs further cause the one or more processors to:

. The device according to, wherein the three-dimensional scene model comprises a three-dimensional Gaussian radiance field.

. The device according to, wherein the programs further cause the one or more processors to:

. A non-transitory computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, causing the processor to perform:

. The medium according to, wherein the programs causing the processors to generate a panoramic image described by the target text comprises programs causing the processors to:

. The medium according to, wherein the target diffusion model is a model obtained by performing a target operation on an original diffusion model, wherein the original diffusion model is used to represent a correspondence between a text and a two-dimensional image, and the target operation comprises: freezing a parameter of the original diffusion model, and inserting a learnable module into the original diffusion model, wherein the learnable module is configured to convert the two-dimensional image into the panoramic image.

. The medium according to, wherein the learnable module comprises a low-rank matrix obtained by decomposing a parameter matrix of the original diffusion model using a low-rank adaptation technology.

. The medium according to, wherein the programs further cause the processors to:

. The medium according to, wherein the three-dimensional scene model comprises a three-dimensional Gaussian radiance field.

Detailed Description

Complete technical specification and implementation details from the patent document.

This application claims priority to Chinese Application No. 202410339067.1 filed Mar. 22, 2024, the disclosure of which is incorporated herein by reference in its entirety.

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, apparatus, and an electronic device for three-dimensional scene generation.

Artificial intelligence generated content (AIGC) refers to content generated by artificial intelligence. In terms of 3D scene generation, AIGC may be used to automatically create a realistic background environment. With the emergence of commercial mixed reality platforms and the rapid innovation of 3D graphics technologies, high-quality 3D scene generation has become one of the most important issues in computer vision. Generating a 3D scene background using AIGC has the advantages of being fast, efficient, customizable, creative, and versatile.

This section of the present disclosure is provided to give a brief overview of concepts, which will be described in detail later in the Detailed Description section. This section of the present disclosure is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.

According to a first aspect, an embodiment of the present disclosure provides a method for three-dimensional scene generation. The method includes: obtaining a target text, and generating a panoramic image described by the target text; obtaining multi-view information in a plurality of preset views, and generating a multi-view image in the plurality of views with the panoramic image; performing depth estimation on the panoramic image to determine a sparse point cloud corresponding to the panoramic image; and generating, based on the multi-view image, the multi-view information, and the sparse point cloud, a three-dimensional scene model described by the target text.

According to a second aspect, an embodiment of the present disclosure provides an apparatus for three-dimensional scene generation. The apparatus includes: an obtaining unit configured to obtain a target text, and generate a panoramic image described by the target text; a first generation unit configured to obtain multi-view information in a plurality of preset views, and generate a multi-view image in the plurality of views with the panoramic image; a determination unit configured to perform depth estimation on the panoramic image to determine a sparse point cloud corresponding to the panoramic image; and a second generation unit configured to generate, based on the multi-view image, the multi-view information, and the sparse point cloud, a three-dimensional scene model described by the target text.

According to a third aspect, an embodiment of the present disclosure provides an electronic device. The electronic device includes: one or more processors; and a storage apparatus configured to store one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for three-dimensional scene generation in the first aspect.

According to a fourth aspect, an embodiment of the disclosure provides a computer-readable medium storing a computer program. The computer program, when executed by a processor, causes the processor to perform the steps of the method for three-dimensional scene generation in the first aspect.

In the method, the apparatus, and the electronic device for three-dimensional scene generation provided in the embodiments of the present disclosure, the target text is obtained, and the panoramic image described by the target text is generated; then, the multi-view information in the plurality of preset views is obtained, and the multi-view image in the plurality of views is generated with the panoramic image; next, depth estimation is performed on the panoramic image to determine the sparse point cloud corresponding to the panoramic image; and finally, the three-dimensional scene model described by the target text is generated based on the multi-view image, the multi-view information, and the sparse point cloud.

The embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the accompanying drawings and the embodiments of the present disclosure are only for exemplary purposes, and are not intended to limit the scope of protection of the present disclosure.

It should be understood that the various steps described in the method implementations of the present disclosure may be performed in different orders, and/or performed in parallel. Furthermore, additional steps may be included and/or the execution of the illustrated steps may be omitted in the method implementations. The scope of the present disclosure is not limited in this respect.

The term “include” used herein and the variations thereof are an open-ended inclusion, namely, “include but not limited to”. The term “based on” is “at least partially based on”. The term “an embodiment” means “at least one embodiment”. The term “another embodiment” means “at least one other embodiment”. The term “some embodiments” means “at least some embodiments”. Related definitions of the other terms will be given in the description below.

It should be noted that concepts such as “first” and “second” mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the sequence of functions performed by these apparatuses, modules, or units, or their interdependence.

It should be noted that the modifiers “one” and “a plurality of” mentioned in the present disclosure are illustrative and not restrictive, and those skilled in the art should understand that unless the context clearly indicates otherwise, the modifiers should be understood as “one or more”.

The names of messages or information exchanged between a plurality of apparatuses in the implementations of the present disclosure are used for illustrative purposes only, and are not used to limit the scope of these messages or information.

Reference is made to the accompanying drawing, which shows a process of an embodiment of a method for three-dimensional scene generation according to the present disclosure. The method for three-dimensional scene generation includes the following steps.

Step: Obtain a target text, and generate a panoramic image described by the target text.

In this embodiment, an execution body of the method for three-dimensional scene generation may obtain the target text. The target text is usually a descriptive text, and the target text is usually a text determined based on an input operation of a user. As an example, the target text may be a text entered manually by the user, may be a text obtained by converting speech inputted by the user, or may be determined by the user triggering a preset control corresponding to the text. For example, a plurality of preset text controls, such as sunset, sea, and snowflakes, may be presented to the user. If the user triggers the “sunset” control, it is determined that the target text includes sunset.

Then, the execution body may generate the panoramic image described by the target text. Specifically, the target text may be inputted into a pre-trained image generation model, to obtain the panoramic image described by the target text. The image generation model is used to represent a correspondence between a text and a panoramic image described by the text, and may include, but is not limited to: a Generative Adversarial Network (GAN) and a Variational Autoencoder (VAE).

The accompanying drawings include schematic diagrams of generating a panoramic image in a method for three-dimensional scene generation according to this embodiment. When the user inputs the text “crowded alley, cherry blossom trees, and traditional lanterns”, a corresponding panoramic image is generated. When the user inputs the text “winding street, antique shops, and old-fashioned lamp posts”, a corresponding panoramic image is generated.

Step: Obtain multi-view information in a plurality of preset views, and generate a multi-view image in the plurality of views with the panoramic image.

In this embodiment, the execution body may obtain the multi-view information in the plurality of preset views, and generate the multi-view image in the plurality of views with the panoramic image. Since one view corresponds to one camera pose, the plurality of views may correspond to a plurality of camera poses, and the multi-view information may also be understood as camera pose information. Therefore, the multi-view information may include an intrinsic camera parameter and an extrinsic camera parameter.

Herein, the plurality of views may be preset. After the multi-view information in the plurality of preset views is obtained, the image in each view is determined from the panoramic image, yielding the multi-view image.
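As an illustration only, the following minimal sketch shows one way an image in a preset view could be sampled from a panoramic image. It assumes an equirectangular panorama, a pinhole camera with intrinsic parameters fx, fy, cx, and cy, and an extrinsic rotation that is a pure yaw about the vertical axis. None of these choices is specified by the source, and the function names are hypothetical.

```python
import math

def yaw_rotation(yaw_rad):
    """Camera-to-world rotation about the vertical (y) axis (assumed extrinsic)."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def view_pixel_to_pano_pixel(u, v, fx, fy, cx, cy, R, pano_w, pano_h):
    """Map pixel (u, v) of a perspective view to equirectangular coordinates."""
    # Ray through the pixel in camera coordinates (x right, y down, z forward).
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the ray into the panorama frame.
    d = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    x, y, z = (c / norm for c in d)
    lon = math.atan2(x, z)                       # 0 means straight ahead
    lat = math.asin(max(-1.0, min(1.0, y)))      # positive means downward
    px = ((lon / (2.0 * math.pi) + 0.5) * pano_w) % pano_w
    py = (lat / math.pi + 0.5) * pano_h
    return px, py

def preset_views(num_views):
    """Evenly spaced yaw angles, one camera pose per preset view."""
    return [yaw_rotation(2.0 * math.pi * k / num_views) for k in range(num_views)]
```

Looping this mapping over every pixel of every preset view, and reading the panorama colour at the returned position, would give one image per view. A real implementation would also interpolate between panorama pixels.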

Step: Perform depth estimation on the panoramic image, to determine a sparse point cloud corresponding to the panoramic image.

In this embodiment, the execution body may perform depth estimation on the panoramic image, to determine the sparse point cloud corresponding to the panoramic image.

Specifically, a panoramic depth D (x, y) may be estimated using a panoramic image I (x, y), and projection is performed using an intrinsic camera parameter K and extrinsic camera parameters R and t, to obtain a three-dimensional sparse point cloud.

First, pixel coordinates (x, y) may be converted to coordinates (X_c, Y_c, Z_c) in a camera coordinate system.

Then, the coordinates (X_c, Y_c, Z_c) in the camera coordinate system may be converted to coordinates (X_w, Y_w, Z_w) in a world coordinate system.

Next, a sparse point cloud may be scaled based on the depth value D (x, y). In this way, a three-dimensional point corresponding to each pixel may be generated with an estimated depth of the panoramic image.
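The three conversions above can be sketched as follows. This is a hedged illustration, not the formulas of the source, which are not reproduced in this text. It assumes a pinhole intrinsic matrix K with entries fx, fy, cx, and cy, extrinsics that map world coordinates to camera coordinates as R·X + t, and a depth value measured along the camera's forward axis. The function names and the `stride` subsampling parameter are hypothetical.

```python
def unproject(x, y, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (x, y) with depth D(x, y) to a point in world coordinates."""
    # Pixel -> ray in the camera coordinate system: K^-1 [x, y, 1]^T.
    ray_cam = [(x - cx) / fx, (y - cy) / fy, 1.0]
    # Camera -> world: apply R^T to the ray (R is assumed world-to-camera).
    ray_world = [sum(R[j][i] * ray_cam[j] for j in range(3)) for i in range(3)]
    # Camera centre in world coordinates: -R^T t.
    center = [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]
    # Scale the ray by the estimated depth.
    return [center[i] + depth * ray_world[i] for i in range(3)]

def sparse_point_cloud(depth_map, fx, fy, cx, cy, R, t, stride=4):
    """Unproject every `stride`-th pixel to keep the point cloud sparse."""
    points = []
    for y in range(0, len(depth_map), stride):
        for x in range(0, len(depth_map[0]), stride):
            d = depth_map[y][x]
            if d > 0.0:  # skip pixels without a valid depth estimate
                points.append(unproject(x, y, d, fx, fy, cx, cy, R, t))
    return points
```

For a full equirectangular panorama, the pinhole ray in the first step would be replaced by a spherical direction computed from longitude and latitude. The pinhole form is used here only because the text names an intrinsic camera parameter K.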

Herein, a depth estimation method such as ZoeDepth (Zero-Shot Transfer by Combining Relative and Metric Depth) or MVSNet (an end-to-end depth estimation framework based on deep learning) may be used to perform depth estimation on the panoramic image.

Step: Generate, based on the multi-view image, the multi-view information, and the sparse point cloud, a three-dimensional scene model described by the target text.

In this embodiment, the execution body may generate, based on the multi-view image, the multi-view information, and the sparse point cloud, the three-dimensional scene model described by the target text.

Specifically, the execution body may generate, using three-dimensional reconstruction methods such as Structure From Motion (SFM) reconstruction, Neural Radiance Field (NeRF) reconstruction, and Neural Implicit Surface (NeuS) (a neural surface reconstruction method)/NeuS2 reconstruction, the three-dimensional scene model described by the target text.

In the method provided in the above embodiment of the present disclosure, the target text is obtained, and the panoramic image described by the target text is generated; then, the multi-view information in the plurality of preset views is obtained, and the multi-view image in the plurality of views is generated with the panoramic image; next, depth estimation is performed on the panoramic image to determine the sparse point cloud corresponding to the panoramic image; and finally, the three-dimensional scene model described by the target text is generated based on the multi-view image, the multi-view information, and the sparse point cloud. In this way, a panoramic image described by a text may be generated before a corresponding three-dimensional scene model, ensuring the consistency in the plurality of views and the stability of the three-dimensional scene model.

Reference is made to the accompanying drawing, which is a schematic diagram of an application scenario of a method for three-dimensional scene generation according to this embodiment. In this application scenario, the user inputs the text “beach, blue sky, ocean, coconut trees, and sunset”. Then, a panoramic image described by the text is generated. Next, multi-view information in a plurality of preset views is obtained, and a multi-view image in the plurality of views is generated with the panoramic image. Next, depth estimation is performed on the panoramic image to determine a sparse point cloud corresponding to the panoramic image. Finally, a three-dimensional scene model described by the text is generated based on the multi-view image, the multi-view information, and the sparse point cloud.

In some optional implementations, the execution body may generate the panoramic image described by the target text in the following manner: generating, using a pre-trained target diffusion model (Stable Diffusion Model), the panoramic image described by the target text, where the target diffusion model is used to represent a correspondence between a text and a panoramic image. A diffusion model may also be referred to as a generative diffusion model. The diffusion model is a type of generative model, which is a type of model that can generate a composite image. The generation of the composite image by the diffusion model starts with random noise and gradually refines through a plurality of steps until an output image appears. In each step, the model may estimate how to change a current input to a denoised version.

The diffusion model outperforms networks such as a GAN and a VAE in generating new images. Specifically, the diffusion model outperforms the networks such as the GAN and the VAE in terms of a memory capacity, a degree of freedom of images, smooth transition between images, a category of the generated image, and the like. The diffusion model is effective and easy to implement, and may generate high-quality images. Therefore, combining the diffusion model with a 3D reconstruction technology can generate a better 3D image or scene required for AR/VR.

In some optional implementations, the target diffusion model is a model obtained by performing a target operation on the original diffusion model. The original diffusion model is usually used to represent a correspondence between a text and a two-dimensional image. The target operation usually includes: freezing a parameter of the original diffusion model, and inserting a learnable module into the original diffusion model, and the learnable module may be configured to convert the two-dimensional image into the panoramic image.

Generally, a neural network has both forward propagation and backward propagation. Freezing a parameter of the neural network means only performing forward propagation on the parameter of the neural network without performing backward propagation, so that the parameter is not optimized. A parameter of the inserted learnable module is optimized, and the parameter of the learnable module is learned, to adjust a generation result of the network, so that the network can complete a specific task. Herein, the learnable module can complete a task of converting the two-dimensional image into the panoramic image.
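The effect of freezing can be illustrated with a deliberately small sketch. It assumes a scalar model in which a frozen weight and a learnable weight both act on the input, and a squared-error objective. Both are stand-ins chosen for brevity and are not the diffusion model or the training objective of the source.

```python
def train_with_frozen_weight(frozen_w, learnable_w, data, lr=0.1, steps=200):
    """Gradient descent that updates only the learnable weight."""
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            # Forward propagation uses both the frozen and the learnable weight.
            pred = frozen_w * x + learnable_w * x
            # Backward propagation is computed for the learnable weight only.
            grad += 2.0 * (pred - y) * x
        learnable_w -= lr * grad / len(data)
        # frozen_w is never written to, so it keeps its pre-trained value.
    return frozen_w, learnable_w

# Hypothetical data following y = 3x; the frozen weight alone models y = x.
samples = [(-1.0, -3.0), (0.5, 1.5), (1.0, 3.0), (2.0, 6.0)]
frozen, learned = train_with_frozen_weight(1.0, 0.0, samples)
```

In this toy setting the learnable weight absorbs the difference between what the frozen weight already produces and what the new task requires, which mirrors the role the text assigns to the inserted learnable module.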

Herein, the learnable module may obtain one copy of the parameter of the original diffusion model in a ControlNet manner, and perform learning on the copy of the parameter, so that the copy of the parameter may complete a specific task.

One of the accompanying drawings is a schematic diagram of an embodiment of generating a panoramic image by fine-tuning an original diffusion model in a method for three-dimensional scene generation. In that drawing, a text description is inputted into an original generative diffusion model, to generate an ordinary 2D image. A panoramic image corresponding to the text description is outputted by freezing a parameter of the original generative diffusion model and inserting a parameter fine-tuning module. A parameter of the parameter fine-tuning module is learnable, and a generation result of the model is adjusted by inserting the parameter fine-tuning module into the original generative diffusion model, so that the model may generate the panoramic image.

The diffusion model is a powerful technology for generating samples from text, but it has a large number of network parameters and requires extensive training. To reduce training time, the parameter of the original generative diffusion model is frozen and the learnable module is inserted into the model to adjust the generation result of the model.

In some optional implementations, the learnable module may include a low-rank matrix obtained by decomposing a parameter matrix of the original diffusion model using a low-rank adaptation (LoRA) technology.

Another of the accompanying drawings is a schematic diagram of another embodiment of generating a panoramic image by fine-tuning an original diffusion model in a method for three-dimensional scene generation. In that drawing, a text x is inputted into a target diffusion model, to obtain a panoramic image h. The target diffusion model may be composed of an original diffusion model and a learnable module with learnable parameters.

A mathematical expression of low-rank adaptation may be represented by matrix decomposition. It is assumed that a shape of a parameter matrix W of the original generative diffusion model is d×d, where a dimension of an input feature and a dimension of an output feature are both d. Low-rank adaptation is intended to implement parameter compression and simplification by decomposing the parameter matrix W into a product of two lower-rank matrices. Such decomposition is usually implemented using Singular Value Decomposition (SVD) or other low-rank approximation algorithms. It is assumed that the parameter matrix W is decomposed into a product of two lower-rank matrices A and B, i.e., W=A×B, where A is in a shape of d×r, B is in a shape of r×d, r is a low rank, and r<<d. In this solution, A may be initialized as a standard normal distribution, and B may be initialized as 0.
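A minimal sketch of the shapes and the initialization described above follows. It assumes the commonly used form of low-rank adaptation, in which the frozen matrix W is kept and the product A×B is added to it as a learnable update, so that h = xW + (xA)B. That reading is an assumption rather than a statement of the source's method, although it is consistent with initializing B as 0, which leaves the output of the original model unchanged before training. The helper names are hypothetical.

```python
import random

def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def init_lora(d, r, seed=0):
    """A (d x r) drawn from a standard normal distribution, B (r x d) set to 0."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(d)]
    B = [[0.0] * d for _ in range(r)]
    return A, B

def lora_forward(x, W, A, B):
    """h = x W + (x A) B, with W frozen and only A and B learnable."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [[b + u for b, u in zip(b_row, u_row)] for b_row, u_row in zip(base, update)]

d, r = 8, 2
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]  # stand-in for a pre-trained matrix
x = [[rng.gauss(0.0, 1.0) for _ in range(d)]]                     # one input row vector
A, B = init_lora(d, r)
h = lora_forward(x, W, A, B)
trainable, full = 2 * d * r, d * d  # learnable entries versus entries of W
```

With d = 8 and r = 2 the learnable matrices hold 32 entries against 64 in W. The saving grows as r becomes much smaller than d.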

In this manner, parameter compression and simplification are performed on the generative diffusion model using the low-rank adaptation technology and properties of a low-rank matrix. The parameter matrix of the original model is decomposed into a low-rank approximate expression, so that a storage requirement and computational complexity of the model may be significantly reduced. The parameter of the low-rank matrix is appropriately adjusted and updated, so that the model is effectively tuned.

In some optional implementations, the three-dimensional scene model may include a three-dimensional Gaussian radiance field (3D-Gaussian Splatting). The three-dimensional Gaussian radiance field is an explicit representation method of a 3D scene using a set of differentiable 3D Gaussian functions. Each Gaussian function is defined by a central position, a covariance matrix, a color, and an opacity. Specifically, a position and a covariance matrix of a 3D Gaussian sphere may be initialized first using a position of the sparse point cloud, and a color and an opacity of the 3D Gaussian sphere may be integrated using the multi-view image and the multi-view information. Due to the high rendering quality and high rendering speed of the three-dimensional Gaussian radiance field, in the solution described in this embodiment, a high-quality rendering result can be generated fast, improving the real-time performance of a system and the user experience.
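As a sketch of the representation described above, each Gaussian can be held as a record of the four listed attributes and seeded from the sparse point cloud. The isotropic starting covariance, the default opacity, and the per-point colours passed in by the caller are all assumptions made for illustration. The source derives colour and opacity from the multi-view image and the multi-view information, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Gaussian3D:
    position: Tuple[float, float, float]   # central position
    covariance: List[List[float]]          # 3 x 3 covariance matrix
    color: Tuple[float, float, float]      # RGB colour in [0, 1]
    opacity: float                         # opacity in [0, 1]

def init_gaussians(points, colors, sigma=0.05, opacity=0.1):
    """Create one Gaussian per sparse point with covariance sigma^2 * I."""
    gaussians = []
    for p, c in zip(points, colors):
        cov = [[sigma * sigma if i == j else 0.0 for j in range(3)]
               for i in range(3)]
        gaussians.append(Gaussian3D(tuple(p), cov, tuple(c), opacity))
    return gaussians

# Hypothetical two-point cloud with one colour per point.
cloud = [[0.0, 0.0, 2.0], [1.0, 0.0, 3.0]]
cloud_colors = [[1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]
scene = init_gaussians(cloud, cloud_colors)
```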

In some optional implementations, after the initialization of the three-dimensional Gaussian radiance field, for each of the plurality of views, the execution body may project the three-dimensional Gaussian radiance field into the view, compare a projected image in the view with a multi-view image corresponding to the view, to obtain a loss value, and optimize a parameter of the three-dimensional Gaussian radiance field with the loss value, until the three-dimensional Gaussian radiance field converges. That is, the central position, the covariance matrix, the color, and the opacity are optimized.
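The project, compare, and optimize loop can be sketched in a heavily simplified form. The example assumes a one-pixel renderer in which the projected value is opacity times colour, and it uses a mean squared error as a stand-in loss. Neither is taken from the source: a real system would rasterize all Gaussians into each view, and the loss would follow the formula given in the source. Only colour and opacity are optimized here.

```python
def render(color, opacity):
    """Toy stand-in for projecting the radiance field into one view."""
    return opacity * color

def fit(targets, color=0.5, opacity=0.5, lr=0.1, steps=500):
    """Optimize colour and opacity against one target pixel per view."""
    for _ in range(steps):
        g_color = g_opacity = 0.0
        for target in targets:                 # one entry per preset view
            err = render(color, opacity) - target
            g_color += 2.0 * err * opacity     # d(err^2)/d(color)
            g_opacity += 2.0 * err * color     # d(err^2)/d(opacity)
        color -= lr * g_color / len(targets)
        opacity -= lr * g_opacity / len(targets)
        opacity = min(1.0, max(0.0, opacity))  # keep opacity in a valid range
    loss = sum((render(color, opacity) - t) ** 2 for t in targets) / len(targets)
    return color, opacity, loss

# Hypothetical targets: the same pixel value observed in three views.
fitted_color, fitted_opacity, final_loss = fit([0.4, 0.4, 0.4])
```

The loop stops after a fixed number of steps for simplicity. A convergence test on the loss value would match the text more closely.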

The execution body may determine the loss value using the following formula (1):
