Systems and Methods for Multimodal Pretraining for Three-Dimensional Understanding Models

Published: September 30, 2025

Assignee: not available in the USPTO data we have

Inventors: Le Xue, Ning Yu, Shu Zhang, Junnan Li, Caiming Xiong, Silvio Savarese, Juan Carlos Niebles Duque, Ran Xu

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of training a neural network based three-dimensional (3D) encoder, the method comprising: generating a first plurality of samples of a training dataset using a first 3D model of a 3D model dataset, wherein the generating the first plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model; generating, using a first language model, a plurality of texts corresponding to the plurality of 2D images respectively, wherein the generating the plurality of texts includes: generating a first number of text descriptions for a first image of the plurality of 2D images; generating a first text based on one or more text descriptions selected from the first number of text descriptions; generating a point cloud by randomly sampling points in the first 3D model; and generating the first plurality of samples using the plurality of 2D images, the plurality of texts, and the point cloud, wherein a first sample includes the first image, the first text corresponding to the first image, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first plurality of samples.

2. The method of claim 1, wherein the first number of text descriptions are generated automatically without using metadata or a human language annotation associated with the first 3D model.

3. The method of claim 1, wherein the generating the plurality of texts includes: generating a third number of text descriptions using metadata or a human language annotation associated with the first 3D model; and generating a second text based on the first number of text descriptions and the third number of text descriptions; wherein a second sample includes the first image, the second text, and the point cloud.

4. The method of claim 1, wherein viewpoints of the plurality of 2D images of the first 3D model are spaced equally around a center of a 3D object of the first 3D model.

5. The method of claim 1, wherein the first language model includes a first generative model trained via multimodal learning.

6. The method of claim 1, wherein the neural network based 3D encoder is trained using a loss objective, and wherein the loss objective includes a 3D-to-image alignment contrastive loss and a 3D-to-text alignment contrastive loss.

7. The method of claim 1, wherein the training the neural network based 3D encoder using the training dataset including the first plurality of samples includes: generating image representations using the first image of a first sample of the first plurality of samples; generating text representations using the first text of the first sample; wherein the image representations and the text representations are generated using a pretrained vision and language model; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.

8. A system for providing a trained neural network based three-dimensional (3D) encoder, the system comprising: a memory that stores a neural network based 3D encoder and a plurality of processor-executable instructions; a communication interface that receives a 3D model dataset including a plurality of 3D models; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating a first plurality of samples of a training dataset using a first 3D model of the 3D model dataset, wherein the generating the first plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model; generating, using a first language model, a plurality of texts corresponding to the plurality of 2D images respectively, wherein the generating the plurality of texts includes: generating a first number of text descriptions for a first image of the plurality of 2D images; generating a first text based on one or more text descriptions selected from the first number of text descriptions; generating a point cloud by randomly sampling points in the first 3D model; and generating the first plurality of samples using the plurality of 2D images, the plurality of texts, and the point cloud, wherein a first sample includes the first image, the first text corresponding to the first image, and the point cloud; and training the neural network based 3D encoder using the training dataset including the first plurality of samples.

9. The system of claim 8, wherein the first number of text descriptions are generated automatically without using metadata or a human language annotation associated with the first 3D model.

10. The system of claim 9, wherein the generating the plurality of texts includes: generating a third number of text descriptions using metadata or a human language annotation associated with the first 3D model; and generating a second text based on the first number of text descriptions and the third number of text descriptions; wherein a second sample includes the first image, the second text, and the point cloud.

11. The system of claim 8, wherein viewpoints of the plurality of 2D images include: a first plurality of viewpoints spaced equally on a first 360-degree circle around a center of a 3D object of the first 3D model; and a second plurality of viewpoints spaced equally on a second 360-degree circle around the center of the 3D object.

12. The system of claim 8, wherein the first language model includes a first generative model trained via multimodal learning.

13. The system of claim 8, wherein the neural network based 3D encoder is trained using a loss objective, and wherein the loss objective includes a 3D-to-image alignment contrastive loss and a 3D-to-text alignment contrastive loss.

14. The system of claim 8, wherein the training the neural network based 3D encoder using the training dataset including the first plurality of samples includes: generating image representations using the first image of a first sample of the first plurality of samples; generating text representations using the first text of the first sample; wherein the image representations and the text representations are generated using a pretrained vision and language model; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.

15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a 3D model dataset including a plurality of 3D models; generating a first plurality of samples of a training dataset using a first 3D model of the 3D model dataset, wherein the generating the first plurality of samples includes: generating, using an image generator with multi-view rendering, a plurality of two-dimensional (2D) images having different viewpoints of the first 3D model; generating, using a first language model, a plurality of texts corresponding to the plurality of 2D images respectively, wherein the generating the plurality of texts includes: generating a first number of text descriptions for a first image of the plurality of 2D images; generating a first text based on one or more text descriptions selected from the first number of text descriptions; generating a point cloud by randomly sampling points in the first 3D model; and generating the first plurality of samples using the plurality of 2D images, the plurality of texts, and the point cloud, wherein a first sample includes the first image, the first text corresponding to the first image, and the point cloud; and training a neural network based 3D encoder using the training dataset including the first plurality of samples.

16. The non-transitory machine-readable medium of claim 15, wherein the first number of text descriptions are generated automatically without using metadata or a human language annotation associated with the first 3D model.

17. The non-transitory machine-readable medium of claim 16, wherein the generating the plurality of texts includes: generating a third number of text descriptions using metadata or a human language annotation associated with the first 3D model; and generating a second text based on the first number of text descriptions and the third number of text descriptions; wherein a second sample includes the first image, the second text, and the point cloud.

18. The non-transitory machine-readable medium of claim 15, wherein viewpoints of the plurality of 2D images include: a first plurality of viewpoints spaced equally on a first 360-degree circle around a center of a 3D object of the first 3D model; and a second plurality of viewpoints spaced equally on a second 360-degree circle around the center of the 3D object.

19. The non-transitory machine-readable medium of claim 15, wherein the first language model includes a first generative model trained via multimodal learning.

20. The non-transitory machine-readable medium of claim 19, wherein the training the neural network based 3D encoder using the training dataset including the first plurality of samples includes: generating image representations using the first image of a first sample of the first plurality of samples; generating text representations using the first text of the first sample; wherein the image representations and the text representations are generated using a pretrained vision and language model; generating 3D representations using the point cloud of the first sample; and updating parameters of the neural network based 3D encoder using a loss objective to align the 3D representations with the image representations and the text representations.
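
Illustrative sketch (not part of the patent text). The claims above recite three computational steps that may be easier to follow in code: viewpoints spaced equally on a 360-degree circle (claims 4, 11, 18), a point cloud obtained by random sampling (claims 1, 8, 15), and a loss objective combining 3D-to-image and 3D-to-text contrastive terms (claims 6, 13). The following is a minimal NumPy sketch under stated assumptions, not the claimed implementation. All function names, the radius and elevation parameters, and the temperature value are hypothetical. The claims do not say how points are sampled or which contrastive formulation is used; area-weighted sampling on a triangle mesh and a symmetric InfoNCE-style loss are assumed here purely for illustration.

```python
import numpy as np


def circle_viewpoints(num_views, radius=2.0, elevation_deg=0.0):
    """Camera positions spaced equally on one 360-degree circle around the origin."""
    azimuths = np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False)
    elev = np.deg2rad(elevation_deg)
    x = radius * np.cos(elev) * np.cos(azimuths)
    y = radius * np.cos(elev) * np.sin(azimuths)
    z = np.full(num_views, radius * np.sin(elev))
    return np.stack([x, y, z], axis=1)  # shape (num_views, 3)


def sample_point_cloud(vertices, faces, num_points, rng):
    """Randomly sample points on a triangle mesh (assumed: area-weighted faces)."""
    tri = vertices[faces]  # (F, 3, 3): three corner coordinates per face
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    idx = rng.choice(len(faces), size=num_points, p=areas / areas.sum())
    # Uniform barycentric weights inside each chosen triangle (square-root trick).
    r1 = np.sqrt(rng.random(num_points))
    r2 = rng.random(num_points)
    w = np.stack([1.0 - r1, r1 * (1.0 - r2), r1 * r2], axis=1)  # (N, 3)
    return np.einsum("nk,nkd->nd", w, tri[idx])  # (N, 3)


def _log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of `a` is the positive for row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    diag = np.arange(len(a))
    loss_ab = -_log_softmax(logits)[diag, diag].mean()
    loss_ba = -_log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_ab + loss_ba)


def alignment_loss(z_3d, z_img, z_txt):
    """Sum of a 3D-to-image term and a 3D-to-text term."""
    return contrastive_loss(z_3d, z_img) + contrastive_loss(z_3d, z_txt)
```

Under these assumptions, two calls such as `circle_viewpoints(12, elevation_deg=0.0)` and `circle_viewpoints(12, elevation_deg=30.0)` would give the two circles of viewpoints described in claims 11 and 18. In a training loop following claims 7, 14, and 20, `z_img` and `z_txt` would come from a pretrained vision and language model, and only the 3D encoder producing `z_3d` would be updated.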

Patent Metadata

Filing Date

Unknown

Publication Date

September 30, 2025

Inventors

Le Xue

Ning Yu

Shu Zhang

Junnan Li

Caiming Xiong

Silvio Savarese

Juan Carlos Niebles Duque

Ran Xu
