US-20260100025-A1

Method and System for Producing a Training Data Set Based Upon Camera Images

Published: April 9, 2026

Assignee: not available in USPTO data

Inventors: Peter Mati, Balázs Péter Taranyi

Technical Abstract

The present invention is concerned with approaches for producing a training data set (TD) based on an input training data set (ITD) comprising multiple input training images (Ii) obtained from at least one sensor.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-implemented method for producing a training data set (TD) based on an input training data set (ITD) comprising multiple input training images (Ii) obtained from at least one sensor, the method comprising, for each input training image (Ii), the steps of:
determining a depth map (DM) corresponding to the input training image (Ii), and determining a three-dimensional point cloud (PC) based on the depth map (DM);
choosing at least one first virtual pose (Pv) for the sensor which is different to a base pose (P) which corresponds to a pose of the sensor for recording the input training data set (ITD);
projecting (Proj) the point cloud (PC) to a virtual image (Iv) corresponding to the first virtual pose (Pv);
determining at least one image area (i) of the virtual image (Iv) with missing or insufficient point cloud (PC) information, and replacing the at least one determined image area (i) by a replacement image area (iR) generated by using a generative image generation model (IGM);
obtaining a training data image (I) of the training data set (TD) by using an image rendering model (RM) based on the virtual image (Iv).

2. The method according to claim 1, wherein the depth map (DM) is determined by additionally taking into account an input point cloud dataset corresponding to the input training data set (ITD).

3. The method according to claim 1, comprising the step of applying an inpainting process to the virtual image (Iv) and/or to the training data image (I).

4. The method according to claim 1, wherein the depth map (DM) is determined by using a depth model (D), preferably a monodepth model.

5. The method according to claim 1, comprising the step of annotating the training data image (I) by using a data annotation model (DAM).

6. The method according to claim 5, wherein the depth model (D), the image generation model (IGM), the image rendering model (RM) and/or the data annotation model (DAM) comprise at least one neural network.

7. The method according to claim 1, wherein the image generation model (IGM) is a latent variable generative model, preferably a score-based generative model such as a stable diffusion model.

8. The method according to claim 1, wherein the rendering model (RM) is a gaussian splatting model or a neural radiance fields model.

9. The method according to claim 5, wherein the data annotation model (DAM) is a foundation model.

10. The method according to claim 1, wherein multiple, different virtual poses (Pv) for the sensor are chosen.

11. The method according to claim 1, further comprising the step of adding the virtual image (Iv) to the three-dimensional point cloud (PC).

12. The method of claim 1, wherein the method is utilized for training of a function implemented by using a machine learning model to implement an ADAS function.

13. The method of claim 1, wherein the training data set is used for training a machine learning model comprising at least one neural network.

14. A computer program comprising instructions, which, when the program is executed by a computer, cause the computer to carry out the method of claim 1.

15. A computer-readable storage medium comprising instructions executable by at least one processor to perform the method of claim 1.

Detailed Description

Complete technical specification and implementation details from the patent document.

This application claims the benefit and/or priority to German application 10 2024 209 842.6 filed Oct. 9, 2024, the content of which is incorporated by reference herein in its entirety.

The present invention is concerned with methods, in particular computer-implemented methods, for producing training data sets based on input training data sets comprising multiple images obtained from a camera. The present invention also comprises computer programs and computer-readable (storage) media that store instructions that perform these methods.

The field of machine learning is concerned with algorithms capable of making predictions based on input data, and thus with data-driven models. The underlying algorithms are frequently set up based on training data sets, by means of which the model's parameters are initially fitted. This is referred to as the training phase. In the case of supervised learning procedures, the fitting is carried out by utilizing optimization methods.

The training data sets consist of a set of input data examples together with targets or labels which refer to what is contained in the input data, e.g., object classes of objects. The input data is processed through the machine learning model, e.g., a neural network, and the results produced are compared with the labels. By means of the comparison and the specific learning algorithm applied, the parameters of the model (in the case of a neural network, the weights of the connections between neurons) are adjusted. One of the goals of such a training phase is to obtain a trained model that is capable of generalizing to new, unknown data with high precision.

In order to set up machine learning models that are capable of performing their tasks with high accuracy, large amounts of different types of training data covering various scenarios are needed. One option for enlarging training data sets is to apply data augmentation techniques, which add slightly modified copies of training examples created from the existing data. The augmentations comprise, for instance, two-dimensional (2D) image transformations, especially geometrical transformations, the addition of noise or color aberrations, and many more. Data augmentation is also used to increase the robustness of the respective machine learning model during training. However, data augmentation is typically applied during training only and is used, e.g., to prevent overfitting or to increase the number of available training examples.
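As an illustration of the conventional 2D augmentations mentioned above, the following is a minimal sketch using NumPy. The function name `augment` and all parameter values are assumptions chosen for illustration; they are not taken from the application.

```python
import numpy as np

def augment(image, rng, noise_std=5.0):
    """Return a slightly modified copy of an HxWx3 uint8 image.

    Illustrative 2D augmentations only: random horizontal flip,
    additive Gaussian noise and a per-channel color gain.
    """
    out = image.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # geometric transformation: horizontal flip
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # additive noise
    gain = rng.uniform(0.9, 1.1, size=(1, 1, 3))  # mild color shift per channel
    out = out * gain
    return np.clip(out, 0, 255).astype(np.uint8)
```

Such transformations act in the image plane only, which is the limitation the 3D approach described later is meant to address.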

For different scenarios and/or applications of the same machine learning model, typically completely different training data sets are needed. Especially, if a machine learning model is used for perception tasks, the training data sets need to be adapted to the specific perception situation, for instance to the sensors used to sense the environment, to their location and to various further environmental parameters. However, the collection and processing of image-based training data sets is labor and time intensive and costly.

Thus, it is an object of the present invention to improve the availability of different training data sets for machine learning models.

This object is achieved by the method according to claim 1, its advantageous use according to claim 13, the computer program according to claim 14 and the computer-readable (storage) medium according to claim 15.

With regard to the method, the objective technical problem is solved by a method, in particular a computer-implemented method, for producing a training data set based on an input training data set comprising multiple input training images obtained from a sensor, e.g., a camera or another environmental sensor. The input training data set may as well be collected using multiple sensors of the same type or of different types.

For each input training image of the input training data set, the method comprises the steps of: determining a depth map corresponding to the input training image, and determining a three-dimensional point cloud based on the depth map; choosing at least one first virtual pose for the sensor which is different to a base pose which corresponds to a pose of the sensor for recording the input training data set; projecting the point cloud to a virtual image corresponding to the first virtual pose; determining at least one image area of the virtual image with missing or insufficient point cloud information, and replacing the at least one determined image area by a replacement image area generated by using a generative image generation model; obtaining a training data image of the training data set by using an image rendering model based on the virtual image.
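The sequence of steps can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not the implementation of the application: the callables `depth_model`, `to_point_cloud`, `project`, `generate` and `render` are hypothetical placeholders for the depth model, the point cloud construction, the projection, the generative image generation model and the image rendering model, and their signatures are assumed.

```python
import numpy as np

def produce_training_image(input_image, virtual_pose, *, depth_model,
                           to_point_cloud, project, generate, render):
    """Run the per-image steps with caller-supplied (hypothetical) components."""
    # Determine a depth map, then a 3D point cloud based on it.
    depth_map = depth_model(input_image)
    points, colors = to_point_cloud(input_image, depth_map)
    # Project the point cloud to the virtual pose chosen by the caller;
    # `missing` is a boolean HxW mask of areas without sufficient information.
    virtual_image, missing = project(points, colors, virtual_pose)
    # Replace the determined areas by generated replacement areas.
    replacement = generate(virtual_image, missing)
    virtual_image = np.where(missing[..., None], replacement, virtual_image)
    # Obtain the training data image from the completed virtual image.
    return render(virtual_image)
```

Passing the components as arguments keeps the sketch independent of any particular depth, generation or rendering model.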

The input training data set comprises multiple images, which may or may not be equipped with corresponding indications or labels. It is also possible that parts of the input training data set comprise labels, and other parts do not comprise labels. The images may be recorded by using one or multiple sensors. In case of an advanced driver assistance system (ADAS) application, the sensor(s) may be mounted on a vehicle.

The training data images produced from the input training data set according to the suggested method form a new training data set that can be used for training of a machine learning model, preferably a machine learning model comprising at least one neural network.

The new training data set obtained by carrying out the suggested method corresponds to a predefined or chosen virtual sensor pose different from the pose of the sensor used to collect the input training data set. In particular, the training data set may differ from the input training data set with respect to a perspective or a height of the sensor. In that case, the present invention makes it possible to generate a training data set with varying perspective and/or height without changing the position of the sensor used to collect the input training data set and without recollecting data by means of the sensor.
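A virtual pose that differs from the base pose in height and perspective can be expressed as a rigid transformation. The following is a minimal sketch, assuming a camera coordinate convention of x to the right, y downward and z forward; the function name `virtual_pose` and its parameters are hypothetical and not taken from the application.

```python
import numpy as np

def virtual_pose(height_offset_m=0.0, yaw_rad=0.0):
    """4x4 matrix mapping points from the base camera frame to the virtual frame.

    Assumed convention: x right, y down, z forward. A positive height offset
    raises the virtual camera; a positive yaw turns it to the right.
    """
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    # Orientation of the virtual camera axes, expressed in the base frame.
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    # Virtual camera center in the base frame (up is negative y).
    center = np.array([0.0, -height_offset_m, 0.0])
    T = np.eye(4)
    T[:3, :3] = R.T
    T[:3, 3] = -R.T @ center
    return T
```

For example, `virtual_pose(height_offset_m=1.0)` maps a point 10 m straight ahead of the base camera to a point that lies 1 m below the optical axis of the raised virtual camera.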

Thus, the present invention makes it possible to obtain training data for different scenarios and may be used to adapt machine learning models to the different scenarios without the need to collect new training data for that purpose. The method may also be used to extend existing training data sets. It is an advantage of the present invention that three-dimensional (3D) geometrical transformations may be carried out with respect to the input training data set.

According to one embodiment of the method, the depth map is determined by additionally taking into account an input point cloud dataset corresponding to the input training data set. The input point cloud data set may be recorded together with the input training dataset. For instance, the input point cloud data set may be obtained from at least one radar, ultrasonic or LiDAR sensor.

Preferably, the 3D point cloud is determined by further taking into account intrinsic parameters of the sensor.
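How intrinsic parameters enter the point cloud computation can be sketched with the standard pinhole camera model. This is a minimal illustration assuming a metric depth map and no lens distortion; the function name and the parameters `fx`, `fy`, `cx`, `cy` (focal lengths and principal point in pixels) are assumptions, not notation from the application.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an HxW depth map into an (H*W)x3 point cloud.

    Assumes a pinhole camera without distortion and depth measured
    along the optical (z) axis.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Monocular depth models often predict relative rather than metric depth, so a scale alignment step (for instance against an input point cloud data set) may be needed before this back-projection is meaningful.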

It is of advantage, if the suggested method further comprises the step of applying an inpainting process to the virtual image and/or to the training data image. The inpainting process may be based on a denoising diffusion probabilistic model, as, e.g., described in “RePaint: Inpainting using Denoising Diffusion Probabilistic Models” by A. Lugmayr et al., available on arXiv, doi: arXiv:2201.09865.

The depth model may in principle be any depth model available in the state of the art. Preferably, however, a Depth Anything model is used, especially the Depth Anything V2 model, which was suggested by L. Yang et al. in “Depth Anything V2”, available on arXiv, doi: arXiv:2406.09414v1.

In a preferred embodiment of the suggested method, the depth model, the image generation model, the image rendering model and/or the data annotation model comprises/comprise at least one neural network.

Preferably, the image generation model is a latent variable generative model, preferably a score-based generative model such as a stable diffusion model. For instance, the image generation model may be based on a diffusion probabilistic model, as suggested by J. Sohl-Dickstein et al. in “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” in 2015, available on arXiv, doi: arXiv:1503.03585, a noise-conditioned score network, as suggested by Y. Song et al. in “Generative Modeling by Estimating Gradients of the Data Distribution” in 2019, available on arXiv, doi: arXiv:1907.05600, or a denoising diffusion probabilistic model, as suggested by J. Ho et al. in “Denoising Diffusion Probabilistic Models” in 2020, also available on arXiv, doi: arXiv:2006.11239.

The rendering model preferably is a Gaussian splatting model or a neural radiance fields model. A suitable Gaussian splatting model is, e.g., suggested by J. Chung et al. in “LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes” in 2023, available on arXiv, doi: arXiv:2311.13384. On the other hand, a neural radiance fields model was suggested by B. Mildenhall et al. in “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, available on arXiv, doi: arXiv:2003.08934.
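Both model families form a pixel color by front-to-back alpha compositing of contributions along a camera ray (density samples in the case of neural radiance fields, depth-sorted Gaussians in the case of Gaussian splatting). The following minimal sketch shows only this shared compositing rule, with a hypothetical function name; it is not a rendering model in itself.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of N samples sorted near to far.

    colors: (N, 3) array, alphas: (N,) array with values in [0, 1].
    The weight of sample i is alpha_i times the transmittance
    prod_{j<i} (1 - alpha_j) of everything in front of it.
    """
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    return (weights[:, None] * colors).sum(axis=0)
```

An opaque first sample therefore hides everything behind it, while a half-transparent first sample lets half of the remaining contribution through.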

A preferred embodiment of the suggested method comprises the step of annotating the training data image by using a data annotation model.

Preferably, the data annotation model is a foundation model. The data annotation model may be fine-tuned with respect to a number of specific predefined indications or labels.

It is of advantage, if multiple, different virtual poses for the sensor are chosen.

According to another preferred embodiment, the method according to the present invention further comprises the step of adding the virtual image to the 3D point cloud. It is as well possible to choose multiple virtual poses, at least a first and a second virtual pose and to add all virtual images corresponding to all poses to the 3D point cloud. That way, a full representation of a scene may be obtained.

The method according to the present invention and according to any of the embodiments described preferably is used for training of a function implemented by using a machine learning model, in particular for training of a machine learning model used to implement an ADAS function in a vehicle. ADAS is short for advanced driver assistance system. Such advanced driver assistance systems (ADAS) for vehicles are based on a processing of various data sensed by various ADAS sensors, such as radar, LiDAR and ultrasonic sensors as well as cameras. By means of the ADAS sensors, information relating to an environment of the vehicle can be obtained which in turn is used to realize or implement various ADAS functions. ADAS functions on the one hand may include an assistance for the driver while control of the vehicle remains with the driver. On the other hand, depending on the level of automation, a fully autonomously driving vehicle may be realized where ADAS functions are automatically implemented. Known ADAS functions, for instance, are various methods for detecting and/or classifying objects and/or obstacles in the vicinity of the vehicle, methods for lane detection and/or lane departure, methods for rain detection or also various parking assistance functions. Once these functions are performed, the results can be used to control a vehicle either manually or automatically (e.g., using electronic control signals to control vehicle components).

The objective problem underlying the present invention is further solved by a computer program comprising instructions, which, when the program is executed by a computer, cause the computer to carry out the method according to the present invention and according to any of the embodiments described, and by a computer-readable (storage) medium comprising instructions executable by at least one processor to perform the method according to any of the embodiments described or on which the computer program according to the present invention is stored.

Finally, the objective technical problem underlying the present invention is as well solved by a training data set for training a machine learning model, preferably comprising at least one neural network, generated by the method according to the present invention and according to any of the embodiments described.

As mentioned, the invention and its preferred embodiments will be described in more detail based on the subsequent figures.

FIG. 1 shows a first preferred embodiment of the method according to the present invention;

FIG. 2 illustrates a second preferred embodiment of the method according to the present invention; and

FIGS. 3A, 3B, 3C, and 3D show two examples of input training images and corresponding training data images.

In the figures, the same elements are always provided with the same reference symbols.

Without loss of generality, the following description relates to an input training data set recorded by a sensor in the form of a vehicle camera. Typically, a camera or multiple cameras are mounted on a vehicle. The collection of training data sets is costly and demanding in view of the time needed to record the data set. Disadvantageously, the recorded training data sets can typically only be used to train machine learning models relating to the specific sensor setup and installation. For instance, a training data set recorded for a normal vehicle may not be suitable for a machine learning model of a truck.

The present invention allows for a transformation of existing input data sets to predefined new virtual camera poses and thus enables the production of new training data sets that fit different setups of the camera(s). The input training data set may be transformed to various camera poses and may be used as if the training data sets had been recorded from the respective camera poses. Accordingly, new training data sets may be generated, and machine learning models, e.g. for implementing ADAS functions, may be (re-)trained using the new training data sets.

In FIG. 1, a block diagram of a preferred embodiment of the method according to the present invention is shown. An input training data set ITD comprising multiple input training images Ii recorded by a camera serves as a basis for generating a training data set TD.

Optionally, an additional, corresponding input point cloud data set IPC may also be taken into account, as indicated by dashed lines. By means of the suggested method, each input training image Ii is processed so as to derive a virtual image Iv, from which a training data image I of the training data set TD so created is obtained.

A first step of the method comprises determining a depth map DM which corresponds to the input training image Ii by using a depth model D. The depth model D may comprise at least one neural network and may be a state-of-the-art, especially pretrained, depth model, for instance a monodepth model. If available, the depth model D may also take into account the input point cloud data set IPC. This might improve the depth estimation accuracy.

Taking into account the depth map DM, a 3D point cloud PC is generated. For generating the 3D point cloud PC, intrinsic parameters of the camera may as well be taken into account. In addition, at least one first virtual pose Pv for the camera is chosen. The first virtual pose Pv is different from a base pose P of the camera used to record the input training data set ITD.

Subsequently, the 3D point cloud PC is projected (Proj) to a virtual image Iv corresponding to the first virtual pose Pv. The virtual image Iv will comprise image areas i with missing or insufficient point cloud PC information. Among others, this is due to the fact that certain areas of the vehicle environment may not have been in the field of view of the camera when recording the input training data set ITD.
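The projection step and the origin of the missing image areas can be illustrated with a simple z-buffered point projection. This is a minimal sketch assuming a pinhole camera and one pixel per point; the function name, the matrix `T_virtual_from_base` and the intrinsic parameters are hypothetical, and the application does not prescribe this particular projection.

```python
import numpy as np

def project_to_virtual_image(points, colors, T_virtual_from_base,
                             fx, fy, cx, cy, height, width):
    """Project colored 3D points (base camera frame) into a virtual pinhole camera.

    Returns the sparse virtual image and a boolean mask that is True where no
    point landed, i.e. where point cloud information is missing.
    """
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    pts = (homogeneous @ T_virtual_from_base.T)[:, :3]
    in_front = pts[:, 2] > 1e-6  # drop points behind the virtual camera
    pts, cols = pts[in_front], colors[in_front]
    z = pts[:, 2]
    u = np.rint(fx * pts[:, 0] / z + cx).astype(int)
    v = np.rint(fy * pts[:, 1] / z + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, cols = u[inside], v[inside], z[inside], cols[inside]

    # Z-buffer: keep the smallest depth that falls on each pixel.
    zbuf = np.full((height, width), np.inf)
    np.minimum.at(zbuf, (v, u), z)
    nearest = z == zbuf[v, u]
    image = np.zeros((height, width, 3), dtype=colors.dtype)
    image[v[nearest], u[nearest]] = cols[nearest]
    return image, ~np.isfinite(zbuf)
```

Pixels for which the mask is True correspond to parts of the scene that were occluded or outside the field of view in the original recording.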

However, in order to produce a full and high-quality virtual image Iv, those image areas i with missing or insufficient information are determined and provided to an image generation model IGM embodied to determine a replacement image area iR based on the image area i.
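One possible way to determine such image areas is to flag pixels that are empty or whose neighborhood contains too few projected points. The sketch below is an assumption for illustration: the density criterion, the names `insufficient_area_mask` and `fill_areas`, and the callable `generate_fn` (a stand-in for the image generation model IGM) are hypothetical and not specified in the application.

```python
import numpy as np

def insufficient_area_mask(hole_mask, window=5, min_valid_fraction=0.5):
    """Flag empty pixels and pixels in sparsely covered neighborhoods.

    hole_mask: HxW bool array, True where no point was projected.
    window: odd side length of the square neighborhood.
    """
    valid = (~hole_mask).astype(np.float64)
    padded = np.pad(valid, window // 2, mode="edge")
    # Integral image with a leading row and column of zeros for box sums.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = hole_mask.shape
    box = (integral[window:window + h, window:window + w]
           - integral[:h, window:window + w]
           - integral[window:window + h, :w]
           + integral[:h, :w])
    return hole_mask | (box / window**2 < min_valid_fraction)

def fill_areas(virtual_image, area_mask, generate_fn):
    """Replace the flagged areas with content returned by generate_fn."""
    generated = generate_fn(virtual_image, area_mask)
    out = virtual_image.copy()
    out[area_mask] = generated[area_mask]
    return out
```

Only the flagged pixels are overwritten, so image content backed by point cloud information is left untouched.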

Optionally, the image generation model may be equipped with an inpainting function. Alternatively, such an inpainting process may optionally be provided separately.

Also optionally, at least one prompt may be provided to the image generation model IGM, the prompt defining a condition to determine the replacement image area iR. That way, the replacement image area iR may be matched to the scene shown in the virtual image Iv. Prompts may also be determined in a (semi-)automated manner, e.g., by using an image-to-text model which is embodied to provide, based on the virtual image Iv, a description of the scene depicted therein. The prompt may then be derived from the provided description.

This virtual image Iv is finally provided to a rendering model RM embodied to create a training data image I therefrom. All created training data images I then form part of the created training data set TD. By means of an inpainting process, the virtual image Iv may be modified by inserting or deleting certain parts with high correlation to the surroundings of that image part. The inpainting process may preferably be applied to the determined image areas i.

The rendering model RM serves to create a photorealistic training data image I based on the virtual image Iv. It is possible to further apply a postprocessing step of inpainting to the training data image I if it still contains certain pixels or pixel areas with missing or insufficient information.

Optionally, the created virtual images Iv may again be subjected to the depth model D and added to the 3D point cloud PC, as indicated by the dashed arrow. Multiple input training images Ii and virtual images Iv may be concatenated in the 3D point cloud PC in order to obtain a 3D representation of the entire vehicle environment. In addition, individual point clouds may also be combined, e.g. by using an iterative closest point method.
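Combining individual point clouds by an iterative closest point method can be sketched as follows. This is a minimal point-to-point variant with brute-force nearest neighbors, suitable for small clouds only; the function names are hypothetical, and the application does not specify which ICP variant is used.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ~ dst_i.

    Uses the SVD-based (Kabsch) solution for paired Nx3 point sets.
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

def icp(source, target, iterations=10):
    """Align source to target by alternating matching and rigid fitting."""
    aligned = source.copy()
    for _ in range(iterations):
        # Brute-force nearest neighbor in target for every aligned point.
        d2 = ((aligned[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
        nearest = target[d2.argmin(axis=1)]
        R, t = best_rigid_transform(aligned, nearest)
        aligned = aligned @ R.T + t
    return aligned
```

After alignment, the clouds could be merged with `np.vstack([target, icp(source, target)])`. ICP only converges to a good solution when the initial misalignment is small, so a reasonable initial pose estimate is assumed.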

FIG. 2 refers to another preferred embodiment, in which at least parts of the input training data set comprise labels L. In this case, the virtual images Iv or the training data images I, respectively, may be relabeled or reannotated by using a data annotation model DAM. The data annotation model DAM may be a foundation model, e.g. fine-tuned on the specific labels L, e.g. those of the input training data set ITD.

Finally, FIGS. 3A, 3B, 3C, and 3D show two examples of corresponding input training images and training data images, which refer to two different scenes and two different transformations of the camera pose P.

In FIG. 3A, a first input training image Ii is shown, whereas FIG. 3B refers to a corresponding training data image I. In the image I of FIG. 3B, an image area i on the right side, which was not depicted in the input training image Ii and which shows a house, is visible. The image area i showing the house was generated by the image generation model IGM.

The input training image Ii of FIG. 3C shows a scene of a road ahead of an ego vehicle. The corresponding training data image I shown in FIG. 3D, the result of a virtual rotation and elevation of the camera, shows additional details of houses on the right side of the image, which were not visible in the original image Ii. Again, this additional image area i is generated by the image generation model IGM. Moreover, for the images Iv and I, a data annotation model DAM was also used, resulting in the labels L for cars shown in the training data image of FIG. 3D.

Those skilled in the art will recognize that a wide variety of other modifications, alterations, and combinations can also be made with respect to the above described embodiments without departing from the scope of the disclosure, and that such modifications, alterations, and combinations are to be viewed as being within the ambit of the inventive concept.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06V G06V10/774 G06T G06T5/77 G06T15/10 G06V10/82 G06T2207/10028 G06T2207/20081 G06T2207/20084 G06T2207/30252 G06V20/56

Patent Metadata

Filing Date

October 7, 2025

Publication Date

April 9, 2026

Inventors

Peter Mati

Balázs Péter Taranyi
