US-20250378567-A1

Image Processing Apparatus, Image Processing Method, and Storage Medium

Published: December 11, 2025

Assignee: not available in the USPTO data

Inventors: not available in the USPTO data

Technical Abstract

A three-dimensional field relating to an object is estimated with high accuracy even in a case where the object has a gloss characteristic on its surface. An image processing apparatus according to the present disclosure obtains a plurality of captured images obtained by capturing an object from each of a plurality of image capturing viewpoints, together with a camera parameter corresponding to image capturing from each of the plurality of image capturing viewpoints; generates, based on the plurality of captured images, a plurality of low gloss images in which a gloss component of the object is reduced; and performs learning of a learning model indicating a three-dimensional field relating to the object by using the plurality of low gloss images, the plurality of captured images, and the camera parameter.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. An image processing apparatus comprising:

. The image processing apparatus according to, wherein the one or more programs further include instructions for:

. The image processing apparatus according to, wherein

. The image processing apparatus according to, wherein the one or more programs further include instructions for:

. An image processing method comprising the steps of:

. A non-transitory computer readable storage medium storing a program for causing a computer to perform a control method of controlling an image processing apparatus, the control method comprising the steps of:

Detailed Description

Complete technical specification and implementation details from the patent document.

The present disclosure relates to a technique of estimating a three-dimensional field based on a plurality of captured images which is obtained by image capturing from various positions and directions.

There is a technique of estimating a three-dimensional field relating to an object by using a plurality of captured images (hereinafter referred to as “multi-viewpoint images”) which is obtained by capturing the object to be captured (hereinafter simply referred to as “object”) from various positions and directions. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” (hereinafter referred to as “non-patent document 1”) discloses a technique of estimating radiance fields relating to an object, called NeRF (Neural Radiance Fields), as a technique of estimating a three-dimensional field relating to an object by using multi-viewpoint images.

In the NeRF, learning of a neural network to output a color and density (hereinafter simply referred to as “neural network”) is performed with respect to any position and any direction in a space captured from various positions and directions (hereinafter referred to as “image capturing space”). As a result of the learning, a learned neural network indicating the radiance fields relating to the object present in the image capturing space is obtained. By volume rendering using the learned neural network, a virtual viewpoint image corresponding to appearance in a case where the image capturing space is viewed from an arbitrary virtual viewpoint (hereinafter referred to as “virtual viewpoint”) can be generated. Hereinafter, a position at which the image capturing space is captured is described as “image capturing position” and explained.

In the learning of the neural network in the NeRF, first, any position on a ray corresponding to each pixel in each of the plurality of captured images constituting the multi-viewpoint image (hereinafter simply referred to as “ray”) is set as a sampling point. Next, in the neural network in the middle of the learning, a color and the density in the sampling point are estimated and output based on the pixel value of the captured image corresponding to each image capturing position. Subsequently, the colors of a plurality of sampling points on an identical ray are accumulated according to the density, and volume rendering is thereby performed, and the estimation value of a pixel corresponding to the ray in an image is calculated. Then, a difference between the calculated estimation value and the value (pixel value) of the corresponding pixel in the captured image corresponding to the image is calculated as a loss by a loss function. Finally, a network parameter of the neural network is updated by error backpropagation based on the loss calculated by the loss function, and thereby the learning of the neural network is performed. According to the NeRF, the three-dimensional shape, gloss, and transparency of the object can be reproduced by estimating the three-dimensional field relating to the object present in the image capturing space as a radiance field.
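The accumulation and loss steps described above can be sketched as follows. This is a minimal illustration, not the method of non-patent document 1 as implemented: the function names `render_ray` and `squared_error`, the use of sample spacings `deltas`, and the squared-error loss are assumptions chosen for the example.

```python
import math

def render_ray(colors, densities, deltas):
    """Accumulate per-sample colors along one ray according to density.

    colors:    list of (r, g, b) tuples, one per sampling point, near to far
    densities: list of non-negative densities, one per sampling point
    deltas:    list of distances covered by each sampling point (assumed input)
    """
    transmittance = 1.0  # fraction of light that has not been blocked so far
    pixel = [0.0, 0.0, 0.0]
    for color, sigma, delta in zip(colors, densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return tuple(pixel)

def squared_error(estimate, target):
    """Loss between the estimated pixel value and the captured pixel value."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target))

# Usage: an opaque red sample in front hides the green sample behind it.
estimate = render_ray([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [1e6, 1.0], [1.0, 1.0])
loss = squared_error(estimate, (1.0, 0.0, 0.0))
```

In an actual learning loop, the loss would be backpropagated to update the network parameters that produced `colors` and `densities`.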

In a case where the surface of an object is glossy, because of the reflection of light in the surface of the object, the pixel value of the captured image corresponding to each of a plurality of rays in which the sampling points at the same position are set may be significantly different. In such a case, in the technique described in non-patent document 1, there is a problem that because of the occurrence of an error in learning of transmittance in the sampling points, a three-dimensional field is estimated as if the surface of the object is at a position which is significantly different from an original position. For example, in a case where it is estimated that the surface of the object is on an inner side of the original position of the surface of the object with respect to the object, in the three-dimensional field relating to the object, an indentation arises at a position corresponding to the surface of the object. Conversely, in a case where it is estimated that the surface of the object is on an outer side of the original position of the surface of the object with respect to the object, in the three-dimensional field relating to the object, an artifact such as a cluster called a floater arises near the position corresponding to the surface of the object.

An image processing apparatus obtains a plurality of captured images obtained by capturing an object from each of a plurality of image capturing viewpoints, obtains a camera parameter corresponding to image capturing from each of the plurality of image capturing viewpoints, generates, based on the plurality of captured images, a plurality of low gloss images in which a gloss component of the object is reduced, and performs learning of a learning model indicating a three-dimensional field relating to the object by using the plurality of low gloss images, the plurality of captured images, and the camera parameter.

Further features of the present disclosure will become apparent from the following description of exemplary embodiments with reference to the attached drawings.

Hereinafter, with reference to the attached drawings, the present disclosure is explained in detail in accordance with preferred embodiments. Configurations shown in the following embodiments are merely exemplary and the present disclosure is not limited to the configurations shown schematically. Incidentally, an identical reference numeral is assigned to an identical constituent and an explanation thereof is made. Further, an explanation about each step in the flow charts mentioned below is made by using a reference symbol starting with “S.”

One of the attached drawings is a diagram illustrating an example of a configuration of an image capturing system in accordance with Embodiment 1. The image capturing system has a plurality of image capturing apparatuses, an image processing apparatus, a user interface (hereinafter “UI”) panel, a storage apparatus, and a display apparatus. Each image capturing apparatus is composed of a digital still camera, a digital video camera, or the like and is arranged in a different position. In synchronization with one another, the image capturing apparatuses each capture, from a different viewpoint and according to an image capturing condition, an image of an object present in an image capturing space, and thereby generate a piece of data on a captured image corresponding to each viewpoint (hereinafter referred to as “captured image data”).

Incidentally, image capturing in synchronization does not mean image capturing at exactly the same time, but image capturing performed under a synchronization process. In other words, it includes a case where the image capturing is performed at almost the same time. The piece of captured image data obtained by the image capturing performed by the image capturing apparatus may be a piece of still image data, a piece of moving image data, or both. Hereinafter, an explanation is made on the premise that the wording “image” has both meanings of “still image” and “moving image” unless otherwise specified. The piece of captured image data generated by each image capturing apparatus is transmitted to the image processing apparatus.

The image processing apparatus obtains pieces of data on a plurality of captured images (multi-viewpoint images) transmitted from the plurality of image capturing apparatuses and, by using the obtained multi-viewpoint images, performs learning of a three-dimensional field relating to a space including the object present in the image capturing space. Information or a signal indicating the learned three-dimensional field obtained as a result of the learning is output to the storage apparatus, the display apparatus, or the like. The image processing apparatus may also generate a virtual viewpoint image based on the learned three-dimensional field. In this case, for example, a piece of data or a signal on the virtual viewpoint image generated by the image processing apparatus is output to the storage apparatus, the display apparatus, or the like.

Incidentally, in the present embodiment, an explanation is made on the premise that, as illustrated in the drawing of the image capturing system, each of the plurality of image capturing apparatuses is connected to the image processing apparatus, but the method of connecting the plurality of image capturing apparatuses and the image processing apparatus is not limited to this. For example, the plurality of image capturing apparatuses may be cascaded by connecting adjacent image capturing apparatuses to each other, with at least one of the plurality of image capturing apparatuses connected to the image processing apparatus.

Further, in the present embodiment, an explanation is made on the premise that, as illustrated in the same drawing as an example, the plurality of image capturing apparatuses are arranged in mutually different positions, but the number and the arrangement of the image capturing apparatuses are not limited to this. For example, in a case where the position, shape, and color of the object present in the image capturing space and the intensity, tint, or the like of ambient light do not change over time, at least one image capturing apparatus whose position and orientation can be changed may be arranged. In this case, the image capturing apparatus is made to perform image capturing at each of a plurality of mutually different positions while its position and orientation are changed, and the image processing apparatus may obtain the resulting plurality of pieces of captured image data as a piece of multi-viewpoint image data.

The UI panel includes a display device such as a liquid crystal panel and displays, on the display device, a GUI (graphical user interface) that presents to a user the image capturing condition of the image capturing apparatus and information such as a process setting of the image processing apparatus. Further, the UI panel may include an input device such as a touch panel or a button, and in this case, the UI panel receives, via the input device, an instruction from a user relating to a change or the like in the above image capturing condition, the process setting, or the like. The input device may also be provided separately from the UI panel, as with a mouse or a keyboard.

The storage apparatus is composed of a hard disk drive or the like, obtains information indicating the three-dimensional field relating to the object output from the image processing apparatus, and stores the obtained information. Further, in a case where the image processing apparatus generates the virtual viewpoint image, the storage apparatus may obtain the piece of data on the virtual viewpoint image output from the image processing apparatus and store the obtained virtual viewpoint image data.

The display apparatus is composed of a liquid crystal display or the like, obtains a signal of a display image including an image indicating the three-dimensional field relating to the object output from the image processing apparatus, and displays the display image corresponding to the signal. Further, in a case where the image processing apparatus generates the virtual viewpoint image, the display apparatus may obtain the signal of the display image including the virtual viewpoint image output from the image processing apparatus and display the display image corresponding to the signal.

The image capturing space is a three-dimensional space surrounded by the plurality of image capturing apparatuses installed in a studio or the like, and a frame drawn in a solid line in the drawing of the image capturing system indicates the outline of the image capturing space on the floor surface. Hereinafter, as an example, an aspect in which 12 image capturing apparatuses installed in the studio are used to capture one or more objects from their surroundings is explained.

One of the attached drawings is a block diagram illustrating an example of a hardware configuration of the image processing apparatus in accordance with Embodiment 1. The image processing apparatus has, as the hardware configuration, a CPU, a RAM, a ROM, a storage device, a control interface (hereinafter “I/F”), an input I/F, an output I/F, and a main bus.

The CPU is a processor that performs overall control of each unit of the image processing apparatus. The CPU executes an OS (operating system) and various kinds of programs stored in the ROM, the storage device, or the like by using the RAM as work memory. By executing the various kinds of programs, the CPU controls the entire image processing apparatus via the main bus. Incidentally, the process of each step illustrated in a flow chart described below is realized as a result of a program code stored in the ROM, the storage device, or the like being loaded into the RAM and executed by the CPU. The RAM functions as main memory, a work area, or the like of the CPU. The ROM stores a group of programs to be executed by the CPU. The storage device is composed of a hard disk drive or the like and stores an application program to be executed by the CPU and various kinds of data or the like to be used for processes of the CPU.

The control I/F is connected to each of the plurality of image capturing apparatuses and is a communication interface for performing control on each image capturing apparatus, such as the setting of an image capturing condition, a start of image capturing, and a stop of the image capturing. The input I/F is a communication interface for performing communication over a serial bus or the like such as an SDI (Serial Digital Interface) or HDMI® (High-Definition Multimedia Interface®). A piece of captured image data output from each image capturing apparatus is obtained via the input I/F. The output I/F is a communication interface for performing communication over a serial bus or the like such as a USB (Universal Serial Bus) or a DisplayPort®. The piece of data or the signal on the three-dimensional field and the virtual viewpoint image is output to the storage apparatus or the display apparatus via the output I/F. The main bus is a transmission line that connects the above hardware components of the image processing apparatus so that they can communicate with one another.

One of the attached drawings is a block diagram illustrating an example of a functional configuration of the image processing apparatus in accordance with Embodiment 1. The image processing apparatus has, as the functional configuration, a parameter obtaining unit, an image obtaining unit, a first generation unit, a learning unit, a viewpoint obtaining unit, a second generation unit, and an output unit. Each of these units is realized as a result of the CPU executing the program stored in the ROM or the like by using the RAM as work memory. Incidentally, not all the processes described below have to be realized by the CPU executing the program, and the image processing apparatus may be configured so that part or all of the processes are executed by one or more processing circuits other than the CPU.

The parameter obtaining unit obtains a camera parameter of each image capturing apparatus. An explanation is made on the premise that the camera parameter of each image capturing apparatus is stored in the storage device in advance. However, the camera parameter of each image capturing apparatus may instead be estimated by the image processing apparatus by using the piece of captured image data. In this case, for example, the image processing apparatus estimates the camera parameter of each image capturing apparatus by using COLMAP, an algorithm well known in the technical field of NeRF and the like, in which the shape of the object is estimated while the image capturing position is estimated based on the captured images. The camera parameter includes an intrinsic parameter, an extrinsic parameter, a distortion parameter, and the like.

The intrinsic parameter is a parameter representing the coordinates of the center of the captured image obtained by the image capturing performed by the image capturing apparatus and the focal length of the lens of the image capturing apparatus. The extrinsic parameter is a parameter representing the position and orientation of the image capturing apparatus, and the distortion parameter is a parameter indicating the distortion of the lens. Incidentally, the plurality of image capturing apparatuses do not have to share common camera parameters, and in particular do not have to share a common intrinsic parameter or a common distortion parameter. For example, the viewing angles of some of the image capturing apparatuses may differ from those of the other image capturing apparatuses. The camera parameter obtained by the parameter obtaining unit is transmitted to the first generation unit and the learning unit.
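As an illustration of how the intrinsic and extrinsic parameters relate a pixel to a ray in the image capturing space, the following sketch computes a ray under a simple pinhole model. The function name `pixel_ray`, the dictionary keys, and the camera-to-world rotation convention are assumptions made for this example, and lens distortion is ignored.

```python
import math

def pixel_ray(u, v, intrinsic, rotation, position):
    """Return (origin, direction) of the ray through pixel (u, v).

    intrinsic: dict with focal lengths "fx", "fy" and image center "cx", "cy"
    rotation:  3x3 camera-to-world rotation as nested lists (orientation)
    position:  camera position in world coordinates
    """
    # Direction in camera coordinates, looking along +z (assumed convention).
    d_cam = [
        (u - intrinsic["cx"]) / intrinsic["fx"],
        (v - intrinsic["cy"]) / intrinsic["fy"],
        1.0,
    ]
    # Rotate into world coordinates and normalize to unit length.
    d_world = [sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    return tuple(position), tuple(c / norm for c in d_world)

# Usage: a camera at the origin with identity orientation.
intrinsic = {"fx": 100.0, "fy": 100.0, "cx": 320.0, "cy": 240.0}
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = pixel_ray(320.0, 240.0, intrinsic, identity, (0.0, 0.0, 0.0))
```

Because each image capturing apparatus may have its own intrinsic parameter, a real pipeline would call such a function with a separate `intrinsic` per apparatus.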

The image obtaining unit obtains the piece of captured image data obtained by the image capturing performed by each image capturing apparatus. The source from which the piece of captured image data is obtained is not limited to the image capturing apparatus, and the image obtaining unit may obtain the piece of captured image data by reading it out from the storage device, the storage apparatus, or the like. Further, the image obtaining unit obtains a piece of data on a background image corresponding to the captured image captured by each image capturing apparatus (hereinafter referred to as “background image data”). The background image is an image captured by using the same camera parameter as that used for capturing the corresponding captured image, and is obtained, for example, by the following image capturing.

For example, the background image is an image obtained by capturing in advance, in a state where the object does not exist, things other than the object, such as the ground or a structure, that are expected to appear in the image when the object is captured. For example, the piece of background image data is stored in the storage device, the storage apparatus, or the like in advance, and the image obtaining unit obtains the piece of background image data by reading it out from the storage device or the like. The piece of captured image data obtained by the image obtaining unit is transmitted to the first generation unit and the learning unit, and the piece of background image data is transmitted to the first generation unit.

The first generation unit generates an image in which a gloss component of the object is reduced (hereinafter referred to as “low gloss image”) based on the piece of captured image data and the piece of background image data transmitted from the image obtaining unit. Details of the process of generating the low gloss image using the piece of captured image data and the piece of background image data in the first generation unit are described later in the present embodiment. Further, for example, the first generation unit may generate the low gloss image by using the piece of captured image data and the piece of background image data transmitted from the image obtaining unit and the camera parameter transmitted from the parameter obtaining unit. Details of the process of generating the low gloss image using the piece of captured image data, the piece of background image data, and the camera parameter in the first generation unit are described in Embodiment 2. A piece of data on the low gloss image generated by the first generation unit (hereinafter referred to as “low gloss image data”) is transmitted to the learning unit.

The learning unit estimates the three-dimensional field relating to the object by performing a learning process in which the piece of captured image data transmitted from the image obtaining unit and the piece of low gloss image data transmitted from the first generation unit are pieces of ground truth data. In the present embodiment, an explanation is made on the premise that the learning unit estimates, as the three-dimensional field relating to the object, a radiance field represented by the color and the density of a learning space set in the image capturing space. The radiance field is represented by using equation (1), for example.

$$F_\Theta : (x, y, z, \theta, \Phi) \rightarrow (R, G, B, \sigma) \quad (1)$$

$F_\Theta$ is a function which, in a case where three-dimensional coordinates (x, y, z) in the set learning space and a direction (θ, Φ) in the learning space are input, outputs a color (R, G, B) and a density σ at the three-dimensional coordinates (x, y, z). In the present embodiment, an explanation is made on the premise that a radiance field is estimated as a result of the learning unit performing deep learning on a learning model in which $F_\Theta$ is realized by a multilayer perceptron (MLP) (hereinafter referred to as “three-dimensional field learning model”). In other words, in the present embodiment, as an example, an aspect is described in which a learning process using the piece of captured image data and the piece of low gloss image data as the pieces of ground truth data is performed on the three-dimensional field learning model. Therefore, in the present embodiment, the result of the estimation of the three-dimensional field (radiance field) relating to the object is obtained as a learned three-dimensional field learning model (hereinafter referred to as “three-dimensional field learned model”). Information on the three-dimensional field learned model obtained as the result of the estimation is transmitted to the second generation unit and the output unit.
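To make the input and output of the function F described above concrete, the following sketch implements a very small multilayer perceptron with untrained random weights. The layer sizes, the sigmoid and ReLU output activations, and the names `init_mlp` and `radiance_field` are assumptions for illustration only; the source does not specify the network architecture.

```python
import math
import random

def init_mlp(hidden=16, seed=0):
    """Create random (untrained) parameters, grouped by role."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        return weights, [0.0] * n_out

    return {
        "hidden": layer(5, hidden),    # input: x, y, z, theta, phi
        "color": layer(hidden, 3),     # output head for (R, G, B)
        "density": layer(hidden, 1),   # output head for sigma
    }

def _affine(layer, values):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]

def radiance_field(params, x, y, z, theta, phi):
    """Map a position and direction to a color in [0, 1] and a density >= 0."""
    hidden = [max(0.0, v) for v in _affine(params["hidden"], [x, y, z, theta, phi])]
    color = tuple(1.0 / (1.0 + math.exp(-v)) for v in _affine(params["color"], hidden))
    sigma = max(0.0, _affine(params["density"], hidden)[0])
    return color, sigma

# Usage: query the (untrained) field at one position and direction.
params = init_mlp()
color, sigma = radiance_field(params, 0.1, 0.2, 0.3, 0.5, 1.0)
```

Keeping the color-related and density-related parameters in separate groups is a design choice of this sketch; it makes it easy to treat the two groups differently during learning.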

The viewpoint obtaining unit obtains virtual viewpoint information including at least information indicating the position of the virtual viewpoint and information indicating a viewing direction from the virtual viewpoint (hereinafter referred to as “virtual viewpoint direction”). For example, the information indicating the position and the direction of the virtual viewpoint is given as a result of a user providing input by using a GUI (not illustrated) displayed on the UI panel. The virtual viewpoint information obtained by the viewpoint obtaining unit is transmitted to the second generation unit.

The second generation unit generates the virtual viewpoint image by using the three-dimensional field learned model transmitted from the learning unit and the virtual viewpoint information transmitted from the viewpoint obtaining unit. Specifically, the second generation unit generates the virtual viewpoint image corresponding to the virtual viewpoint information by inputting the virtual viewpoint information to the three-dimensional field learned model and setting each pixel value output from the three-dimensional field learned model as a pixel value of the virtual viewpoint image. A piece of data on the virtual viewpoint image generated by the second generation unit is transmitted to the output unit.

The output unit outputs the three-dimensional field learned model transmitted from the learning unit. Specifically, for example, the output unit outputs a piece of data on the three-dimensional field learned model to the storage apparatus and makes the storage apparatus store it. The output unit may also display the three-dimensional field as a display image on the display apparatus by converting the three-dimensional field indicated by the three-dimensional field learned model into a display image signal and outputting the display image signal to the display apparatus. Further, the output unit outputs the virtual viewpoint image transmitted from the second generation unit. For example, the output unit outputs the piece of data on the virtual viewpoint image to the storage apparatus and makes the storage apparatus store it. The output unit may also display the virtual viewpoint image on the display apparatus by converting the virtual viewpoint image into the display image signal and outputting the display image signal to the display apparatus.

One of the attached drawings is a flow chart illustrating an example of a process flow in the image processing apparatus in accordance with Embodiment 1. With reference to this flow chart, an operation of the image processing apparatus is described. First, the parameter obtaining unit obtains a camera parameter of each image capturing apparatus. Next, the image obtaining unit obtains the piece of captured image data obtained by the image capturing performed by each image capturing apparatus and the piece of background image data corresponding to each piece of captured image data. Subsequently, the first generation unit generates the low gloss image. Specifically, the first generation unit first extracts an image area including a representation of the object (hereinafter referred to as “object area”) from each obtained captured image by using the obtained piece of background image data. Next, the first generation unit performs the process of generating the low gloss image based on the extracted object area. A specific process of this generation step is described below.

After the low gloss image is generated, the learning unit performs, on the three-dimensional field learning model, a learning process in which the generated low gloss image data is the ground truth data (hereinafter referred to as “first learning”). A specific process of the first learning is described below. Next, the learning unit initializes the parameters relating to color among the parameters of the three-dimensional field learning model obtained as a result of the first learning process. Subsequently, the learning unit performs, on the three-dimensional field learning model whose parameters relating to color have been initialized, a learning process in which the piece of captured image data obtained by the image capturing performed by each image capturing apparatus is the ground truth data (hereinafter referred to as “second learning”). A specific process of the second learning is described below.
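The order of the first learning, the initialization of the color-related parameters, and the second learning can be sketched as follows. This is only a schematic of the schedule: `train_step` is a hypothetical callable standing in for one update of the learning model, the parameter layout (a dictionary with `"density"` and `"color"` entries) is an assumption, and the Gaussian re-sampling is one possible way to initialize.

```python
import random

def reinitialize_color(params, rng):
    """Return a copy of params whose color-related values are re-sampled.

    All other entries (for example the density-related values) are kept.
    """
    new_params = dict(params)
    new_params["color"] = [rng.gauss(0.0, 0.1) for _ in params["color"]]
    return new_params

def two_stage_learning(params, low_gloss_images, captured_images, train_step, rng):
    # First learning: the low gloss images are the ground truth.
    for image in low_gloss_images:
        params = train_step(params, image)
    # Initialize only the parameters relating to color.
    params = reinitialize_color(params, rng)
    # Second learning: the captured images are the ground truth.
    for image in captured_images:
        params = train_step(params, image)
    return params

# Usage with a placeholder update that leaves the parameters unchanged.
initial = {"density": [1.0, 2.0], "color": [0.5, 0.5]}
learned = two_stage_learning(
    initial, ["low_gloss_0"], ["captured_0"], lambda p, image: p, random.Random(0)
)
```

The intent of such a schedule is that what was learned about density from the low gloss images is carried into the second learning, while the colors are learned again from the captured images.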

After the second learning, the output unit outputs the three-dimensional field learned model obtained as a result of the second learning process. Next, the viewpoint obtaining unit obtains the virtual viewpoint information. Subsequently, the second generation unit generates the virtual viewpoint image based on the three-dimensional field learned model obtained as a result of the second learning process and the obtained virtual viewpoint information. Then, the output unit outputs the generated virtual viewpoint image, after which the image processing apparatus ends the processes of the flow chart.

The image processing apparatus repeatedly performs the processes of the flow chart every time it receives a new piece of captured image data from each image capturing apparatus or the like. Further, in a case where the obtained piece of captured image data is a piece of moving image data, for example, the image processing apparatus repeatedly performs the processes of the flow chart every time it receives a new piece of frame data constituting the moving image from each image capturing apparatus or the like. In this case, as long as the camera parameter of each image capturing apparatus is not changed, the step of obtaining the camera parameter may be omitted. Similarly, in this case, the obtainment of the piece of background image data may be omitted.

The process of generating the low gloss image performed by the first generation unit is described with reference to the attached drawings. These include a flow chart illustrating an example of a flow of the process of generating the low gloss image in the first generation unit in accordance with Embodiment 1, and diagrams describing an example of that process. Specifically, the diagrams illustrate an example of a captured image, an example of a background image corresponding to the captured image, an example of an object area map corresponding to the captured image, an example of a low gloss image corresponding to the captured image, an example of the set values of the pixel values in the low gloss image, and an example of a GUI to set the pixel values in the low gloss image.

The process of this flow chart is performed after the preceding step of the main flow chart. First, the first generation unit selects a piece of data on any one captured image from the plurality of obtained pieces of captured image data. Next, the first generation unit generates the object area map based on a difference between the selected captured image and, of the plurality of obtained background images, the background image corresponding to the selected captured image. Specifically, the first generation unit first extracts an object area in the captured image based on the difference. Next, the first generation unit generates the object area map indicating the position of the extracted object area and the position of a non-object area, that is, the area other than the object area (the background area). For example, the first generation unit determines that a pixel in which the difference between the captured image and the background image is equal to or greater than a predetermined threshold value is a pixel included in the object area. Further, the first generation unit generates the object area map by determining that a pixel in which the difference is less than the predetermined threshold value is a pixel included in the non-object area.
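The thresholding described above can be sketched as follows. The images are represented here as nested lists of (R, G, B) tuples, and the per-pixel difference is taken as the largest absolute difference over the three color components; both the representation and this particular difference measure are assumptions of the example, as is the function name `object_area_map`.

```python
def object_area_map(captured, background, threshold):
    """Return a 2-D list of flags: 1 for the object area, 0 for the non-object area.

    captured, background: 2-D lists of (r, g, b) tuples with the same size
    threshold:            predetermined threshold value for the difference
    """
    area_map = []
    for captured_row, background_row in zip(captured, background):
        row = []
        for captured_px, background_px in zip(captured_row, background_row):
            # Assumed difference measure: largest per-channel absolute difference.
            diff = max(abs(c - b) for c, b in zip(captured_px, background_px))
            row.append(1 if diff >= threshold else 0)
        area_map.append(row)
    return area_map

# Usage: the second pixel differs strongly from the background.
captured = [[(10, 10, 10), (200, 50, 50)]]
background = [[(12, 9, 10), (10, 10, 10)]]
area_map = object_area_map(captured, background, threshold=30)
```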

Incidentally, the present embodiment is described on the premise that the object area in the captured image is extracted based on the difference between the captured image and the background image, but the method for extracting the object area is not limited to this. For example, the first generation unit may extract the object area in the captured image by using a learned model obtained through machine learning or the like. Specifically, for example, the first generation unit first inputs the captured image to a learned model that outputs information indicating the position of an area representing a predetermined object included in an input image. Next, the first generation unit extracts the object area by obtaining, from the output of the learned model, the information indicating the position of the area representing the object included in the captured image.

After S, in S, the first generation unit determines the pixel values, in the low gloss image, of the area corresponding to the object area and of the area corresponding to the non-object area in the object area map. For example, the pixel values of these two areas in the low gloss image are determined based on values input by the user through the GUI illustrated in as an example. Incidentally, the GUI is displayed on a display device or the like of the UI panel. The pixel values input in the GUI, namely an R (Red) value, a G (Green) value, a B (Blue) value, and an α value indicating the degree of transparency of a pixel, are held in the storage device in, for example, the form of the table illustrated in as an example. In other words, the first generation unit refers to the table held in the storage device or the like and thereby determines the pixel values of the two areas in the low gloss image. The low gloss image corresponding to the captured image is generated by the process in S.

Incidentally, in the table, the flag indicates in binary whether or not each pixel in the low gloss image is included in the object area in the object area map. The RGB values express the respective R, G, and B color components of the pixel in the range of 0 to 255. The α value expresses the degree of transparency of the pixel in the range of 0 to 255: the larger the α value, the more opaque the pixel, and the smaller the α value, the more transparent the pixel.
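The table lookup described above might look like the sketch below. The names `PIXEL_TABLE` and `make_low_gloss_image` are hypothetical, and the specific colors (opaque white for the object area, fully transparent black for the non-object area) are placeholder values chosen only for illustration; in the described apparatus they would come from the user's input in the GUI.

```python
# Hypothetical table: flag -> (R, G, B, alpha), each component in 0..255.
# flag 1 = pixel belongs to the object area, flag 0 = non-object area.
PIXEL_TABLE = {
    1: (255, 255, 255, 255),  # assumed value: opaque white
    0: (0, 0, 0, 0),          # assumed value: fully transparent black
}


def make_low_gloss_image(area_map, table=PIXEL_TABLE):
    """Fill each pixel with the table entry selected by its area-map flag."""
    for rgba in table.values():
        if len(rgba) != 4 or not all(0 <= v <= 255 for v in rgba):
            raise ValueError("each table entry needs four values in 0..255")
    return [[table[flag] for flag in row] for row in area_map]
```

Because every pixel of the object area receives the same value, the resulting image carries the silhouette of the object but none of its view-dependent highlights.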

After S, in S, the first generation unit judges whether all the pieces of captured image data obtained in S have been selected in S. In a case where it is judged in S that at least one piece of captured image data has not been selected, the first generation unit returns to the process in S and repeats the processes from S to S until it is judged in S that all the pieces of captured image data have been selected. In the repeated processes, in S, the first generation unit selects, for example, any piece of captured image data which has not been selected yet. In a case where it is judged in S that all the pieces of captured image data have been selected, the first generation unit ends the processes in the flow chart illustrated in, namely, the process in S.

With reference to, the first learning process in S performed by the learning unit is described. is a flow chart illustrating an example of a flow of the first learning process, performed by the learning unit in accordance with Embodiment 1, in which the low gloss image data serves as the ground truth data. The processes in the flow chart illustrated in are performed after the process in S illustrated in. After S, first, in S, the learning unit initializes the three-dimensional field learning model by setting initial values to the parameters and hyper-parameters of the three-dimensional field learning model. Next, in S, the learning unit selects any piece of low gloss image data from the plurality of pieces of low gloss image data generated in S. The piece of low gloss image data selected in S is used as the ground truth data in the first learning process, in the process in S mentioned below.

Next, in S, the learning unit obtains information on the ray corresponding to each pixel of the low gloss image (hereinafter referred to as “ray information”) based on the camera parameter of the image capturing apparatus that captured the captured image corresponding to the piece of low gloss image data selected in S. In the present embodiment, the ray information includes information indicating the position of the image capturing apparatus, which is the start point of the ray, information indicating the direction of the ray, and information indicating the value (pixel value) of the pixel in the low gloss image corresponding to the ray. The direction of the ray can be represented by using, for example, equation (2) based on the camera parameter and the coordinates of the pixel in the low gloss image.

Here, d represents the direction vector of the ray, (u, v) represents the coordinates of the pixel in the low gloss image, (cx, cy) represents the coordinates of the center of the low gloss image, and (fx, fy) represents the focal length of the image capturing apparatus.
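Equation (2) is not reproduced in this text, so the sketch below assumes the common pinhole-camera form, in which the ray direction in camera coordinates is ((u - cx) / fx, (v - cy) / fy, 1). The function name `ray_direction`, the parameter names, and the optional normalization are assumptions; the expression in the original specification may differ in sign convention, in normalization, or by including the rotation into world coordinates.

```python
import math


def ray_direction(u, v, cx, cy, fx, fy, normalize=True):
    """Direction of the ray through pixel (u, v), in camera coordinates.

    Assumed pinhole model: d = ((u - cx) / fx, (v - cy) / fy, 1).
    (cx, cy) is the image center and (fx, fy) the focal length in pixels.
    """
    d = ((u - cx) / fx, (v - cy) / fy, 1.0)
    if normalize:
        norm = math.sqrt(sum(x * x for x in d))
        d = tuple(x / norm for x in d)
    return d
```

Under this assumption, the ray through the image center points straight along the optical axis, and rays through pixels farther from the center tilt proportionally to their offset divided by the focal length.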

After S, in S, the learning unit performs volume rendering on the three-dimensional field learning model in the middle of learning, based on the ray information obtained in S. Each pixel value in the virtual viewpoint image corresponding to the appearance from the start point of the ray indicated by the ray information is thereby calculated. Specifically, first, the learning unit sets, based on the ray information, as many rays as the number of rays used for learning, which is preset by the hyper-parameters or the like. Next, the learning unit sets, on each ray, as many sampling points as the number of sampling points preset by the hyper-parameters or the like. Subsequently, the learning unit performs volume rendering by accumulating the colors estimated at the plurality of sampling points set on each ray according to the estimated density. More specifically, first, the learning unit sets the plurality of sampling points on each ray in the learning space and obtains the estimated values of the color and the density at each sampling point based on the position of the sampling point and the direction of the ray. Next, the learning unit calculates the value (pixel value) of the pixel corresponding to the ray in the virtual viewpoint image by using, for example, equation (3) and equation (4).

Here, C(r) is the pixel value corresponding to a ray r in the virtual viewpoint image, i is the index of a sampling point, σi is the density at the sampling point, δi is the distance to the next sampling point, and ci is the color value at the sampling point. Incidentally, Ti is the accumulated transmittance at each sampling point.
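Equations (3) and (4) are not reproduced in this text. The sketch below assumes the widely used discrete volume-rendering form, in which each sampling point contributes its color weighted by the accumulated transmittance times (1 - exp(-density × distance)), and the accumulated transmittance is the exponential of the negated sum of density × distance over the preceding sampling points. The function name `render_ray` and its argument layout are assumptions, and the specification's exact equations may differ.

```python
import math


def render_ray(sigmas, deltas, colors):
    """Accumulate the colors of the sampling points along one ray.

    sigmas: estimated density at each sampling point
    deltas: distance from each sampling point to the next one
    colors: estimated (R, G, B) color at each sampling point
    Returns the pixel value C(r) as an (R, G, B) tuple.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # nothing has attenuated the ray before the first point
    for sigma, delta, color in zip(sigmas, deltas, colors):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        # Equivalent to multiplying by exp(-sigma * delta).
        transmittance *= 1.0 - alpha
    return tuple(pixel)
```

With this formulation, a ray passing only through empty space (zero density everywhere) yields a zero pixel value, and a sampling point with very high density hides everything behind it.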

is a diagram illustrating an example of a ray r and sampling points in accordance with Embodiment 1. illustrates an example of the ray r corresponding to a pixel at the coordinates (u, v) in the low gloss image selected in S. In, the plurality of black points illustrated in a learning space are the sampling points set on the ray r.
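One simple way to place sampling points on a ray is sketched below. The text only says that a preset number of sampling points is set on each ray; the uniform spacing, the midpoint placement, the `near` and `far` bounds of the learning space, and the function name `sample_points` are all assumptions made for illustration.

```python
def sample_points(origin, direction, near, far, n_samples):
    """Place n_samples evenly spaced sampling points on a ray.

    origin: start point of the ray (position of the image capturing apparatus)
    direction: direction vector of the ray (assumed to be unit length)
    near, far: assumed bounds of the learning space along the ray
    Returns (points, deltas), where deltas[i] is the spacing used as the
    distance from point i to the next sampling point.
    """
    step = (far - near) / n_samples
    points, deltas = [], []
    for i in range(n_samples):
        t = near + (i + 0.5) * step  # midpoint of the i-th interval
        points.append(tuple(o + t * d for o, d in zip(origin, direction)))
        deltas.append(step)  # uniform spacing, also used for the last point
    return points, deltas
```

The returned positions and spacings are the kind of per-ray inputs a density and color estimator would consume before the accumulation step along the ray.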
