A device according to one embodiment of the present disclosure may comprise a display; and at least one processor configured to: obtain a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints, obtain a 3D image including one or more segmented 3D objects based on the obtained plurality of 2D images, display a user interface (UI) element indicating that a segmented 3D object is selected on the display in response to a first command to select one of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object, and display an edited 3D object on the display in response to a second command to edit the selected 3D object.
Legal claims defining the scope of protection, as filed with the USPTO.
1-20. (canceled)
21. A device comprising: a display; and at least one processor configured to: obtain a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints, obtain a three-dimensional (3D) image including one or more segmented 3D objects based on the obtained plurality of 2D images, in response to a first command to select a segmented 3D object of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object, control the display to display a user interface (UI) element indicating that the segmented 3D object is selected, and in response to a second command to edit the selected 3D object, control the display to display the edited 3D object.
22. The device of claim 21, wherein the second command is for editing an attribute of the selected 3D object, and the attribute of the selected 3D object includes at least one of a color, a size or a position of the selected 3D object.
23. The device of claim 22, wherein the at least one processor is further configured to: obtain a text input as the second command, and edit the attribute of the selected 3D object according to the text input.
24. The device of claim 22, wherein the position of the selected 3D object represents a transformation along at least one of an x-axis direction, a y-axis direction, or a z-axis direction.
25. The device of claim 21, wherein the at least one processor is further configured to: identify, on the 3D image, 3D objects of the one or more segmented 3D objects that are editable.
26. The device of claim 25, wherein the at least one processor is configured to: identify the 3D objects that are editable using different colors.
27. The device of claim 21, wherein the at least one processor is further configured to: control the display to display the selected 3D object at a separate area from an area where the 3D image is displayed, and wherein the selected 3D object displayed at the separate area represents the unit object, or the selected 3D object displayed at the separate area represents the detailed object constituting the unit object.
28. The device of claim 21, wherein the at least one processor is further configured to: select the segmented 3D object representing the unit object in response to the first command, based on a segmentation level being set to a first level that segments the 3D image into unit objects, and select the 3D object representing the detailed object in response to the first command, based on the segmentation level being set to a second level that segments each unit object included in the 3D image into detailed objects.
29. The device of claim 21, wherein the at least one processor is further configured to: select the 3D object representing the unit object based on the 3D object being selected once, and select the 3D object representing the detailed object based on the 3D object being selected more than once consecutively.
30. The device of claim 21, wherein the at least one processor is further configured to: control the display to display tag information including one or more tags identifying 3D objects of the one or more segmented 3D objects that are editable at an area other than an area where the 3D image is displayed.
31. The device of claim 30, wherein each of the one or more tags comprises at least one of a color or a name for identifying the corresponding 3D object.
32. The device of claim 21, wherein the UI element comprises an indicator indicating that the segmented 3D object is selected.
33. The device of claim 21, wherein the at least one processor is further configured to: control the display to display an editing menu for editing the one or more segmented 3D objects, and receive the second command through the editing menu.
34. The device of claim 33, wherein the editing menu comprises a granularity level item for setting a granularity level, a size item for setting a size of the selected 3D object, and a transformation item for setting a position of the selected 3D object.
35. The device of claim 34, wherein the at least one processor is further configured to: reflect the edited 3D object in real time on the 3D image according to the second command for editing a value of one of the granularity level item, the size item or the transformation item for setting the position of the selected 3D object.
36. The device of claim 21, wherein the at least one processor is further configured to: obtain the plurality of 2D images through a camera, and control the display to display a guide indicating that additional shooting is required to generate the 3D image.
37. The device of claim 21, wherein the at least one processor is further configured to: generate a top view image based on the 3D image, based on receiving the second command for editing an attribute of a first 3D object included in the 3D image, change the attribute of the first 3D object according to the received second command while changing an attribute of a second 3D object corresponding to the first 3D object included in the top view image in a same manner.
38. The device of claim 21, wherein the at least one processor is further configured to: obtain the 3D image including the one or more segmented 3D objects based on the plurality of 2D images through a 3D segment model, wherein the 3D segment model is configured to segment an object by extracting global feature candidates representing each segment based on the plurality of 2D images, clustering the extracted global feature candidates, and grouping segments corresponding to a same object.
39. A method of operating a device, the method comprising: obtaining a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints; obtaining a three-dimensional (3D) image including one or more segmented 3D objects based on the obtained plurality of 2D images; in response to a first command to select a segmented 3D object of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object, displaying a user interface (UI) element indicating that the segmented 3D object is selected; and in response to a second command to edit the selected 3D object, displaying the edited 3D object.
40. A non-transitory recording medium storing computer-readable instructions that, when executed by a device, cause the device to perform operations comprising: obtaining a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints; obtaining a three-dimensional (3D) image including one or more segmented 3D objects based on the obtained plurality of 2D images; in response to a first command to select a segmented 3D object of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object, displaying a user interface (UI) element indicating that the segmented 3D object is selected; and in response to a second command to edit the selected 3D object, displaying the edited 3D object.
Complete technical specification and implementation details from the patent document.
Pursuant to 35 U.S.C. § 119, this application claims the benefit of earlier filing date and right of priority to Korean Patent Application Nos. 10-2024-0089114, filed on Jul. 5, 2024, and 10-2025-0088944, filed on Jul. 3, 2025, the contents of which are all hereby incorporated by reference herein in their entirety.
The present invention relates to a device capable of segmenting and recognizing an object in an image.
Conventional techniques for segmenting objects in a three-dimensional (3D) environment have various limitations.
Although some technologies may segment objects within a 3D scene, they have limitations in precisely segmenting the detailed structure of the object, making it difficult to accurately segment the object with a complex shape.
In particular, in many cases, there is an inconvenience in that the user must manually set the segmentation threshold while directly checking the shape or location of each object.
This approach places a heavy workload on the user, and accurate segmentation may be difficult for anyone who is not an expert.
In addition, some existing models include complex preprocessing or postprocessing steps for object segmentation or adopt a network structure with a large computational volume, making real-time processing difficult.
This directly makes such models difficult to apply to interactive applications or user participation-based applications.
Furthermore, systems that require complex computational processes or manual adjustments have the disadvantage of being unintuitive to general users and having a high barrier to entry.
In addition, such systems are not very versatile, as they require repeated user feedback for accurate segmentation or a separate pre-training dataset.
Therefore, there is a growing demand for a technology that may efficiently segment objects in a 3D environment in a more precise and automated manner.
The purpose of the present disclosure may be to solve the problems pointed out above.
An object of the present disclosure may be to provide a novel 3D object segmentation technique that offers both real-time performance and user-friendliness.
An object of the present disclosure may be to provide a method for easily and quickly performing various segmentations by dividing an object into two stages (an object unit and a smaller unit constituting the object).
An object of the present disclosure may be to provide a method that allows a user to interactively segment a 3D environment in real time with just a click without setting separate boundary values.
A device according to one embodiment of the present disclosure may comprise a display; and at least one processor configured to: obtain a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints, obtain a 3D image including one or more segmented 3D objects based on the obtained plurality of 2D images, display a user interface (UI) element indicating that a segmented 3D object is selected on the display in response to a first command to select one of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object, and display an edited 3D object on the display in response to a second command to edit the selected 3D object.
A method of operating a device according to one embodiment of the present disclosure may comprise: obtaining a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints; obtaining a 3D image including one or more segmented 3D objects based on the obtained plurality of 2D images; displaying a user interface (UI) element indicating that a segmented 3D object is selected in response to a first command to select one of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object; and displaying an edited 3D object in response to a second command to edit the selected 3D object.
A non-transitory recording medium according to one embodiment of the present disclosure may have recorded thereon a computer-readable program for performing a method of operating a device, the method comprising: obtaining a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints; obtaining a 3D image including one or more segmented 3D objects based on the obtained plurality of 2D images; displaying a user interface (UI) element indicating that a segmented 3D object is selected in response to a first command to select one of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object; and displaying an edited 3D object on a display in response to a second command to edit the selected 3D object.
According to an embodiment of the present disclosure, since editing is possible in a segmented object unit on a 3D image, attributes of each object may be individually adjusted, thereby providing precise customization and an intuitive interface without affecting the entire scene.
According to an embodiment of the present disclosure, by segmenting detailed objects constituting a unit object on a 3D image, attributes of each detailed object may be individually adjusted, and a user may perform precise customization at the detailed element level of the object.
According to an embodiment of the present disclosure, since a segmented 3D object and detailed objects of the 3D object on a 3D image are displayed on separate areas, the cognitive burden of manipulating a specific object in the entire scene may be reduced, and the accuracy of user input may be increased.
According to an embodiment of the present disclosure, by separately displaying coarse and fine level results and enabling selection and attribute editing for each segment, a user may perform intuitive and precise editing work based on a hierarchical structure even within a complex 3D scene.
According to embodiments of the present disclosure, an instability of contrastive learning in which the same object is classified with different segment IDs in different views may be resolved.
Artificial intelligence refers to the field of researching artificial intelligence or methodology to create it, and machine learning refers to the field of defining various problems dealt with in the field of artificial intelligence and researching methodology to solve them.
Machine learning is also defined as an algorithm that improves the performance of a task through consistent experience.
An artificial neural network (ANN) is a model used in machine learning. It may refer to an overall model with problem-solving capability that is composed of artificial neurons (nodes) forming a network through the combination of synapses.
An artificial neural network may be defined by the connection pattern between neurons in different layers, a learning process that updates the model parameters, and an activation function that generates an output value.
An artificial neural network may include an input layer, an output layer, and optionally one or more hidden layers. Each layer may include one or more neurons, and the artificial neural network may include synapses connecting the neurons. In an artificial neural network, each neuron may output the value of an activation function applied to the input signals received through the synapses, the weights, and the bias.
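As a minimal sketch of the neuron described above, the following Python snippet computes the output of one neuron as an activation function applied to the weighted sum of its inputs plus a bias. The function names and the choice of a sigmoid activation are assumptions made for illustration; the disclosure does not specify a particular activation function.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation (an illustrative choice, not taken from the disclosure)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Apply the activation function to the weighted sum of the inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer_output(inputs, weight_rows, biases, activation=sigmoid):
    """Evaluate every neuron of one layer on the same input signals."""
    return [
        neuron_output(inputs, weights, bias, activation)
        for weights, bias in zip(weight_rows, biases)
    ]
```

Stacking several calls to `layer_output` would correspond to passing signals from the input layer through hidden layers to the output layer.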
A model parameter refers to a parameter determined through learning, and includes the weights of synaptic connections and the biases of neurons. A hyperparameter refers to a parameter that must be set before learning in a machine learning algorithm, and includes the learning rate, the number of repetitions, the mini-batch size, the initialization function, etc.
The purpose of training an artificial neural network may be seen as determining the model parameters that minimize the loss function. The loss function may be used as an indicator for determining the optimal model parameters during the learning process of an artificial neural network.
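To make the distinction between model parameters and hyperparameters concrete, the sketch below fits a one-input linear model by gradient descent on a mean squared error loss. The model, the loss, and all names are illustrative assumptions and not the training procedure of the disclosure: `w` and `b` play the role of model parameters, while `learning_rate` and `num_repetitions` are hyperparameters fixed before learning starts.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, num_repetitions=2000):
    """Determine model parameters (w, b) that minimize the loss by gradient descent."""
    w, b = 0.0, 0.0  # initialization is itself a hyperparameter choice
    n = len(xs)
    for _ in range(num_repetitions):
        # Gradients of the MSE loss with respect to each model parameter.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Example data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train(xs, ys)
```

The loss value before and after training can be compared with `mse_loss` to check that the learned parameters actually reduce it.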
Machine learning may be classified into supervised learning, unsupervised learning, and reinforcement learning depending on the learning method.
Supervised learning refers to a method of training an artificial neural network with labels given for the training data. A label may mean the correct answer (or result value) that the artificial neural network must infer when the training data is input to it.
Unsupervised learning may refer to a method of training an artificial neural network in a state where no label for training data is given.
Reinforcement learning may refer to a learning method in which an agent defined within an environment learns to select an action or action sequence that maximizes the cumulative reward in each state.
Among artificial neural networks, machine learning implemented with a deep neural network (DNN) that includes multiple hidden layers is also called deep learning, and deep learning is a part of machine learning.
Hereinafter, machine learning is used to include deep learning.
FIG. 1 is a block diagram for illustrating elements of an artificial intelligence device according to an embodiment of the present disclosure.
The artificial intelligence device 100 may be implemented as a fixed or movable device such as a TV, a projector, a mobile phone, a smartphone, a desktop computer, a laptop, a digital broadcasting terminal, a PDA (personal digital assistant), a PMP (portable multimedia player), a navigation device, a tablet PC, a wearable device, a set-top box (STB), a DMB receiver, a radio, a washing machine, a refrigerator, a digital signage, a robot, a vehicle, etc.
Referring to FIG. 1, the artificial intelligence device 100 may include a communication interface 110, an input interface 120, a learning processor 130, a sensor 140, an output interface 150, a memory 170, and a processor 180.
The communication interface 110 may transmit and receive data with external devices such as other artificial intelligence devices or the AI server 200 using wired or wireless communication technology. For example, the communication interface 110 may transmit and receive sensor information, user input, learning models, and control signals with external devices.
Communication technologies used by the communication interface 110 include Global System for Mobile communication (GSM), Code Division Multiple Access (CDMA), Long Term Evolution (LTE), 5G, Wireless LAN (WLAN), Wireless-Fidelity (Wi-Fi), Bluetooth, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), ZigBee, Near Field Communication (NFC), etc.
The input interface 120 may acquire various types of data.
The input interface 120 may include a camera 121 for capturing images, a microphone 122 for receiving audio signals, and a user input interface 123 for receiving information from a user.
The camera 121 or the microphone 122 may be treated as a sensor, and the signal obtained from the camera 121 or the microphone 122 may be called sensing data or sensor information.
The input interface 120 may obtain training data for model learning and input data to be used when obtaining an output using the learning model. The input interface 120 may acquire unprocessed input data, and in this case, the processor 180 or the learning processor 130 may extract input features by preprocessing the input data.
The camera 121 processes image frames such as still images or moving images obtained by an image sensor in video call mode or photographing mode. Processed image frames may be displayed on the display 151 or stored in the memory 170.
The microphone 122 processes external acoustic signals into electrical voice data. The processed voice data may be utilized in various ways depending on the function (or application) being executed by the artificial intelligence device 100. Meanwhile, various noise removal algorithms may be applied to the microphone 122 to remove noise generated in the process of receiving an external acoustic signal.
The user input interface 123 is for receiving information from the user. When information is input through the user input interface 123, the processor 180 may control the operation of the artificial intelligence device 100 to correspond to the input information.
The user input interface 123 may include a mechanical input means (or mechanical key, for example, a button, dome switch, jog wheel, or jog switch located on the front/rear or side of the artificial intelligence device 100) and a touch input means.
As an example, the touch input means may consist of a virtual key, soft key, or visual key displayed on the touch screen through software processing, or a touch key placed in a part other than the touch screen.
The learning processor 130 may train a model composed of an artificial neural network using training data. The learned artificial neural network may be referred to as a learning model. A learning model may be used to infer a result value for new input data other than learning data, and the inferred value may be used as the basis for a decision to perform an operation.
The learning processor 130 may perform AI processing together with the learning processor 240 of the AI server 200.
The learning processor 130 may include a memory integrated or implemented in the artificial intelligence device 100. The learning processor 130 may be implemented using the memory 170, an external memory directly coupled to the artificial intelligence device 100, or a memory maintained in an external device.
The sensor 140 may obtain at least one of internal information of the artificial intelligence device 100, information on the surrounding environment of the artificial intelligence device 100, or user information using various sensors.
The sensor 140 may include at least one of a proximity sensor, an illumination sensor, an acceleration sensor, a magnetic sensor, a gyro sensor, an inertial sensor, an RGB sensor, an IR sensor, a fingerprint recognition sensor, an ultrasonic sensor, an optical sensor, a microphone, a lidar sensor, or a radar sensor.
The output interface 150 may generate output related to vision, hearing, or tactile sensation.
The output interface 150 may include a display 151 that outputs an image, an audio output interface 152 that outputs audio, a haptic device 153 that outputs tactile information, and an optical output interface 154 that outputs light.
The display 151 displays (outputs) information processed by the artificial intelligence device 100. For example, the display 151 may display execution screen information of an application running on the artificial intelligence device 100, or user interface (UI) and graphic user interface (GUI) information according to the execution screen information.
The display 151 may be implemented as a touch screen by forming a mutual layer structure or being integrated with the touch sensor. The touch screen functions as a user input interface 123 that provides an input interface between the artificial intelligence device 100 and the user, and may simultaneously provide an output interface between the artificial intelligence device 100 and the user.
The audio output interface 152 may output audio data received from the communication interface 110 or stored in the memory 170 in call signal reception, call mode or recording mode, voice recognition mode, broadcast reception mode, etc.
The audio output interface 152 may include at least one of a receiver, a speaker, or a buzzer.
The haptic device 153 generates various tactile effects that the user may feel. A representative example of a tactile effect generated by the haptic device 153 may be vibration.
The optical output interface 154 uses light from the light source of the artificial intelligence device 100 to output a signal to notify that an event has occurred. Examples of events that occur in the artificial intelligence device 100 may include receiving a message, receiving a call signal, a missed call, an alarm, a schedule notification, receiving an email, receiving information through an application, etc.
The memory 170 may store data supporting various functions of the artificial intelligence device 100. For example, the memory 170 may store input data obtained from the input interface 120, learning data, learning models, learning history, etc.
The processor 180 may determine at least one executable operation of the artificial intelligence device 100 based on information determined or generated using a data analysis algorithm or a machine learning algorithm.
The processor 180 may control the elements of the artificial intelligence device 100 to perform the determined operation.
To this end, the processor 180 may request, search, receive, or utilize data from the learning processor 130 or the memory 170, and may control elements of the artificial intelligence device 100 to perform an operation that is predicted or an operation that is determined to be desirable among the at least one executable operation.
If linkage with an external device is necessary to perform a determined operation, the processor 180 may generate a control signal to control the external device and transmit the generated control signal to the external device.
The processor 180 may obtain intent information for user input and determine the user's request based on the obtained intent information.
The processor 180 may obtain intent information corresponding to the user input using at least one of a Speech To Text (STT) engine for converting voice input into a character string or a Natural Language Processing (NLP) engine for acquiring intent information of natural language.
At least one of the STT engine or the NLP engine may be composed, at least in part, of an artificial neural network trained according to a machine learning algorithm. In addition, at least one of the STT engine or the NLP engine may be trained by the learning processor 130, trained by the learning processor 240 of the AI server 200, or trained by distributed processing thereof.
The processor 180 may collect history information including the user's feedback on the operation of the artificial intelligence device 100 and store it in the memory 170 or the learning processor 130, or transmit it to an external device such as the AI server 200. The collected history information may be used to update the learning model.
The processor 180 may control at least some of the elements of the artificial intelligence device 100 to run an application program stored in the memory 170.
The processor 180 may operate two or more of the elements included in the artificial intelligence device 100 in combination with each other in order to run the application program.
FIG. 2 is a diagram for illustrating the configuration of an artificial intelligence server according to an embodiment of the present disclosure.
Referring to FIG. 2, the AI server 200 may refer to a device that trains an artificial neural network using a machine learning algorithm or uses a learned artificial neural network.
The AI server 200 may be composed of a plurality of servers to perform distributed processing, and may be defined as a 5G network. The AI server 200 may be included as a part of the artificial intelligence device 100 and may perform at least part of the AI processing.
The AI server 200 may include a communication interface 210, a memory 230, a learning processor 240, and a processor 260.
The communication interface 210 may transmit and receive data with an external device such as the artificial intelligence device 100.
The memory 230 may include a model memory 231. The model memory 231 may store a model (or artificial neural network, 231a) that is being trained or has been learned through the learning processor 240.
The learning processor 240 may train the artificial neural network 231a using training data. The learning model may be used while mounted on the AI server 200, or may be mounted and used on an external device such as the artificial intelligence device 100.
The learning model may be implemented in hardware, software, or a combination of hardware and software. When part or all of the learning model is implemented as software, one or more instructions constituting the learning model may be stored in the memory 230.
The processor 260 may infer a result value for new input data using a learning model and generate a response or control command based on the inferred result value.
Hereinafter, the artificial intelligence device 100 of FIG. 1 may be referred to as a device 100. The device 100 may include all components of the artificial intelligence device 100 illustrated in FIG. 1.
Hereinafter, one or more processors 180 may be provided.
FIGS. 3 and 4 are drawings illustrating a learning method for generating a 3D image including a 3D object according to one embodiment of the present disclosure.
Hereinafter, a unit object may represent an object corresponding to a relatively large and distinct subject, and a detailed object may represent a partial object constituting the unit object.
A first level may be a coarse level, and a second level may be a fine level.
The coarse level may be a level that identifies (segments) multiple unit objects contained within an image.
The fine level may be a level that identifies (segments) detailed objects from each unit object included in the image.
Object segmentation may be a process of identifying unit objects, or detailed objects of each unit object, within an image, and distinguishing each identified unit object from other unit objects or each identified detailed object from other detailed objects.
A segment may be a spatially divided region or representation of an object.
A mask may be a binary or multi-class image generated to represent a segment on an image.
A 3D object may represent either a unit object or a detailed object.
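The two-level terminology above can be pictured as a small hierarchy in which each unit object owns the detailed objects that constitute it. The following sketch is only an illustration under stated assumptions: the class names, the `resolve_selection` helper, the fallback to the unit object when no matching part is found, and the "chair" example are all hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    COARSE = 1  # first level: segments the image into unit objects
    FINE = 2    # second level: segments each unit object into detailed objects


@dataclass
class DetailedObject:
    segment_id: int
    name: str


@dataclass
class UnitObject:
    segment_id: int
    name: str
    parts: list[DetailedObject] = field(default_factory=list)


def resolve_selection(unit: UnitObject, part_id: int, level: Level):
    """Return the 3D object a selection refers to at the given segmentation level."""
    if level is Level.COARSE:
        return unit
    for part in unit.parts:
        if part.segment_id == part_id:
            return part
    return unit  # assumed fallback when the unit object has no matching part


# Hypothetical example: a chair (unit object) made of a leg and a seat (detailed objects).
chair = UnitObject(1, "chair", [DetailedObject(10, "leg"), DetailedObject(11, "seat")])
```

With the level set to `Level.COARSE` a selection anywhere on the chair resolves to the chair itself, while `Level.FINE` resolves to the part that was hit.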
Steps S301 to S311 of FIG. 3 may be a learning process of a 3D segment model.
The 3D segment model may be a model that generates a 3D image in which one or more objects are segmented from a plurality of 2D images corresponding to a plurality of viewpoints.
Referring to FIGS. 3 and 4, the processor 180 of the device 100 may generate a first mask image of a first level and a second mask image of a second level based on a plurality of 2D images (S301).
Each of the plurality of 2D images 400-1 to 400-n may be an image captured from a different view (or view point, view angle). The plurality of 2D images 400-1 to 400-n may be referred to as training view images.
The processor 180 may obtain a first mask image of the first level and a second mask image of the second level from each 2D image using a Segment Anything Model (SAM).
The SAM may be a model that automatically segments objects within an image based on simple user input (e.g., a click or a box). The SAM may be stored in the memory 170.
5 5 FIGS.A andB are diagrams illustrating a process of generating a first mask image of a first level and a second mask image of a second level from a 2D image using SAM according to one embodiment of the present disclosure.
500 500 The training imagemay be a 2D image captured from a specific viewing angle. The training imagemay be composed of a plurality of pixels.
180 500 511 512 513 511 512 513 511 512 513 511 512 513 500 511 512 513 500 The processormay independently execute SAM for each training imageto generate a plurality of segments,,. The plurality of segments,,may include a first segment, a second segment, and a third segment. Each of the plurality of segments,,may be a partial region included in the entire region formed by the training image. Each of the plurality of segments,,may correspond to an object included in the training image.
501 500 511 512 513 The pixelincluded in the training imagemay be included in each of the first segment, the second segment, and the third segment.
180 521 501 500 511 511 512 513 521 The processormay generate a first level maskby assigning the pixelincluded in the training imageto the first segmenthaving the largest area among the plurality of segments,,. The first level maskmay have a segment identifier that identifies a unit object.
180 520 500 The processormay generate a first mask imageincluding a plurality of first level masks by assigning each of a plurality of pixels included in the training imageto a segment with the largest area in order of priority.
180 531 501 500 513 511 512 513 531 The processormay generate a second level maskby assigning the pixelincluded in a training imageto the third segmenthaving the smallest area among the plurality of segments,,. The second level maskmay have a segment identifier that identifies a detailed object.
180 530 500 The processormay generate a second mask imageincluding a plurality of second level masks by assigning each of the plurality of pixels included in the training imageto a segment with the smallest area in order of priority.
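The two-level assignment described above can be illustrated with a minimal sketch. It assumes that the segments produced by SAM are available as sets of pixel coordinates that may overlap; the function name build_level_masks and the toy segment shapes are hypothetical and are not taken from the disclosure.

```python
def build_level_masks(height, width, segments):
    """Assign every pixel to its largest (coarse) and smallest (fine) covering segment.

    segments: dict mapping a segment ID to a set of (row, col) pixels.
    Segments may overlap, as with the segments 511, 512, 513 in the text.
    """
    area = {sid: len(pixels) for sid, pixels in segments.items()}
    coarse = [[None] * width for _ in range(height)]
    fine = [[None] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            covering = [sid for sid, pixels in segments.items() if (r, c) in pixels]
            if not covering:
                continue  # pixel not covered by any segment
            coarse[r][c] = max(covering, key=lambda sid: area[sid])
            fine[r][c] = min(covering, key=lambda sid: area[sid])
    return coarse, fine


# Toy 4x4 image with three nested segments (shapes are illustrative only).
segments = {
    511: {(r, c) for r in range(4) for c in range(4)},  # 16 pixels
    512: {(r, c) for r in range(2) for c in range(3)},  # 6 pixels
    513: {(0, 0), (0, 1)},                              # 2 pixels
}
coarse_mask, fine_mask = build_level_masks(4, 4, segments)
```

In this toy case the pixel at (0, 0) lies in all three segments, so it receives the ID of the largest segment in the coarse mask and the ID of the smallest segment in the fine mask.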
Referring to FIG. 5B, the results of applying SAM to an actual 2D image 540 to obtain a first mask image 550 at the coarse level and a second mask image 560 at the fine level are shown.
FIGS. 3 and 4 are described again.
The processor 180 may generate a first feature field of the first level and a second feature field of the second level based on the plurality of 2D images (S303).
The processor 180 may generate the first feature field of the first level and the second feature field of the second level based on a plurality of 2D images through a 3D Gaussian Splatting (3DGS) method.
The 3D Gaussian splatting method may classify a 3D image generated from the plurality of 2D images 400-1 to 400-n into a plurality of 3D Gaussians, and project the classified plurality of 3D Gaussians onto a 2D screen to render a feature field.
The processor 180 may generate the first feature field from a coarse level feature vector and the second feature field from a fine level feature vector using a Rasterizer rendering method.
FIGS. 6A and 6B are diagrams illustrating a process of generating a first feature field of a first level and a second feature field of a second level based on a plurality of 2D images.
Referring to FIG. 6A, a 3D view image 600 that is reconstructed based on a plurality of 2D images 600-1 to 600-n corresponding to a plurality of viewpoints is illustrated.
The 3D view image 600 may be a scene rendering result based on a virtual camera that is configured by integrating information from the plurality of viewpoints.
The processor 180 may extract a plurality of 3D Gaussians from the 3D view image 600. A 3D Gaussian may be a unit element that expresses visual information such as a shape, a position, and a color of an object in a 3D space in the form of a probability distribution.
Each 3D Gaussian may be expressed by the following [Mathematical Formula 1].
g_i is the i-th 3D Gaussian, p_i is the position of the i-th 3D Gaussian, s_i is the scale of the i-th 3D Gaussian, q_i is the quaternion of the i-th 3D Gaussian, o_i is the opacity of the i-th 3D Gaussian, and c_i may be the color of the i-th 3D Gaussian.
The processor 180 may obtain an augmented 3D Gaussian based on the 3D Gaussian and a feature vector.
The augmented 3D Gaussian may be expressed as the following [Mathematical Formula 2].
ĝ_i is the i-th augmented 3D Gaussian, g_i is the i-th 3D Gaussian, and the term combined with g_i may be an i-th feature vector or an i-th feature vector of the fine level. The feature vector may be referred to as a feature. The augmented 3D Gaussian may be a Gaussian in which the i-th feature vector is combined with the 3D Gaussian.
The i-th feature vector of the fine level may include the i-th feature vector of the coarse level.
The i-th feature vector of the fine level may be obtained by the following [Mathematical Formula 3].
The inputs to [Mathematical Formula 3] may be the i-th feature vector of the coarse level and an i-th transformed feature vector, which is the i-th feature vector of the coarse level processed (projected or upsampled) to fit the fine level.
The i-th feature vector of the fine level may be obtained by combining the i-th feature vector of the coarse level and the i-th transformed feature vector through a concatenation function.
According to a granularity prior, features that are already distinguished at the coarse level should also be distinguished at the fine level, so the i-th feature vector of the fine level may be configured to include the i-th feature vector of the coarse level.
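The relationship between a 3D Gaussian, its augmented form, and the two feature levels may be sketched as follows. This is a minimal illustration under stated assumptions: the class names Gaussian3D and AugmentedGaussian, the field names, and the feature dimensions are hypothetical, and the transformed feature vector is simply supplied as data rather than computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian3D:
    position: List[float]    # p_i
    scale: List[float]       # s_i
    quaternion: List[float]  # q_i
    opacity: float           # o_i
    color: List[float]       # c_i


@dataclass
class AugmentedGaussian:
    gaussian: Gaussian3D
    coarse_feature: List[float]       # i-th feature vector of the coarse level
    transformed_feature: List[float]  # coarse feature processed to fit the fine level

    @property
    def fine_feature(self) -> List[float]:
        # Concatenation keeps the coarse feature as a prefix of the fine feature,
        # so distinctions made at the coarse level are preserved at the fine level.
        return self.coarse_feature + self.transformed_feature


g = AugmentedGaussian(
    gaussian=Gaussian3D([0.0, 0.0, 1.0], [0.1, 0.1, 0.1], [1.0, 0.0, 0.0, 0.0], 0.9, [1.0, 0.0, 0.0]),
    coarse_feature=[0.2, 0.8],
    transformed_feature=[0.5, 0.1, 0.4],
)
```

Because the fine feature begins with the coarse feature, two Gaussians whose coarse features differ necessarily have different fine features, which is the property the granularity prior asks for.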
Referring to FIG. 6B, the processor 180 may generate a first feature field 610 from a coarse level feature vector and a second feature field 620 from a fine level feature vector using a Rasterizer rendering method.
The Rasterizer rendering method may be a way to convert feature vectors in the form of 3D modeling into a 2D image composed of pixels.
The Rasterizer rendering method may be a method of generating a feature field through the following [Mathematical Formula 4].
F^l represents a feature field, the term indexed by i represents the i-th feature vector, α_i represents a rasterization weight of the i-th 3D Gaussian, T_i may represent an i-th transformation matrix, and l may represent a granularity level.
Each of the first feature field 610 and the second feature field 620 may be a pixel-unit feature distribution map generated by projecting 3D Gaussians onto a 2D screen using the Rasterizer rendering method.
FIG. 3 is described again.
The processor 180 may perform contrastive learning on the generated first mask image and the generated first feature field, and may perform the contrastive learning on the generated second mask image and the generated second feature field (S305).
Step S305 may correspond to the contrastive learning process 410 of FIG. 4.
The processor 180 may perform the contrastive learning for the first mask image and the first feature field so that pixels having the same segment identifier (segment ID) have similar feature vectors, and pixels having different segment identifiers have different feature vectors.
The processor 180 may perform the contrastive learning for the second mask image and the second feature field so that pixels having the same segment identifier (segment ID) have similar feature vectors, and pixels having different segment identifiers have different feature vectors.
FIGS. 7A and 7B are diagrams illustrating a process of performing contrastive learning according to an embodiment of the present disclosure.
Referring to FIG. 7A, the first feature field 610 corresponding to the coarse level and the first mask image 550 corresponding to the coarse level are illustrated.
Each of the first pixel p1 and the second pixel p2 included in the first feature field 610 may have its own feature vector.
The processor 180 may calculate a cosine similarity of the two feature vectors through the following [Mathematical Formula 5].
s^l(p1, p2) may represent the cosine similarity between the feature vectors corresponding to the two pixels (p1, p2).
When the processor 180 detects that the segment IDs of the two pixels (p1, p2) in the first mask image 550 of the coarse level are the same (a positive pair), the processor 180 may calculate a loss function of the contrastive learning through the following [Mathematical Formula 6].
The loss function of [Mathematical Formula 6] may be a positive loss function to increase the cosine similarity of feature vectors corresponding to pixels belonging to the same segment.
The processor 180 may perform the contrastive learning so that the positive loss function is minimized. Accordingly, feature vectors of pixel pairs belonging to the same segment may be learned to become closer to each other.
Referring to FIG. 7B, the first feature field 610 corresponding to the coarse level and the first mask image 550 corresponding to the coarse level are illustrated.
Each of the first pixel p1 and the second pixel p2 included in the first feature field 610 may have its own feature vector.
The processor 180 may calculate the cosine similarity of the two feature vectors through the [Mathematical Formula 5].
When the processor 180 detects that the segment IDs of the two pixels (p1, p2) in the first mask image 550 of the coarse level are different (a negative pair), the processor 180 may calculate the loss function of the contrastive learning through the following [Mathematical Formula 7].
The loss function of [Mathematical Formula 7] may be a negative loss function to reduce the cosine similarity of feature vectors corresponding to pixels belonging to different segments.
The processor 180 may perform the contrastive learning so that the negative loss function is maximized. The feature vectors of pixel pairs belonging to different segments may be learned to become farther from each other.
Meanwhile, in FIG. 7B, the processor 180 may perform the contrastive learning so that the cosine similarity of the feature vectors corresponding to the two pixels (p1, p2) is reduced until it reaches a threshold similarity τ^l. At the coarse level, the threshold similarity may be 0.5, and at the fine level, the threshold similarity may be 0.75, but these are just example values.
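The pairwise treatment of positive and negative pairs can be sketched as below. The cosine similarity follows its standard definition; the specific penalty forms (one minus the similarity for a positive pair, and a hinge on the threshold τ^l for a negative pair) are assumptions made only for illustration and are not taken from [Mathematical Formula 6] or [Mathematical Formula 7]. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pair_penalty(feat1, feat2, segment_id1, segment_id2, tau):
    """Illustrative penalty for one pixel pair (assumed forms, see lead-in)."""
    s = cosine_similarity(feat1, feat2)
    if segment_id1 == segment_id2:
        # Positive pair: penalty shrinks as the features become more similar.
        return 1.0 - s
    # Negative pair: penalized only while the similarity exceeds the threshold.
    return max(0.0, s - tau)


same_segment = pair_penalty([1.0, 0.0], [1.0, 0.0], 7, 7, tau=0.5)
different_but_similar = pair_penalty([1.0, 0.0], [1.0, 0.0], 7, 8, tau=0.5)
different_and_orthogonal = pair_penalty([1.0, 0.0], [0.0, 1.0], 7, 8, tau=0.5)
```

With a threshold of 0.5 (the example coarse-level value in the text), a negative pair stops contributing once its similarity is at or below 0.5, which matches the idea that such pairs are pushed apart only up to the threshold.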
Contrastive learning according to FIGS. 7A and 7B may also be performed for the fine level.
An overall loss function of contrastive learning may be expressed by the following [Mathematical Formula 8].
L_cont may represent the overall loss function of contrastive learning, and the weighting coefficient may be a hyperparameter that controls a balance of the loss terms in [Mathematical Formula 8].
In this type of contrastive learning, depending on the view of the 2D image, a given pixel contained in the mask image may have the same segment ID or different segment IDs.
FIG. 8 is a diagram explaining that the segment ID to which the pixel belongs may vary depending on the view of the 2D image.
Referring to FIG. 8, a first image 810 captured from a first view and a second image 830 captured from a different angle from the first view are illustrated.
Each of the first image 810 and the second image 830 may include a first unit object 811 and a second unit object 812.
Additionally, a coarse level mask image 820 obtained from the first image 810 through the SAM and a coarse level mask image 840 obtained from the second image 830 through the SAM are illustrated.
In the mask image 820 of the first view, the first unit object 811 and the second unit object 812 have the same segment ID and may therefore be classified as the same segment 821.
However, in the mask image 840 of the second view, the first unit object 811 corresponds to a first segment 841 and the second unit object 812 corresponds to a second segment 842, and since they have different segment IDs, they may be classified as different segments.
In this way, mask images generated based on the SAM may suffer from instability in the contrastive learning due to a view mismatch.
Accordingly, this disclosure aims to solve the instability of the contrastive learning through Global Feature-guided Learning (GFL).
FIGS. 3 and 4 are described again.
The processor 180 may obtain an average feature set including a plurality of average feature vectors corresponding to each of a plurality of views (S307).
Step S307 may correspond to the average pooling process 420 of FIG. 4.
The processor 180 may cluster the acquired average feature set to obtain a plurality of global clusters (S309).
Step S309 may correspond to the clustering process 430 of FIG. 4.
The processor 180 may perform contrastive learning between feature vectors and the global feature candidates, each of which represents one of the obtained plurality of global clusters (S311).
Step S311 may correspond to the global feature candidate extraction process 440 and the Global Feature-guided learning (GFL) process of FIG. 4.
Steps S307 to S311 may be included in the Global Feature-guided learning (GFL) process.
The processor 180 may train a 3D segment model through steps S301 to S311.
FIGS. 9A to 9D are diagrams for explaining Global Feature-guided learning according to an embodiment of the present disclosure.
The Global Feature-guided learning may be a process of segmenting an object by extracting global feature candidates representing each segment based on 2D images corresponding to multiple views, and by clustering the extracted global feature candidates to group segments corresponding to the same object.
Referring to FIG. 9A, a first mask image 910 of the coarse level obtained from a 2D image corresponding to a first view through the SAM and a first feature field 920 of the coarse level obtained based on 3D Gaussians are illustrated. Each of a plurality of segments constituting the first mask image 910 may be assigned a different segment ID. A plurality of pixels constituting each segment may be assigned the same segment ID.
The processor 180 may average a plurality of feature vectors corresponding to a plurality of pixels included in a first feature vector set 921 corresponding to a first segment 911 included in the first mask image 910 through an average pooling method (Average Pooling).
The processor 180 may obtain a first average feature vector 901 through the averaging process. The processor 180 may place the first average feature vector 901 on a feature space 900. The feature space 900 may be a high-dimensional feature space.
Referring to FIG. 9B, the processor 180 may average a plurality of feature vectors corresponding to a plurality of pixels included in a second feature vector set 922 corresponding to a second segment 912 through the average pooling method.
The processor 180 may obtain a second average feature vector 902 through the averaging process. The processor 180 may place the second average feature vector 902 on the feature space 900.
Referring to FIG. 9C, a second mask image 930 of the coarse level obtained from a 2D image corresponding to a second view through the SAM and a first feature field 940 of the coarse level obtained based on 3D Gaussians are illustrated. Each of a plurality of segments constituting the second mask image 930 may be assigned a different segment ID. A plurality of pixels constituting each segment may be assigned the same segment ID.
The processor 180 may average a plurality of feature vectors corresponding to a plurality of pixels included in a third feature vector set 941 corresponding to a third segment 931 through the average pooling method.
The processor 180 may obtain a third average feature vector 903 through the averaging process. The processor 180 may place the third average feature vector 903 on the feature space 900.
The processor 180 may obtain an average feature set including average feature vectors (901 to 905) obtained from all views, and place the obtained average feature set on the feature space 900, as illustrated in FIG. 9D.
Meanwhile, the processor 180 may obtain the average feature set through the following [Mathematical Formula 9] by applying the average pooling method.
A set of masks for all viewpoints may be defined for the two levels of masks (e.g., the coarse level and the fine level). V may be the total number of viewpoints. M^{l,v}, of size H×W, may be a mask for viewpoint v at level l. The result of [Mathematical Formula 9] is a mean feature set including mean feature vectors, and l={f,c} may represent a granularity level. f may represent the fine level, c may represent the coarse level, and s may identify a segment corresponding to an object defined in the mask. s may be a segment ID.
Each mean feature vector may represent the segment s of the viewpoint v.
Each mean feature vector may be computed over the set of pixels belonging to the segment s of the viewpoint v, using the feature vector of each pixel p in that set.
D may represent the dimension of the feature vector.
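The per-segment average pooling for a single view can be sketched as follows. The sketch assumes that the mask image is a 2D grid of segment IDs and that the feature field is a grid of the same shape holding one feature vector per pixel; the function name average_feature_set and the toy values are hypothetical.

```python
def average_feature_set(mask, feature_field):
    """Return {segment_id: mean feature vector} for one view.

    mask: 2D list of segment IDs (None for unassigned pixels).
    feature_field: 2D list of feature vectors with the same height and width.
    """
    sums, counts = {}, {}
    for mask_row, feature_row in zip(mask, feature_field):
        for segment_id, feature in zip(mask_row, feature_row):
            if segment_id is None:
                continue
            acc = sums.setdefault(segment_id, [0.0] * len(feature))
            for d, value in enumerate(feature):
                acc[d] += value
            counts[segment_id] = counts.get(segment_id, 0) + 1
    return {sid: [v / counts[sid] for v in acc] for sid, acc in sums.items()}


# Toy 2x2 view with D = 2: two pixels in segment 1, one in segment 2, one unassigned.
mask = [[1, 1], [2, None]]
feature_field = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[0.0, 4.0], [9.0, 9.0]],
]
view_averages = average_feature_set(mask, feature_field)
```

Repeating this for every view and every granularity level, and collecting the results, would give the average feature set that is placed on the feature space.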
The processor 180 may perform clustering on the average feature set arranged on the feature space 900 using the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) method.
The HDBSCAN method may be a density-based unsupervised clustering method that automatically determines the number of clusters by considering the spatial distribution of the average feature vectors and that excludes noise data.
The processor 180 may generate a plurality of global clusters 950, 960, 970 according to the clustering of the HDBSCAN method. The processor 180 may obtain a plurality of global feature candidates 951, 961, 971 representing each global cluster by applying the HDBSCAN method to the average feature vectors. Each global cluster may be a cluster that identifies one unit object. The plurality of global feature candidates may be referred to collectively by a single symbol in the formulas below.
Each global feature candidate may be the candidate located at the very center of each global cluster.
The first global feature candidate 951 may represent the first global cluster 950, the second global feature candidate 961 may represent the second global cluster 960, and the third global feature candidate 971 may represent the third global cluster 970.
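Selecting one representative per global cluster can be sketched as below. The sketch does not run HDBSCAN itself; it assumes that cluster labels have already been produced by a density-based clusterer, with the label -1 marking noise, which is a common convention and an assumption here. Choosing the member nearest to the cluster mean as the candidate at the very center is also an illustrative choice, and the function name is hypothetical.

```python
import math


def global_feature_candidates(average_vectors, labels):
    """Pick, for each cluster, the member vector closest to the cluster mean.

    average_vectors: list of average feature vectors from all views.
    labels: cluster label per vector; -1 is treated as noise and skipped.
    """
    clusters = {}
    for vector, label in zip(average_vectors, labels):
        if label == -1:
            continue  # noise points do not belong to any global cluster
        clusters.setdefault(label, []).append(vector)

    candidates = {}
    for label, members in clusters.items():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        candidates[label] = min(members, key=lambda m: math.dist(m, centroid))
    return candidates


average_vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0], [50.0, 50.0]]
labels = [0, 0, 0, 1, -1]  # assumed output of a prior clustering step
candidates = global_feature_candidates(average_vectors, labels)
```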
After that, the processor 180 may find, among the plurality of global clusters, the global cluster closest to the i-th feature vector (the feature vector of the i-th 3D Gaussian at level l) through the following [Mathematical Formula 10]. The i-th feature vector may be referred to as a Gaussian feature.
The result of [Mathematical Formula 10] may represent an identifier of the global cluster c that is most similar to the i-th Gaussian feature at the granularity level l.
S̃^l(i, c) may represent the cosine similarity between the i-th Gaussian feature and the global feature candidate representing the center of the global cluster c.
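Finding the closest global cluster for one Gaussian feature amounts to taking the cluster whose global feature candidate has the highest cosine similarity to that feature. A minimal sketch follows; the function names and the toy candidate vectors are hypothetical, and the cluster reference numerals are used only as dictionary keys.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_global_cluster(gaussian_feature, candidates):
    """Return the ID of the cluster whose candidate is most similar to the feature."""
    return max(
        candidates,
        key=lambda cluster_id: cosine_similarity(gaussian_feature, candidates[cluster_id]),
    )


candidates = {950: [1.0, 0.0], 960: [0.0, 1.0], 970: [-1.0, 0.0]}
closest_a = nearest_global_cluster([0.9, 0.1], candidates)
closest_b = nearest_global_cluster([0.1, 0.9], candidates)
```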
A positive loss function, which is used to induce each Gaussian feature to belong to the global cluster with the highest probability, may be defined as the following [Mathematical Formula 11].
N may be the number of Gaussian features.
The loss term of [Mathematical Formula 11] may represent the positive loss function for Global Feature-guided learning (GFL).
τ_g may be a threshold value to determine whether a particular Gaussian feature belongs to a given global cluster, and may be set to 0.9.
If the closeness between the Gaussian feature and the nearest global cluster is greater than τ_g, the Gaussian feature may be trained to move closer to that nearest global cluster.
Conversely, a negative loss function that encourages the Gaussian feature to move away from the other global clusters may be defined as the following [Mathematical Formula 12].
The loss term of [Mathematical Formula 12] may represent the negative loss function for Global Feature-guided learning (GFL).
The Gaussian feature may be trained to move away from the global clusters other than the nearest global cluster until its similarity to them falls to a threshold similarity τ^l.
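The two Global Feature-guided learning terms can be sketched numerically as follows. Only the thresholds (τ_g, for example 0.9, and the per-level threshold similarity τ^l), the averaging over N Gaussian features, and the pull and push behavior come from the description; the concrete penalty forms below (one minus the similarity for the pull term, a hinge for the push term) are assumptions for illustration and are not [Mathematical Formula 11] or [Mathematical Formula 12]. All function and parameter names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def gfl_terms(gaussian_features, candidates, tau_g=0.9, tau_l=0.5):
    """Return (pull_term, push_term), each averaged over the N Gaussian features."""
    pull, push = 0.0, 0.0
    for feature in gaussian_features:
        sims = {cid: cosine_similarity(feature, cand) for cid, cand in candidates.items()}
        nearest = max(sims, key=sims.get)
        # Pull toward the nearest cluster only when the feature is already
        # confidently close to it (similarity above tau_g).
        if sims[nearest] > tau_g:
            pull += 1.0 - sims[nearest]
        # Push away from every other cluster until similarity drops to tau_l.
        others = [s for cid, s in sims.items() if cid != nearest]
        if others:
            push += sum(max(0.0, s - tau_l) for s in others) / len(others)
    n = len(gaussian_features)
    return pull / n, push / n


candidates = {0: [1.0, 0.0], 1: [0.0, 1.0]}
aligned = gfl_terms([[1.0, 0.0]], candidates)
ambiguous = gfl_terms([[1.0, 1.0]], candidates)
```

In the second call the feature is equally similar to both clusters (about 0.707), so it is below τ_g and receives no pull, but it is still pushed away from the non-nearest cluster because its similarity exceeds τ^l.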
The total loss of the contrastive learning may be defined as follows [Mathematical Formula 13].
In this way, the Global Feature-guided learning method may be a learning method that maintains viewpoint consistency by using global feature candidates obtained from all viewpoints of the scene. In short, the Global Feature-guided learning method may be a learning method that induces the features of each 3D Gaussian to belong to a specific global cluster, while at the same time moving away from other global clusters.
FIG. 10 is a flowchart illustrating a method of operating a device according to another embodiment of the present disclosure.
Referring to FIG. 10, the processor 180 of the device 100 may obtain a plurality of 2D images corresponding to a plurality of viewpoints (S1001).
In one embodiment, the plurality of 2D images may be images captured from a plurality of camera angles. A viewpoint may be referred to as a camera angle, a view angle, or a view.
In one embodiment, the processor 180 may acquire the plurality of 2D images captured through the camera 121.
In another embodiment, the processor 180 may receive the plurality of 2D images from an external device via the communication interface 110.
The processor 180 may obtain a 3D image including one or more segmented 3D objects based on the obtained plurality of 2D images and display the obtained 3D image on the display 151 (S1003).
The processor 180 may obtain a 3D image including one or more 3D objects segmented from the plurality of 2D images through a 3D segment model.
The 3D segment model may be a model trained according to the embodiment of FIG. 3.
FIG. 11 is a drawing for explaining a 3D segment model according to one embodiment of the present disclosure.
In one embodiment, the 3D segment model 1100 may be an artificial neural network-based model trained by the learning processor 130 or the processor 180 of the device 100.
In another embodiment, the 3D segment model 1100 may be a model trained by the server 200, and the processor 180 may receive the 3D segment model 1100 from the server 200 through the communication interface 110 and store the 3D segment model 1100 in the memory 170.
The 3D segment model 1100 may be a model that outputs a 3D image 1120 including one or more objects segmented from a plurality of 2D images 1100-1 to 1100-n corresponding to a plurality of viewpoints.
The 3D segment model 1100 may be a model trained according to the embodiment of FIG. 3.
The 3D segment model 1100 may be a model that segments an object by extracting global feature candidates representing each segment based on the plurality of 2D images corresponding to the plurality of viewpoints, and grouping segments corresponding to the same object by clustering the extracted global feature candidates.
One or more 3D objects included in the 3D image 1120 may be selectable (or clickable).
FIG. 10 is described again.
Meanwhile, the processor 180 may display the obtained 3D image in a first area and one or more objects segmented within the 3D image in a second area. The first area and the second area may not overlap.
The processor 180 may receive a first command to select one of one or more 3D objects included in a 3D image (S1005).
In one embodiment, the processor 180 may receive the first command to select a 3D object via a pointer corresponding to a movement of a mouse connected via the user input interface 123.
In another embodiment, the processor 180 may receive a touch command for selecting the 3D object as the first command when the display 151 is implemented as a touch screen.
The processor 180 may display a UI element corresponding to the selected 3D object on the display 151 according to the received first command (S1007).
The UI element may be an element that indicates that the 3D object is selected.
In one embodiment, the UI element may be an element for identifying a selected 3D object.
For example, the processor 180 may highlight the selected 3D object according to the first command.
As another example, the processor 180 may display an indicator indicating that the 3D object has been selected in accordance with the first command at a location adjacent to the selected 3D object. The indicator may be an arrow, an animation, or a pointer, but this is only an example.
In another embodiment, the UI element may include detailed objects that constitute the selected 3D object. The detailed objects may be segmented partial objects at the fine level.
The processor 180 may receive a second command for editing the selected 3D object (S1009).
In one embodiment, the second command may be a command for changing an attribute of the 3D object. For example, the second command may be a command for editing one or more of a name of the 3D object, a color of the 3D object, a size of the 3D object, or a position of the 3D object.
The processor 180 may display the edited 3D object on the display 151 according to the received second command (S1011).
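The sequence of a first command (selection), a UI element, and a second command (attribute edit) can be sketched as a small state holder. The class names Scene and SceneObject, the dictionary used to stand in for the UI element, and the sample attribute values are all hypothetical; only the editable attributes (name, color, size, position) come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SceneObject:
    name: str
    color: str
    size: float
    position: List[float]


EDITABLE_ATTRIBUTES = {"name", "color", "size", "position"}


class Scene:
    def __init__(self, objects: Dict[str, SceneObject]):
        self.objects = objects
        self.selected_id: Optional[str] = None

    def select(self, object_id: str) -> dict:
        """First command: select an object and return a UI element describing it."""
        if object_id not in self.objects:
            raise KeyError(object_id)
        self.selected_id = object_id
        return {"type": "highlight", "target": object_id}

    def edit(self, attribute: str, value) -> SceneObject:
        """Second command: change one attribute of the selected object."""
        if self.selected_id is None:
            raise RuntimeError("no 3D object is selected")
        if attribute not in EDITABLE_ATTRIBUTES:
            raise ValueError(f"unsupported attribute: {attribute}")
        selected = self.objects[self.selected_id]
        setattr(selected, attribute, value)
        return selected


scene = Scene({"1121": SceneObject("figure", "brown", 1.0, [0.0, 0.0, 0.0])})
ui_element = scene.select("1121")
edited = scene.edit("size", 1.5)
```

Rejecting an edit when nothing is selected mirrors the ordering in the flowchart, where the second command applies to the object chosen by the first command.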
Below, an embodiment of FIG. 10 is described in detail.
FIGS. 12A to 12I are drawings for explaining a process of editing a selected 3D object on a 3D image including a plurality of 3D objects according to an embodiment of the present disclosure.
The processor 180 may obtain the 3D image 1120 from the plurality of 2D images 1100-1 to 1100-n corresponding to the plurality of viewpoints through the 3D segment model 1100 as illustrated in FIG. 11.
Referring to FIGS. 12A to 12I, execution screens of a segment application for editing a 3D image obtained through the 3D segment model 1100 are illustrated.
Referring to FIG. 12A, the processor 180 of the device 100 may display a first screen 1201 on the display 151 according to the execution of the segment application.
The first screen 1201 may include the 3D image 1120 obtained from the plurality of 2D images 1100-1 to 1100-n corresponding to a plurality of viewpoints and an edit menu 1210.
The edit menu 1210 may include a rendering menu 1211, a 3D segmentation menu 1212, and a 3D object editing menu 1213.
The rendering menu 1211 may be a menu for settings related to image rendering.
The rendering menu 1211 may include a mode item 1211-1 for setting the image display mode and a FOV item 1211-2 for setting the FOV (Field Of View).
The mode item 1211-1 may be set to either an image mode for displaying a 3D image or a segment mode for displaying a segment image. In FIG. 12A, the mode item 1211-1 may be set to the image mode.
The rendering menu 1211 may further include items for setting a background color of the area in which the image is displayed.
The 3D segmentation menu 1212 may be a menu for setting a segmentation level of a mask image. The 3D segmentation menu 1212 may include a granularity level item 1212-1 for setting the segmentation level. The granularity level item 1212-1 may be set to either the coarse level or the fine level. It is assumed that the 3D image 1120 is an image set to the coarse level. The 3D image 1120 may be referred to as a coarse level 3D image.
The 3D object editing menu 1213 may be a menu for editing the attributes of a 3D object selected in the 3D image 1120.
The 3D object editing menu 1213 may include a text-based editing item 1213-1, a size item 1213-2, a transformation item 1213-3, and a remove item 1213-4.
The text-based editing item 1213-1 may be an item for editing attributes of the 3D object using text input.
The size item 1213-2 may be an item for editing the size of the 3D object.
The transformation item 1213-3 may be an item for transforming a direction of the 3D object or moving the 3D object. The transformation item 1213-3 may be an item for receiving an input for moving the 3D object in the x, y, and z-axis directions.
The remove item 1213-4 may be an item for removing the 3D object within the 3D image 1120.
The 3D image 1120 included in the first screen 1201 may include a plurality of 3D objects segmented by the 3D segment model 1100.
A pointer 1200 (or cursor) may be displayed on the first screen 1201, and the processor 180 may receive input for manipulating the pointer 1200 to select (or click) and edit the 3D object.
The processor 180 may display a second screen 1202 including a first segment image 1220 when the image display mode is set to the segment mode through the mode item 1211-1 and the segmentation level is set to the coarse level through the granularity level item 1212-1, as illustrated in FIG. 12B.
The first segment image 1220 may be an image that identifies a selectable 3D object at the coarse level. The coarse level may be a level that segments an entire area into unit objects. The first segment image 1220 may be a mask image. For example, the first segment image 1220 may correspond to the first mask image 550 of FIG. 5B.
The processor 180 may display a third screen 1203 including a second segment image 1230 when the image display mode is set to the segment mode through the mode item 1211-1 and the segmentation level is set to the fine level through the granularity level item 1212-1, as shown in FIG. 12C.
The second segment image 1230 may be an image that identifies a selectable 3D object at the fine level. The fine level may be a level that segments each of a plurality of unit objects into detailed objects. The second segment image 1230 may be a mask image. For example, the second segment image 1230 may correspond to the second mask image 560 of FIG. 5B.
Referring to FIG. 12D, when the processor 180 receives a command to select a first 3D object 1121 within the 3D image 1120, the processor 180 may display a fourth screen 1204 identifying the first 3D object 1121 selected according to the received command on the display 151.
The processor 180 may highlight the selected first 3D object 1121 or otherwise identify the first 3D object 1121. The processor 180 may further display an indicator (not shown) indicating that the first 3D object 1121 has been selected.
When the processor 180 receives a command to adjust the size of the selected first 3D object 1121 through the size item 1213-2, as illustrated in FIG. 12E, the processor 180 may display a fifth screen 1205 including the first 3D object 1121 whose size has been adjusted, on the display 151. The command to adjust the size of the first 3D object 1121 may be a command to input a scale value.
Meanwhile, a second 3D object 1122 may be placed adjacent to the first 3D object 1121.
The processor 180 may move the second 3D object 1122 from a first point to a second point according to a transformation command that inputs an x-axis value, a y-axis value, and a z-axis value through the transformation item 1213-3 after the second 3D object 1122 is selected. The processor 180 may display a sixth screen 1206 showing a state in which the second 3D object is moved according to the transformation command of the second 3D object 1122, on the display 151.
The processor 180 may reflect the edited 3D object in real time on the 3D image 1120 according to a command to edit the value of any one of the granularity level item 1212-1, the size item 1213-2, or the transformation item 1213-3 for setting the position of the selected 3D object.
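The size and transformation edits can be illustrated on the positions that make up one selected object. The sketch assumes that the object is represented by a list of 3D points (for example, the positions of its Gaussians) and that a scale value is applied about the centroid of the object; both are assumptions, and the function names are hypothetical.

```python
def scale_about_centroid(points, factor):
    """Scale an object's points about its own centroid by the given scale value."""
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    return [
        [centroid[d] + factor * (p[d] - centroid[d]) for d in range(3)]
        for p in points
    ]


def translate(points, dx, dy, dz):
    """Move an object's points by the x-axis, y-axis, and z-axis values."""
    offset = (dx, dy, dz)
    return [[p[d] + offset[d] for d in range(3)] for p in points]


object_points = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
scaled = scale_about_centroid(object_points, 2.0)
moved = translate(scaled, 1.0, 2.0, 3.0)
```

Because only the points of the selected object are passed in, the remaining objects in the scene are left untouched, which reflects the per-object editing described above.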
In this way, according to the embodiment of the present disclosure, since editing is possible in units of segmented objects on a 3D image, attributes of each object may be individually adjusted, thereby providing precise customization and an intuitive interface without affecting the entire scene.
When the granularity level item 1212-1 is set to the fine level while the sixth screen 1206 is displayed, the processor 180 may display the seventh screen 1207 including a fine level 3D image 1120-1 on the display 151, as in FIG. 12G.
The fine level 3D image 1120-1 may be an image in which each of a plurality of 3D objects is subdivided into detailed objects.
The detailed objects that make up a 3D object in the fine level 3D image 1120-1 may be edited through the edit menu 1210.
A first 3D object 1121 may be segmented into a first detailed object 1121a and a second detailed object 1121b. Each of the first detailed object 1121a and the second detailed object 1121b may be selectable or clickable.
If the first 3D object 1121 is an animal figure, the first detailed object 1121a may represent a body of the animal figure, and the second detailed object 1121b may represent a leg of the animal figure.
The processor 180 may edit the size and position of a selected detailed object according to an editing command for the first detailed object 1121a or the second detailed object 1121b.
The processor 180 may display an eighth screen 1208 in which a source text of apple indicating a source and a target text of rainbow color apple indicating a target are entered on the text-based editing item 1213-1, as illustrated in FIG. 12H. The source text may be a text for selecting a first detailed object 1123a, and the target text may be a text for editing an attribute of the selected first detailed object 1123a.
A third 3D object 1123 representing a fruit may be segmented into the first detailed object 1123a representing an apple and the second detailed object 1123b representing a leaf.
The processor 180 may select the first detailed object 1123a of the third 3D object 1123 corresponding to apple according to the input of the source text. The processor 180 may edit the selected first detailed object 1123a of the third 3D object 1123 to have a rainbow color according to the input of the target text.
The processor 180 may switch the eighth screen 1208 including the first detailed object 1123a having a green color to a ninth screen 1209 including the first detailed object 1123a having a rainbow color according to the input of the source text and the target text (see FIG. 12I).
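The roles of the source text and the target text can be illustrated with a deliberately simplified stand-in. The sketch assumes that each detailed object carries a text label and that plain string matching is used; an actual embodiment may rely on a text-guided model instead, and nothing below should be read as the disclosed editing method. The function name, the label field, and the initial color of the leaf are hypothetical.

```python
def apply_text_edit(detailed_objects, source_text, target_text):
    """Select the detailed object named by source_text and restyle it per target_text.

    detailed_objects: dict mapping an object ID to {"label": str, "color": str}.
    Returns the ID of the edited object, or None if no label matches.
    """
    source = source_text.strip().lower()
    for object_id, detail in detailed_objects.items():
        if detail["label"].lower() == source:
            # Whatever remains of the target text after removing the source word
            # is treated as the requested appearance (simplifying assumption).
            appearance = target_text.lower().replace(source, "").strip()
            detail["color"] = appearance
            return object_id
    return None


fruit_parts = {
    "1123a": {"label": "apple", "color": "green"},
    "1123b": {"label": "leaf", "color": "green"},
}
edited_id = apply_text_edit(fruit_parts, "apple", "rainbow color apple")
```

Only the detailed object matched by the source text is changed; the leaf keeps its original attribute, which corresponds to editing at the detailed object level.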
In this way, according to an embodiment of the present disclosure, by segmenting detailed objects constituting a unit object on a 3D image, the attributes of each detailed object may be individually adjusted, and the user may perform precise customization at the detailed element level of the object.
FIGS. 13A to 13D are drawings for explaining a method for providing information on segmented 3D objects in a 3D image according to an embodiment of the present disclosure.
Referring to FIG. 13A, the processor 180 may display a 3D image 1300 including a plurality of 3D objects 1301, 1302, 1303, and 1304 on the display 151. The processor 180 may further display the edit menu 1210 on one side of the 3D image 1300. The illustration of the edit menu 1210 is omitted.
The processor 180 may display a selected 3D object among the plurality of 3D objects 1301, 1302, 1303, and 1304 on an area other than the area where the 3D image 1300 is displayed.
For example, when a first 3D object 1301 is selected, the processor 180 may display the selected first 3D object 1301 on a first region 1311 that is a separate region from the region where the 3D image 1300 is displayed. The processor 180 may display the detailed object 1301a on the first region 1311 according to a command for selecting the detailed object 1301a of the first 3D object 1301 on the first region 1311.
When a second 3D object 1302 is selected, the processor 180 may display the selected second 3D object 1302 on a second area 1312 that is a separate area from the area where the 3D image 1300 is displayed. The processor 180 may display the detailed object 1302a on the second area 1312 according to a command for selecting the detailed object 1302a of the second 3D object 1302 on the second area 1312.
In another embodiment, the processor 180 may determine that the first 3D object 1301 is selected when the detailed object 1301a is selected once (or clicked), and may determine that the detailed object 1301a is selected when the detailed object 1301a is selected twice in a row (or double-clicked).
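As a non-limiting illustration, the click-count rule of this embodiment may be sketched in Python as follows. The function name `resolve_selection` and the use of reference numerals as string identifiers are assumptions made only for this sketch.

```python
def resolve_selection(unit_id: str, detail_id: str, click_count: int) -> str:
    """Return the identifier of the object selected by clicking a detailed object.

    One click selects the enclosing unit object; two consecutive clicks
    (a double-click) select the detailed object itself. Treating more than
    two clicks like a double-click is an assumption of this sketch.
    """
    if click_count < 1:
        raise ValueError("click_count must be at least 1")
    return unit_id if click_count == 1 else detail_id


# A single click on detailed object 1301a selects the first 3D object 1301.
print(resolve_selection("1301", "1301a", click_count=1))
# A double-click selects the detailed object 1301a.
print(resolve_selection("1301", "1301a", click_count=2))
```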
In this way, according to an embodiment of the present disclosure, since segmented 3D objects and detailed objects of 3D objects on a 3D image are displayed on separate areas, the cognitive burden of manipulating a specific object in the entire scene may be reduced, and the accuracy of user input may be increased.
Referring to FIG. 13B, the processor 180 may display the 3D image 1300, a coarse level result window 1330, and a fine level result window 1340 on the display 151.
When the 3D image 1300 is acquired from the plurality of 2D images corresponding to the plurality of viewpoints, the processor 180 may display the 3D image 1300, the coarse level result window 1330, and the fine level result window 1340 on the display 151.
The coarse level result window 1330 may include a plurality of unit objects segmented from the 3D image 1300. Each of the plurality of segmented unit objects may be selectable and editable.
The fine level result window 1340 may include detailed objects that subdivide each of the multiple unit objects included in the coarse level result window 1330. Each of the subdivided detailed objects may be selectable and editable.
In this way, according to the embodiment of the present disclosure, by separately displaying coarse and fine level results and enabling selection and attribute editing for each segment, a user may perform intuitive and precise editing work based on a hierarchical structure even within a complex 3D scene.
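As a non-limiting illustration, the relationship between the coarse level result and the fine level result may be sketched in Python as a two-level mapping. The names `SCENE`, `coarse_level_items`, and `fine_level_items` are hypothetical; the "fruit" entry follows the apple and leaf example given earlier, while the "cup" entry and its parts are invented for the sketch.

```python
from __future__ import annotations

# Unit objects mapped to the detailed objects that constitute them.
# "cup" and its parts are hypothetical and appear only for illustration.
SCENE: dict[str, list[str]] = {
    "fruit": ["apple", "leaf"],
    "cup": ["body", "handle"],
}


def coarse_level_items(scene: dict[str, list[str]]) -> list[str]:
    """Entries for a coarse level result window: one per unit object."""
    return list(scene)


def fine_level_items(scene: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Entries for a fine level result window: (unit object, detailed object) pairs."""
    return [(unit, detail) for unit, details in scene.items() for detail in details]


print(coarse_level_items(SCENE))
print(fine_level_items(SCENE))
```

Keeping the unit object alongside each detailed object in the fine level entries is one way to preserve the hierarchy when a detailed object is selected for editing.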
Referring to FIG. 13C, the processor 180 may display a 3D image 1300 and tag information 1350 on the display 151. The tag information 1350 may be information for identifying 3D objects that may be selected and edited. The tag information 1350 may include one or more of a color of the 3D object and a name of the 3D object.
The processor 180 may further display a plurality of 3D view images 1300-1, 1300-2, and 1300-3. Each of the plurality of 3D view images 1300-1, 1300-2, and 1300-3 may be an image corresponding to a different viewpoint.
In this way, according to an embodiment of the present disclosure, tag information for an editable 3D object is provided, so that a user may easily and clearly recognize the editable object.
Referring to FIG. 13D, the processor 180 may identify a plurality of editable 3D objects within the 3D image 1300. For example, the processor 180 may identify the plurality of editable 3D objects within the 3D image 1300 with different colors. Specifically, the first 3D object 1301 may be expressed in a first color, and the second 3D object 1302 may be expressed in a second color.
FIG. 14 is a drawing explaining a method for guiding shooting when shooting a 2D image to generate a 3D image according to one embodiment of the present disclosure.
Referring to FIG. 14, the processor 180 may display a 2D image 1410 corresponding to a specific viewpoint captured by the camera 121 on the display 151.
The user may capture a scene multiple times from different viewpoints.
The processor 180 may display a guide on the display 151 indicating that additional shooting is required when additional 2D images are required to generate a 3D image. The guide may be expressed in the form of a text or a progress bar.
The processor 180 may further display a first progress bar 1420 indicating the progress of the generation of the 3D image. The first progress bar 1420 may be a UI indicating that additional shooting for another viewpoint is required for outputting the 3D image through the 3D segment model 1100.
The processor 180 may display a second progress bar 1421 on one side of the first object 1411 indicating that additional shooting for a different viewpoint is required for a segment of the first object 1411. The second progress bar 1421 may indicate a degree of need for additional shooting for a scene including the first object 1411.
The processor 180 may display a third progress bar 1422 on one side of the second object 1412 indicating that additional shooting for a different viewpoint is required for a segment of the second object 1412. The third progress bar 1422 may indicate a degree of need for additional shooting for a scene including the second object 1412.
Additionally, the processor 180 may distinguish and display objects that may be reconfigured into 3D objects and objects that cannot be reconfigured into 3D objects among a plurality of objects included in the 2D image 1410.
According to an embodiment of the present disclosure, by visually distinguishing and displaying the progress of segment learning for each object on the 2D image 1410, the user may intuitively recognize which object requires additional shooting, and effectively respond to securing an optimal viewpoint for generating the 3D image.
In addition, by distinguishing between objects that may be reconstructed as 3D objects and those that cannot, the user may selectively collect data that is valid for 3D modeling, thereby improving the efficiency and accuracy of the entire shooting and learning process.
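As a non-limiting illustration, the overall and per-object progress bars may be sketched in Python as follows. The disclosure does not state how progress is measured; this sketch assumes, purely for illustration, that each object has a required number of viewpoints and that an object that cannot be reconstructed is marked with a required count of zero. The names `CaptureStatus`, `render_bar`, `overall_progress`, and `needs_more_shots`, the example objects, and all numeric values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CaptureStatus:
    name: str
    views_captured: int
    views_required: int  # assumed target count; 0 marks an object that cannot be reconstructed

    @property
    def reconstructable(self) -> bool:
        return self.views_required > 0

    @property
    def progress(self) -> float:
        """Fraction of the assumed required viewpoints already captured, capped at 1.0."""
        if not self.reconstructable:
            return 0.0
        return min(1.0, self.views_captured / self.views_required)


def render_bar(progress: float, width: int = 10) -> str:
    """Draw a text progress bar; partially filled cells are rounded down."""
    filled = int(progress * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


def overall_progress(statuses: list[CaptureStatus]) -> float:
    """Mean progress over reconstructable objects (averaging is an assumption)."""
    usable = [s for s in statuses if s.reconstructable]
    if not usable:
        return 0.0
    return sum(s.progress for s in usable) / len(usable)


def needs_more_shots(statuses: list[CaptureStatus]) -> list[str]:
    """Names of reconstructable objects that still need additional viewpoints."""
    return [s.name for s in statuses if s.reconstructable and s.progress < 1.0]


statuses = [
    CaptureStatus("first object", views_captured=6, views_required=12),
    CaptureStatus("second object", views_captured=12, views_required=12),
    CaptureStatus("non-reconstructable object", views_captured=3, views_required=0),
]
print("overall", render_bar(overall_progress(statuses)))
for status in statuses:
    label = render_bar(status.progress) if status.reconstructable else "(cannot be reconstructed)"
    print(status.name, label)
print("additional shooting needed for:", needs_more_shots(statuses))
```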
FIG. 15 is a drawing illustrating an example of performing furniture arrangement in a home based on a 3D image obtained through a 3D segment model according to one embodiment of the present disclosure.
Referring to FIG. 15, the processor 180 may obtain a 3D image 1500 from a plurality of 2D images corresponding to a plurality of viewpoints through the 3D segment model 1100. The 3D image 1500 may include a segmented TV object 1510 corresponding to a TV.
The processor 180 may further display the editing menu 1210 on one side of the 3D image 1500. The processor 180 may change at least one of the size or position of the TV object 1510 according to a first command for selecting the TV object 1510 and a second command for editing the selected TV object 1510.
By editing the size and placement of the TV object 1510, the user may adjust a visual configuration or a placement environment occupied by the TV object 1510 within the 3D space according to the user's intention. In addition, realism and usability may be increased in virtual interior design and space simulation.
Meanwhile, the processor 180 may provide a link for purchasing a similar product when editing the size, color, or texture of the TV object 1510. For example, when enlarging the size of the TV object 1510, the processor 180 may display an address of a website that sells a TV corresponding to the enlarged size.
FIG. 16 is a drawing illustrating a scenario for generating a top view image based on a 3D image and linking the 3D image and the top view image according to one embodiment of the present disclosure.
Referring to FIG. 16, the processor 180 may generate a 3D image 1610 representing the interior of a room from a plurality of 2D images corresponding to a plurality of viewpoints. The 3D image 1610 may include a segmented first 3D object 1611 representing an air purifier.
The processor 180 may generate a top view image 1620 corresponding to a top viewpoint based on the 3D image 1610. The top view image 1620 may include a second 3D object 1621 in a top view form corresponding to the first 3D object 1611. The first 3D object 1611 and the second 3D object 1621 may represent the same object, that is, an air purifier.
The processor 180 may equally reflect the edited result according to the editing command for the first 3D object 1611 of the 3D image 1610 to the second 3D object 1621 of the top view image 1620.
For example, when the processor 180 receives an editing command to move the first 3D object 1611 of the 3D image 1610 from a first location to a second location, the processor 180 may move the first 3D object 1611 from the first location to the second location. At the same time, the processor 180 may move the second 3D object 1621 from the first location to the second location on the top view image 1620.
The processor 180 may equally reflect an edited result according to the editing command for the second 3D object 1621 of the top view image 1620 to the first 3D object 1611 of the 3D image 1610.
In this way, according to the embodiment of the present disclosure, the editing result on the 3D image is automatically reflected on the top view image, and vice versa, so that the user may perform intuitive and consistent editing work at various viewpoints, and accordingly, the precision of object arrangement in 3D space and editing efficiency may be greatly improved.
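As a non-limiting illustration, one way to keep the 3D image and the top view image consistent is to have both views read from a single shared object state, as in the following Python sketch. The class names `SceneObject` and `LinkedViews`, the coordinate values, and the convention that the y-axis points up (so that the top view uses the x and z coordinates) are assumptions made only for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    position: tuple[float, float, float]  # (x, y, z); y is assumed to be the vertical axis


class LinkedViews:
    """Both views are derived from one shared state, so an edit in either view shows in both."""

    def __init__(self, objects: list[SceneObject]) -> None:
        self._objects = {obj.name: obj for obj in objects}

    def move_in_3d(self, name: str, new_position: tuple[float, float, float]) -> None:
        """Editing command received on the 3D image."""
        self._objects[name].position = new_position

    def move_in_top_view(self, name: str, new_xz: tuple[float, float]) -> None:
        """Editing command received on the top view image; the height is preserved."""
        x, z = new_xz
        _, y, _ = self._objects[name].position
        self._objects[name].position = (x, y, z)

    def position_3d(self, name: str) -> tuple[float, float, float]:
        return self._objects[name].position

    def top_view_position(self, name: str) -> tuple[float, float]:
        """Project onto the floor plane by dropping the vertical coordinate."""
        x, _, z = self._objects[name].position
        return (x, z)


views = LinkedViews([SceneObject("air purifier", (0.0, 0.0, 0.0))])
views.move_in_3d("air purifier", (2.0, 0.0, 1.0))
print(views.top_view_position("air purifier"))  # top view follows the 3D edit
views.move_in_top_view("air purifier", (3.0, 4.0))
print(views.position_3d("air purifier"))  # 3D image follows the top view edit
```

Because neither view stores its own copy of the position, no separate synchronization step is needed in this sketch.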
A device 100 according to one embodiment of the present disclosure may comprise a display 151; and at least one processor 180 configured to: obtain a plurality of two-dimensional (2D) images corresponding to a plurality of viewpoints, obtain a 3D image including one or more segmented 3D objects based on the obtained plurality of 2D images, display a UI (User Interface) element indicating that a segmented 3D object is selected on the display in response to a first command to select one of the one or more segmented 3D objects representing a unit object or a detailed object constituting the unit object, and display an edited 3D object on the display in response to a second command to edit the selected 3D object.
The second command may be a command for editing an attribute of the selected 3D object, and the attribute of the 3D object may include at least one of a color, a size or a position of the 3D object.
The at least one processor 180 may further obtain a text input as the second command, edit the attribute of the 3D object according to the text input, and display the edited 3D object on the display.
The position of the 3D object may represent a transformation along at least one of the x-axis, y-axis, or z-axis directions.
The at least one processor 180 may further identify the one or more segmented 3D objects that are editable on the 3D image.
The at least one processor 180 may identify the one or more segmented 3D objects with different colors.
The at least one processor 180 may further display the selected 3D object in an area separate from the area where the 3D image is displayed, and the 3D object displayed in the separate area may be selectable as a single unit object, or any one of the detailed objects constituting the selected 3D object may be selectable.
The at least one processor 180 may further select the unit object in response to the first command if a segmentation level is set to a first level that segments the unit object included in the 3D image, and select the detailed object in response to the first command if the segmentation level is set to a second level that segments detailed objects constituting each unit object included in the 3D image.
The at least one processor 180 may further select the unit object if the 3D object is selected once, and select the detailed object if the 3D object is selected twice in a row.
The at least one processor 180 may further display tag information, including one or more tags identifying the one or more segmented 3D objects that are editable on the 3D image, on an area other than the area where the 3D image is displayed.
The tag may comprise at least one of a color or a name for identifying the 3D object.
The UI element may comprise an indicator indicating that the 3D object is selected.
The at least one processor 180 may further display an edit menu on the display for editing the one or more segmented 3D objects, and receive the second command through the edit menu.
The edit menu may comprise a granularity level item for setting a granularity level, a size item for setting a size of the selected 3D object, and a transformation item for setting a position of the selected 3D object.
The at least one processor 180 may further reflect the edited 3D object in real time on the 3D image according to the second command for editing a value of any one of the granularity level item, the size item, or the transformation item for setting the position of the selected 3D object.
The at least one processor 180 may further obtain the plurality of 2D images through a camera 121, and display a guide indicating that additional shooting is required to generate the 3D image on the display.
The at least one processor 180 may further generate a top view image based on the 3D image and, when the second command for editing the attribute of a first 3D object included in the 3D image is received, change the attribute of a second 3D object, which is included in the top view image and corresponds to the first 3D object, in the same manner while changing the attribute of the first 3D object according to the received second command.
The at least one processor 180 may further obtain the 3D image including the one or more segmented 3D objects based on the plurality of 2D images through a 3D segment model 1100, wherein the 3D segment model 1100 may be a model that segments an object by extracting global feature candidates representing each segment based on the plurality of 2D images, clustering the extracted global feature candidates, and grouping segments corresponding to the same object.
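As a non-limiting illustration, the clustering and grouping steps may be sketched in Python as follows. The disclosure does not name a clustering algorithm; this sketch substitutes a simple greedy clustering on cosine similarity and assumes that a feature vector has already been extracted for each per-view segment. The function names `cosine_similarity` and `group_segments`, the segment identifiers, the feature values, and the threshold of 0.9 are all hypothetical.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_segments(candidates: list[tuple[str, list[float]]], threshold: float = 0.9) -> list[list[str]]:
    """Greedily cluster per-view segments whose features point in similar directions.

    Each candidate is (segment id, feature vector). A segment joins the most
    similar existing cluster if the similarity reaches the threshold;
    otherwise it starts a new cluster. Each returned group lists the segments
    treated as belonging to the same object.
    """
    clusters: list[dict] = []
    for segment_id, feature in candidates:
        best, best_similarity = None, threshold
        for cluster in clusters:
            similarity = cosine_similarity(feature, cluster["centroid"])
            if similarity >= best_similarity:
                best, best_similarity = cluster, similarity
        if best is None:
            clusters.append({"centroid": list(feature), "members": [segment_id]})
        else:
            count = len(best["members"])
            # Update the running mean of the cluster's feature vectors.
            best["centroid"] = [(c * count + f) / (count + 1) for c, f in zip(best["centroid"], feature)]
            best["members"].append(segment_id)
    return [cluster["members"] for cluster in clusters]


# Two views, each with two segments; the feature values are made up.
candidates = [
    ("view0/seg0", [1.0, 0.0]),
    ("view0/seg1", [0.0, 1.0]),
    ("view1/seg0", [0.98, 0.05]),
    ("view1/seg1", [0.1, 0.95]),
]
print(group_segments(candidates))
```

A greedy scheme depends on the order of the candidates and on the chosen threshold; it is used here only because it is short, not because the disclosure suggests it.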
The functions of the elements disclosed in the present invention may be implemented using circuits or processing circuits including general purpose processors, special purpose processors, integrated circuits, ASICs (application specific integrated circuits), conventional circuits and/or combinations thereof. A processor may be defined as a processing circuit or circuits including transistors and other circuits.
In the present invention, a circuit, a unit, or a means may be hardware designed or programmed to perform the specified function. The hardware may be the hardware disclosed in the present invention or other known hardware programmed or configured to perform the specified function. If the hardware is a processor, which may be considered a type of circuit, the circuits, means, or units may be a combination of hardware and software, and the software may be used to configure the hardware and/or the processor.
The above-described present disclosure may be implemented as computer-readable code on a medium in which a program is recorded. The computer-readable medium includes all kinds of recording devices in which data that may be read by a computer system is stored. Examples of the computer-readable medium include a hard disk drive (HDD), a solid state disk (SSD), a silicon disk drive (SDD), a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like. In addition, the computer may include the processor 180 of an artificial intelligence device.
July 7, 2025
January 8, 2026