US-20250384669-A1

Device and Method for Retrieving Multimodal Object Based on Composite Embedding

Published: December 18, 2025

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

Provided are a device and method for extracting a multimodal object on the basis of composite embedding. The device extracts training natural language text and training images from a training data storage, generates image composite embeddings including embeddings of the training images and key objects included in the training images, generates natural language composite embeddings on the basis of the training natural language text, and measures multimodal similarities between the image composite embeddings and the natural language composite embeddings.

Patent Claims

. A device for extracting a multimodal object on the basis of composite embedding, the device comprising:

. The device of, wherein the at least one processor measures errors of the multimodal similarities on the basis of ground truth information extracted from the training data storage.

. The device of, wherein the at least one processor extracts a positive image related to the training natural language text from the training data storage, extracts one or more negative images unrelated to the training natural language text from the training data storage, and configures the training images including the positive image and the negative images.

. The device of, wherein the at least one processor generates image embeddings for the training images using an image embedding model, extracts the key objects from the training images using a key object extraction model, generates key object embeddings which are the embeddings for the key objects using a key object embedding model, and generates the image composite embeddings on the basis of the image embeddings and the key object embeddings.

. The device of, wherein the at least one processor generates natural language embeddings which are the embeddings of the training natural language text using a natural language embedding model, extracts the key words of the training natural language text using a key word extraction model, generates key word embeddings which are the embeddings of the key words using the natural language embedding model, and generates the natural language composite embeddings on the basis of the natural language embeddings and the key word embeddings.

. The device of, wherein the at least one processor updates parameters of the image embedding model and the key object embedding model on the basis of errors of the multimodal similarities.

. The device of, wherein the at least one processor updates parameters of the natural language embedding model on the basis of errors of the multimodal similarities.

. A device for extracting a multimodal object on the basis of composite embedding, the device comprising:

. The device of, wherein the at least one processor generates the image embeddings using an image embedding model, extracts the key objects from all the training images using a key object extraction model, generates the key object embeddings using a key object embedding model, and generates the image composite embeddings on the basis of the image embeddings and the key object embeddings.

. The device of, wherein the image embedding model and the key object embedding model are pretrained models.

. The device of, wherein the at least one processor generates natural language embeddings which are the embeddings of the training natural language text using a natural language embedding model, extracts the key words of the training natural language text using a key word extraction model, generates key word embeddings which are the embeddings of the key words using the natural language embedding model, generates the natural language composite embeddings on the basis of the natural language embeddings and the key word embeddings, and updates parameters of the natural language embedding model on the basis of errors of the multimodal similarities.

. A method of extracting a multimodal object on the basis of composite embedding, the method comprising:

. The method of, further comprising measuring, by the device, errors of the multimodal similarities on the basis of ground truth information extracted from the training data storage.

. The method of, wherein the extracting of the training natural language text and the training images comprises extracting, by the device, a positive image related to the training natural language text from the training data storage, extracting one or more negative images unrelated to the training natural language text from the training data storage, and configuring the training images including the positive image and the negative images.

. The method of, wherein the generating of the image composite embedding comprises generating, by the device, image embeddings for the training images using an image embedding model, extracting the key objects from the training images using a key object extraction model, generating key object embeddings which are the embeddings for the key objects using a key object embedding model, and generating the image composite embeddings on the basis of the image embeddings and the key object embeddings.

. The method of, wherein the generating of the natural language composite embeddings comprises generating, by the device, natural language embeddings which are the embeddings of the training natural language text using a natural language embedding model, extracting the key words of the training natural language text using a key word extraction model, generating key word embeddings which are the embeddings of the key words using the natural language embedding model, and generating the natural language composite embeddings on the basis of the natural language embeddings and the key word embeddings.

. The method of, further comprising updating, by the device, parameters of the image embedding model and the key object embedding model on the basis of errors of the multimodal similarities.

. The method of, further comprising updating, by the device, parameters of the natural language embedding model on the basis of errors of the multimodal similarities.

Detailed Description

This application claims priority to and the benefit of Korean Patent Application No. 10-2024-0077049, filed on Jun. 13, 2024, the disclosure of which is incorporated herein by reference in its entirety.

The present invention relates to a device and method for extracting an image object having the same meaning as a natural language in a multimodal environment on the basis of composite embedding using deep learning.

Multimodal object extraction, which recognizes the connection between a natural language and an image and extracts an image object related to the natural language, is being researched on the basis of artificial intelligence (AI). Lately, AI models pretrained using a huge amount of data have been provided to recognize the connection between a natural language and an image.

For multimodal object extraction, a natural language and an image are recognized to measure the similarity between the natural language and the image. To this end, each of the natural language and the image is embedded, and similarity is measured between the natural language embedding and the image embedding. However, image data and natural language data may be simple or complex. For example, in the case of natural language data, there is a simple sentence such as “Get me the milk,” and there is a complex sentence such as “I am proud of my son for waking up early this morning, washing his face and brushing his teeth as usual in spite of his young age, and I would like to give him some fresh milk to go with his apple and egg to help him grow up healthy, and I wonder if you could see if there is any delivered milk from the cow farm this morning and bring it to me so that I can help him grow up healthy and strong.” The accuracy of multimodal object extraction is affected by data used to train an AI model, and it may be difficult for an AI model trained to extract image objects associated with simple natural language text to accurately extract image objects associated with complex natural language text. Also, an AI model trained using images only including simple image objects may not be appropriate for extracting an image object associated with a natural language from an image including various image objects. Therefore, a method and device are necessary to determine the meanings of complex natural language text and a complex image and accurately extract a multimodal object.

The present invention is directed to providing a device and method for extracting a multimodal object on the basis of composite embedding to accurately extract a multimodal object from a complex natural language and complex image data.

Specifically, the present invention is directed to providing a device and method for extracting a multimodal object on the basis of composite embedding, which generate a natural language embedding including an overall sentence embedding and key word embedding information of a natural language, generate an image embedding including overall image embedding and key object embedding information of an image to be extracted, and extract an image that is most closely related to the natural language from a large number of images using the natural language embedding and the image embedding.

Objects of the present invention are not limited to those described above, and other objects which have not been described will be clearly understood by those of ordinary skill in the art from the following description.

According to an aspect of the present invention, there is provided a device for extracting a multimodal object on the basis of composite embedding, the device including a training data extraction module configured to extract training natural language text and training images from a training data storage, an image composite embedding generation module configured to generate image composite embeddings including embeddings of the training images and key objects included in the training images, a natural language composite embedding generation module configured to generate natural language composite embeddings including embeddings of the training natural language text and key words included in the training natural language text, and a multimodal similarity measurement module configured to measure multimodal similarities between the image composite embeddings and the natural language composite embeddings.

The device may further include an error measurement module configured to measure errors of the multimodal similarities on the basis of ground truth information extracted from the training data storage.
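As a minimal sketch of what such an error measurement could look like, the Python below computes a softmax cross-entropy over the k+1 similarities, treating the index of the positive image as the ground truth. The choice of cross-entropy and the name `similarity_error` are assumptions made for illustration; the text above does not name a specific error function.

```python
import math


def similarity_error(similarities, positive_index):
    """Softmax cross-entropy over k+1 similarities (assumed error function).

    The error is small when the positive image has the highest similarity
    and grows when a negative image is scored higher.
    """
    peak = max(similarities)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in similarities]
    return -math.log(exps[positive_index] / sum(exps))


# Index 0 holds the positive image in both hypothetical cases.
good = similarity_error([0.9, 0.1, 0.2], positive_index=0)
bad = similarity_error([0.1, 0.9, 0.2], positive_index=0)
```

Under this assumption, `good` is smaller than `bad`, so the error can serve as a signal for updating the embedding models.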

The training data extraction module may extract a positive image related to the training natural language text from the training data storage, extract one or more negative images unrelated to the training natural language text from the training data storage, and configure (generate) the training images including the positive image and the negative images.

The image composite embedding generation module may generate image embeddings for the training images using an image embedding model, extract the key objects from the training images using a key object extraction model, generate key object embeddings which are the embeddings for the key objects using a key object embedding model, and generate the image composite embeddings on the basis of the image embeddings and the key object embeddings.

The natural language composite embedding generation module may generate natural language embeddings which are the embeddings of the training natural language text using a natural language embedding model, extract the key words of the training natural language text using a key word extraction model, generate key word embeddings which are the embeddings of the key words using the natural language embedding model, and generate the natural language composite embeddings on the basis of the natural language embeddings and the key word embeddings.

The image composite embedding generation module may update parameters of the image embedding model and the key object embedding model on the basis of errors of the multimodal similarities.

The natural language composite embedding generation module may update parameters of the natural language embedding model on the basis of errors of the multimodal similarities.

According to another aspect of the present invention, there is provided a device for extracting a multimodal object on the basis of composite embedding, the device including an image composite embedding generation module configured to generate, for each of training images stored in a training data storage, image composite embeddings including image embeddings for all the training images and key object embeddings which are embeddings of key objects included in all the training images and store the image composite embeddings in an image composite embedding storage, a training data extraction module configured to extract training natural language text and one or more training images from the training data storage, a natural language composite embedding generation module configured to generate natural language composite embeddings including embeddings of the training natural language text and key words included in the training natural language text, an image composite embedding extraction module configured to extract image composite embeddings matching the one or more training images from the image composite embedding storage, and a multimodal similarity measurement module configured to measure multimodal similarities between the image composite embeddings matching the one or more training images and the natural language composite embeddings.

The image composite embedding generation module may generate the image embeddings using an image embedding model, extract the key objects from all the training images using a key object extraction model, generate the key object embeddings using a key object embedding model, and generate the image composite embeddings on the basis of the image embeddings and the key object embeddings.

The image embedding model and the key object embedding model may be pretrained models.
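When the image-side models are pretrained and not updated, the image composite embeddings can be computed once and looked up during training. The following Python is a sketch of that idea only, using a plain dictionary as the image composite embedding storage; `toy_composite_embed` is a hypothetical stand-in for the whole image composite embedding pipeline, not a model named in the text.

```python
calls = []


def toy_composite_embed(pixels):
    """Hypothetical stand-in for the image composite embedding pipeline."""
    calls.append(1)  # count how often the (expensive) pipeline runs
    return [sum(pixels) / len(pixels), float(max(pixels))]


def build_embedding_storage(images, embed_fn):
    """Embed every stored training image once, keyed by image id."""
    return {image_id: embed_fn(pixels) for image_id, pixels in images.items()}


def lookup(storage, image_ids):
    """Fetch the precomputed composite embeddings for the sampled images."""
    return [storage[image_id] for image_id in image_ids]


images = {"a": [0, 2, 4], "b": [1, 1, 1]}
storage = build_embedding_storage(images, toy_composite_embed)
batch = lookup(storage, ["b", "a", "b"])
```

Here the pipeline runs once per stored image regardless of how often an image is sampled for training.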

The natural language composite embedding generation module may generate natural language embeddings which are the embeddings of the training natural language text using a natural language embedding model, extract the key words of the training natural language text using a key word extraction model, generate key word embeddings which are the embeddings of the key words using the natural language embedding model, generate the natural language composite embeddings on the basis of the natural language embeddings and the key word embeddings, and update parameters of the natural language embedding model on the basis of errors of the multimodal similarities.

According to another aspect of the present invention, there is provided a method of extracting a multimodal object on the basis of composite embedding, the method including extracting, by a training data extraction module, training natural language text and training images from a training data storage, generating, by an image composite embedding generation module, image composite embeddings including embeddings of the training images and the key objects included in the training images, generating, by a natural language composite embedding generation module, natural language composite embeddings including embeddings of the training natural language text and the key words included in the training natural language text, and measuring, by a multimodal similarity measurement module, multimodal similarities between the image composite embeddings and the natural language composite embeddings.

The method may further include measuring, by an error measurement module, errors of the multimodal similarities on the basis of ground truth information extracted from the training data storage.

The extracting of the training natural language text and the training images may include extracting, by the training data extraction module, a positive image related to the training natural language text from the training data storage, extracting one or more negative images unrelated to the training natural language text from the training data storage, and configuring (generating) the training images including the positive image and the negative images.

The generating of the image composite embedding may include generating, by the image composite embedding generation module, image embeddings for the training images using an image embedding model, extracting the key objects from the training images using a key object extraction model, generating key object embeddings which are the embeddings for the key objects using a key object embedding model, and generating the image composite embeddings on the basis of the image embeddings and the key object embeddings.

The generating of the natural language composite embeddings may include generating, by the natural language composite embedding generation module, natural language embeddings which are the embeddings of the training natural language text using a natural language embedding model, extracting the key words of the training natural language text using a key word extraction model, generating key word embeddings which are the embeddings of the key words using the natural language embedding model, and generating the natural language composite embeddings on the basis of the natural language embeddings and the key word embeddings.

The method may further include updating, by the image composite embedding generation module, parameters of the image embedding model and the key object embedding model on the basis of errors of the multimodal similarities.

The method may further include updating, by the natural language composite embedding generation module, parameters of the natural language embedding model on the basis of errors of the multimodal similarities.

The present invention relates to a device and method for extracting a multimodal object on the basis of composite embedding to extract an image object having the same meaning as a natural language in a multimodal environment on the basis of composite embedding using deep learning. More specifically, the present invention relates to a device and method for extracting a multimodal object on the basis of composite embedding, which generate a natural language embedding including an overall sentence embedding and key word embedding information of a natural language, generate an image embedding including overall image embedding and key object embedding information of an image to be extracted, and extract an image that is most closely related to the natural language from a large number of images using the natural language embedding and the image embedding.

Advantages and features of the present invention and methods of achieving them will become clear with reference to exemplary embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and fully convey the scope of the present invention to those skilled in the technical field to which the present invention pertains, and the present invention is only defined by the scope of the claims. Meanwhile, terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms also include plural forms unless specifically stated otherwise. As used herein, “comprise” and/or “comprising” do not preclude the presence or addition of one or more components, steps, operations, and/or elements other than stated components, steps, operations, and/or elements.

Although the terms “first,” “second,” and the like may be used to describe various components, the components are not limited by the terms. These terms are only used to distinguish one component from others. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is referred to as being “connected” or “coupled” to another component, it should be understood that the two components may be directly coupled or connected to each other, or still another component may be interposed therebetween. On the other hand, when a component is referred to as being “directly connected” or “directly coupled” to another component, it should be understood that there is no intermediate component. Other expressions describing relationships between components, such as “between,” “directly between,” “neighboring,” “directly neighboring,” and the like, should be similarly interpreted.

In describing the present invention, detailed description of related well-known technology that is determined to unnecessarily obscure the subject matter of the present invention will be omitted.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the present invention, to facilitate overall understanding, the same reference numeral will be used for the same element throughout the drawings.

The present specification discloses a device and method for extracting a multimodal object on the basis of composite embedding to recognize an image having a meaning related to a natural language from the natural language and image data using a composite embedding-based deep learning model.

Functions of modules included in a device for extracting a multimodal object on the basis of composite embedding will be described below with reference to the accompanying drawings.

The drawings referred to below are a block diagram of a device for extracting a multimodal object on the basis of composite embedding according to a first exemplary embodiment of the present invention, a diagram illustrating an image composite embedding generation process of an image composite embedding generation module according to the first exemplary embodiment of the present invention, and a diagram illustrating a natural language composite embedding generation process of a natural language composite embedding generation module according to the first exemplary embodiment of the present invention.

Functions of a device for extracting a multimodal object on the basis of composite embedding according to the first exemplary embodiment of the present invention will be described below with reference to the drawings.

Referring to the block diagram, the device for extracting a multimodal object on the basis of composite embedding according to the exemplary embodiment of the present invention includes a training data storage, a training data extraction module, an image composite embedding generation module, a natural language composite embedding generation module, a multimodal similarity measurement module, and an error measurement module. The device shown in the block diagram is in accordance with the exemplary embodiment, and components of the device for extracting a multimodal object on the basis of composite embedding according to the present invention are not limited to the embodiment shown and may be added, changed, or removed as necessary.

The training data storage stores training images and training natural language text used for training deep learning models, such as an image embedding model, a key object extraction model, a key object embedding model, a natural language embedding model, a key word extraction model, and the like, using supervised learning.

The training data extraction module extracts the training natural language text from the training data storage and extracts, from the training data storage, one positive image PI related to the extracted training natural language text and k negative images NI_1, ..., and NI_k unrelated to the extracted training natural language text. Here, k is a natural number of 1 or more. The positive image PI and the k negative images NI_1, ..., and NI_k are the training images. For deep learning of the image composite embedding generation module and the natural language composite embedding generation module, the training data extraction module provides the training images and the training natural language text to the image composite embedding generation module and the natural language composite embedding generation module, respectively.
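The sampling of one positive image and k negative images can be sketched in Python as follows. This is an illustration under assumptions: the `related` mapping (text id to related image ids) stands in for the ground truth held in the training data storage, and all identifiers are hypothetical.

```python
import random


def build_training_sample(text_id, related, all_image_ids, k, rng=None):
    """Return (positive_id, negative_ids) for one training text.

    related: dict mapping a text id to the set of image ids related to it
    (a hypothetical form of the stored ground truth).
    """
    rng = rng or random.Random()
    positive = rng.choice(sorted(related[text_id]))
    candidates = [i for i in all_image_ids if i not in related[text_id]]
    if len(candidates) < k:
        raise ValueError("not enough unrelated images to draw k negatives")
    negatives = rng.sample(candidates, k)  # k distinct unrelated images
    return positive, negatives


related = {"t1": {"img_milk"}}
images = ["img_milk", "img_car", "img_dog", "img_tree"]
positive, negatives = build_training_sample(
    "t1", related, images, k=2, rng=random.Random(0)
)
training_images = [positive] + negatives  # k + 1 images, positive first
```

Keeping the positive image at a known position makes it straightforward to compare the later similarities against the ground truth.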

As shown in the diagram of the image composite embedding generation process, the image composite embedding generation module generates k+1 image composite embeddings using the image embedding model, the key object extraction model, and the key object embedding model on the basis of the training images received from the training data extraction module.

As a deep learning model, the image embedding model receives the training images and generates an image embedding for each of the training images PI, NI_1, ..., and NI_k. For example, the image embedding model receives a positive image (e.g., PI) and generates an image embedding (e.g., IE).

As a deep learning model, the key object extraction model extracts key objects from the training images. For example, the key object extraction model receives a positive image (e.g., PI) and extracts key objects O_1, ..., and O_m from the positive image. Here, m is an integer of 0 or more. For example, after performing segmentation, the key object extraction model may apply criteria, such as size (the largest object), location (an object at the center), color (an object with a different color from its surroundings), and the like, to extract key objects. The key object extraction model may be a well-known object detection model. For example, the key object extraction model may be a You Only Look Once (YOLO) model or a Faster region-based convolutional neural network (Faster R-CNN)-based model.
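The size and location criteria can be illustrated with a small Python sketch that ranks already-detected bounding boxes. The scoring formula (area fraction plus a centrality term, equally weighted) and the function name `select_key_objects` are assumptions for illustration; the text does not specify how the criteria are combined, and the color criterion is omitted here.

```python
def select_key_objects(boxes, image_w, image_h, m):
    """Pick up to m key objects from detections.

    boxes: list of (label, x0, y0, x1, y1) in pixels, e.g. from a detector.
    """
    def score(box):
        _, x0, y0, x1, y1 = box
        # Size criterion: fraction of the image covered by the box.
        area = (x1 - x0) * (y1 - y0) / (image_w * image_h)
        # Location criterion: 1.0 at the image center, lower further away.
        dx = ((x0 + x1) / 2 - image_w / 2) / image_w
        dy = ((y0 + y1) / 2 - image_h / 2) / image_h
        centrality = 1.0 - 2.0 * (dx * dx + dy * dy) ** 0.5
        return area + centrality  # equal weighting is an assumption

    return sorted(boxes, key=score, reverse=True)[:m]


detections = [
    ("lamp", 0, 0, 10, 10),
    ("table", 0, 50, 100, 100),
    ("cup", 40, 40, 60, 60),
]
key_objects = select_key_objects(detections, image_w=100, image_h=100, m=2)
labels = [box[0] for box in key_objects]  # ["cup", "table"]
```

With m = 0 the function returns an empty list, which matches the case where an image has no key objects.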

As a deep learning model, the key object embedding model receives key objects and generates embeddings (key object embeddings) for each of the key objects. For example, the key object embedding model may receive the key objects O_1, ..., and O_m and generate key object embeddings OE_1, ..., and OE_m for each of the key objects.

The image composite embedding generation module generates an image composite embedding (e.g., PICE) on the basis of an image embedding (e.g., IE) for a specific training image (e.g., PI) and key object embeddings (e.g., OE_1, ..., and OE_m). For example, the image composite embedding generation module may generate image composite embeddings PICE, NICE_1, ..., and NICE_k by concatenating the image embeddings and key object embeddings of the k+1 training images PI, NI_1, ..., and NI_k, respectively.
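Concatenation of an image embedding with its key object embeddings can be sketched in Python as below. One detail is an assumption added here and not stated in the text: because the number of key objects m varies per image, the sketch zero-pads (or truncates) to a fixed number of object slots, `max_objects`, so that every composite embedding has the same length.

```python
def image_composite_embedding(image_emb, object_embs, max_objects):
    """Concatenate IE with OE_1..OE_m, padded to max_objects slots.

    The fixed slot count and zero padding are illustrative assumptions.
    """
    dim = len(image_emb)
    if any(len(e) != dim for e in object_embs):
        raise ValueError("all embeddings must share one dimension")
    slots = [list(e) for e in object_embs[:max_objects]]
    slots += [[0.0] * dim for _ in range(max_objects - len(slots))]
    composite = list(image_emb)
    for emb in slots:
        composite.extend(emb)
    return composite


pice = image_composite_embedding(
    image_emb=[0.1, 0.2],
    object_embs=[[1.0, 0.0]],  # m = 1 key object
    max_objects=2,
)
# pice == [0.1, 0.2, 1.0, 0.0, 0.0, 0.0]
```

Other ways of handling a variable m, such as pooling the key object embeddings, would be equally consistent with the description.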

As shown in the diagram of the natural language composite embedding generation process, the natural language composite embedding generation module generates a natural language composite embedding TCE using the natural language embedding model and the key word extraction model on the basis of the training natural language text (T) received from the training data extraction module. The training natural language text (T) may include one sentence or a plurality of sentences.

As a deep learning model, the natural language embedding model receives the training natural language text (T) and generates a natural language embedding TE, which is an embedding for the natural language text.

As a deep learning model, the key word extraction model receives the training natural language text (T) and extracts n key words W_1, ..., and W_n from the received training natural language text (T). Here, n is an integer of 0 or more. For example, the key word extraction model may extract key words from the training natural language text on the basis of information about frequencies, positions, co-occurrence, and the like of words included in the training natural language text.
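A frequency-and-position heuristic of this kind can be sketched in Python as follows. This is a simple statistical stand-in rather than a deep learning model, and the stop-word list, tokenization rule, and the function name `extract_key_words` are assumptions for illustration; co-occurrence is not used in this sketch.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumption, not from the text).
STOP_WORDS = {"a", "and", "from", "i", "me", "my", "of", "please",
              "the", "this", "to"}


def extract_key_words(text, n):
    """Return up to n key words ranked by frequency, then first position."""
    tokens = re.findall(r"[a-z']+", text.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(content)
    first_pos = {}
    for pos, tok in enumerate(tokens):
        first_pos.setdefault(tok, pos)  # remember earliest occurrence
    ranked = sorted(counts, key=lambda w: (-counts[w], first_pos[w]))
    return ranked[:n]


text = ("Please bring the fresh milk, "
        "the milk delivered from the farm this morning")
key_words = extract_key_words(text, n=3)  # ["milk", "bring", "fresh"]
```

Here "milk" ranks first because it occurs twice, and ties among the remaining words are broken by how early each word appears.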

The natural language composite embedding generation module inputs the key words W_1, ..., and W_n to the natural language embedding model to generate key word embeddings WE_1, ..., and WE_n.

The natural language composite embedding generation module generates the natural language composite embedding TCE on the basis of the natural language embedding TE and the n key word embeddings WE_1, ..., and WE_n. For example, the natural language composite embedding generation module may generate the natural language composite embedding TCE by concatenating the natural language embedding TE and the n key word embeddings WE_1, ..., and WE_n.
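The point that one natural language embedding model embeds both the full text and each key word can be sketched in Python as below. `toy_embed` is a hypothetical deterministic stand-in for the natural language embedding model, and the zero-padding to `max_words` slots is an assumption added so that composite embeddings have a fixed length.

```python
def toy_embed(text, dim=4):
    """Hypothetical stand-in for the natural language embedding model."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def text_composite_embedding(text, key_words, embed, max_words):
    """Concatenate TE with WE_1..WE_n, using the same model for both."""
    composite = list(embed(text))  # TE: embedding of the whole text
    dim = len(composite)
    used = key_words[:max_words]
    for word in used:
        composite.extend(embed(word))  # WE_i: same model, key word input
    composite.extend([0.0] * dim * (max_words - len(used)))  # assumed padding
    return composite


tce = text_composite_embedding(
    "Get me the milk", ["milk"], toy_embed, max_words=3
)
```

In this sketch `tce` has 4 * (1 + 3) = 16 entries: the text embedding, one key word embedding, and two zero-filled slots.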

Referring back to the block diagram, functions of the multimodal similarity measurement module and the error measurement module will be described.

The multimodal similarity measurement module measures similarity between the natural language composite embedding TCE and each of the image composite embeddings PICE, NICE_1, ..., and NICE_k. In other words, the multimodal similarity measurement module generates k+1 multimodal similarities by measuring similarity between one natural language composite embedding and the k+1 image composite embeddings.
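The k+1 similarities can be sketched in Python as follows. Cosine similarity is used here as an assumption, since the text does not name the similarity measure, and the sketch also assumes the text and image composite embeddings have the same length. The vectors are made-up toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def multimodal_similarities(tce, image_composites):
    """One similarity per image composite embedding (k + 1 values)."""
    for emb in image_composites:
        if len(emb) != len(tce):
            raise ValueError("composite embeddings must have equal length")
    return [cosine(tce, emb) for emb in image_composites]


tce = [1.0, 0.0, 1.0, 0.0]
pice = [1.0, 0.0, 1.0, 0.0]    # positive image
nice_1 = [0.0, 1.0, 0.0, 1.0]  # negative image 1
nice_2 = [1.0, 0.0, 0.0, 0.0]  # negative image 2
sims = multimodal_similarities(tce, [pice, nice_1, nice_2])
best = max(range(len(sims)), key=sims.__getitem__)  # index of best match
```

In this toy case the positive image at index 0 receives the highest similarity, which is the outcome training is meant to encourage.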
