US-10949718

Multi-modal visual question answering system

Published: March 16, 2021
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

The systems and methods described herein may generate multi-modal embeddings with sub-symbolic features and symbolic features. The sub-symbolic embeddings may be generated with computer vision processing. The symbolic features may include mathematical representations of image content, which are enriched with information from background knowledge sources. The system may aggregate the sub-symbolic and symbolic features using aggregation techniques such as concatenation, averaging, summing, and/or maxing. The multi-modal embeddings may be included in a multi-modal embedding model and trained via supervised learning. Once the multi-modal embeddings are trained, the system may generate inferences based on linear algebra operations involving the multi-modal embeddings that are relevant to an inference response to the natural language question and input image.
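
The four aggregation techniques named in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names, the `AGGREGATORS` table, and the requirement that element-wise methods receive equal-length vectors are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Vector = List[float]


def _elementwise(op: Callable[[float, float], float], a: Vector, b: Vector) -> Vector:
    """Apply a binary operation position by position (assumes equal lengths)."""
    if len(a) != len(b):
        raise ValueError("element-wise aggregation needs vectors of equal length")
    return [op(x, y) for x, y in zip(a, b)]


def concatenate(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    # Concatenation keeps both feature sets intact, side by side.
    return symbolic + sub_symbolic


def average(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    return _elementwise(lambda x, y: (x + y) / 2.0, symbolic, sub_symbolic)


def summation(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    return _elementwise(lambda x, y: x + y, symbolic, sub_symbolic)


def maximum(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    return _elementwise(max, symbolic, sub_symbolic)


# Hypothetical lookup table mapping a method name to its aggregator.
AGGREGATORS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "concatenate": concatenate,
    "average": average,
    "sum": summation,
    "max": maximum,
}


def aggregate(symbolic: Vector, sub_symbolic: Vector, method: str = "concatenate") -> Vector:
    """Combine a symbolic and a sub-symbolic vector into one multi-modal vector."""
    return AGGREGATORS[method](symbolic, sub_symbolic)
```

Concatenation is the only one of the four that tolerates vectors of different lengths, which may be why the dependent claims single it out.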

Patent Claims
20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for visual question inference, the method comprising: receiving an input image and a natural language query; determining content classifications for portions of the input image; generating a scene graph for the input image, the scene graph including the content classifications arranged in a graph data structure comprising nodes and edges, the nodes respectively representative of the content classifications for the input image and the edges representative of relationships between the content classifications; generating multi-modal embeddings based on the input image and the scene graph, the multi-modal embeddings being respectively associated with the nodes, the edges, or any combination thereof, wherein at least a portion of the multi-modal embeddings are generated by: determining symbolic embeddings for the content classifications of the input image, the symbolic embeddings representative of nodes of the scene graph, edges of the scene graph, or any combination thereof; determining a sub-symbolic embedding for the input image, the sub-symbolic embedding comprising an image feature vector for the input image; identifying separate portions of the image feature vector that are representative of the portions of the input image; generating weighted sub-symbolic embeddings for each of the content classifications by applying weight values to the separate portions of the image feature vector; aggregating the symbolic embeddings with the weighted sub-symbolic embeddings to form at least the portion of the multi-modal embeddings; generating a natural language response to the natural language query based on the multi-modal embeddings by: generating an inference query based on the natural language query, the inference query indicative of the at least one of the content classifications; selecting, from the multi-modal embeddings, particular multi-modal embeddings associated with at least one of the content classifications; determining an inference statement based on a distance measurement between the particular multi-modal embeddings; and determining the natural language response based on the inference statement; and displaying, in response to receipt of the natural language query and the input image, the natural language response.
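
The scene graph recited in claim 1 is a graph whose nodes are content classifications and whose edges are relationships between them. One possible in-memory representation is sketched below. The class names `Edge` and `SceneGraph`, the method names, and the use of plain string labels are hypothetical choices for illustration; the claim does not prescribe a concrete data layout.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Edge:
    """A directed relationship between two content classifications."""
    subject: str
    relation: str
    obj: str


@dataclass
class SceneGraph:
    """Nodes are content classifications; edges relate pairs of nodes."""
    nodes: Set[str] = field(default_factory=set)
    edges: List[Edge] = field(default_factory=list)

    def add_relation(self, subject: str, relation: str, obj: str) -> None:
        # Adding an edge implicitly registers both endpoints as nodes.
        self.nodes.update((subject, obj))
        self.edges.append(Edge(subject, relation, obj))

    def relations_of(self, node: str) -> List[Edge]:
        """Return every edge in which the node appears as subject or object."""
        return [e for e in self.edges if node in (e.subject, e.obj)]


# Example: a small graph for an image of a dog on grass near a ball.
graph = SceneGraph()
graph.add_relation("dog", "on", "grass")
graph.add_relation("dog", "near", "ball")
```

Storing edges as (subject, relation, object) triples keeps the structure compatible with typical knowledge graph tooling, where each node and each relation can later be assigned an embedding.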

2. The method of claim 1, wherein aggregating the symbolic embeddings with the weighted sub-symbolic embeddings to form the multi-modal embeddings further comprises: concatenating a first vector from the symbolic embeddings with a second vector from the weighted sub-symbolic embeddings to form a multi-modal vector.

3. The method of claim 1, wherein determining an inference statement based on a distance measurement between the particular multi-modal embeddings further comprises: generating a plurality of candidate statements, each of the candidate statements referencing at least one node and at least one edge of the scene graph; selecting, from the multi-modal embeddings, groups of multi-modal embeddings based on the at least one node and the at least one edge of the scene graph; determining respective scores for the plurality of candidate statements based on distance measurements between multi-modal embeddings in each of the groups; selecting, based on the respective scores, at least one of the candidate statements; and generating the natural language response based on the selected at least one of the selected candidate statements.

4. The method of claim 3, wherein selecting, based on the respective scores, at least one of the candidate statements further comprises: selecting a candidate statement associated with a highest one of the respective scores.
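
Claims 3 and 4 describe scoring candidate statements by distance and picking the highest score. The claims do not say which distance is used, so the sketch below assumes a translation-based measure common in knowledge graph embedding (the distance between `head + relation` and `tail`) and defines the score as the negated distance so that a smaller distance gives a higher score. All names here are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Triple = Tuple[str, str, str]  # (subject node, edge label, object node)


def translation_distance(head: Vector, relation: Vector, tail: Vector) -> float:
    """Euclidean distance between (head + relation) and tail (assumed measure)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def score_candidates(
    candidates: List[Triple], embeddings: Dict[str, Vector]
) -> List[Tuple[Triple, float]]:
    """Score each candidate statement; the score is the negated distance."""
    scored = []
    for subject, relation, obj in candidates:
        distance = translation_distance(
            embeddings[subject], embeddings[relation], embeddings[obj]
        )
        scored.append(((subject, relation, obj), -distance))
    return scored


def best_candidate(candidates: List[Triple], embeddings: Dict[str, Vector]) -> Triple:
    """Select the candidate statement with the highest score."""
    return max(score_candidates(candidates, embeddings), key=lambda pair: pair[1])[0]


# Toy embeddings, invented for the example.
toy_embeddings = {
    "dog": [0.0, 0.0],
    "on": [1.0, 0.0],
    "grass": [1.0, 0.0],
    "sky": [5.0, 5.0],
}
answer = best_candidate([("dog", "on", "grass"), ("dog", "on", "sky")], toy_embeddings)
```

With these toy vectors, "dog on grass" has distance zero and is selected over "dog on sky".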

5. The method of claim 1, further comprising: enriching the scene graph by appending additional nodes to the scene graph with nodes being sourced from a background knowledge graph; and generating the multi-modal embeddings based on the input image and the enriched scene graph.

6. The method of claim 5, wherein enriching the scene graph by appending additional nodes to the scene graph with the additional nodes being sourced from a background knowledge graph further comprises: identifying, in the background knowledge graph, a first node of the scene graph that corresponds to a second node of the background knowledge graph; selecting further nodes of the background knowledge graph that are connected with the second node of the background knowledge graph, wherein the selected further nodes are not included in the non-enriched scene graph; and appending the selected further nodes to the scene graph.
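
The enrichment steps in claims 5 and 6 can be sketched as a one-hop expansion over a background knowledge graph. The sketch rests on two assumptions that the claims do not state: a scene graph node "corresponds" to a background node when their labels are identical, and the connecting background edge is copied along with each appended node. The function name and triple layout are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def enrich_scene_graph(
    scene_nodes: Set[str],
    scene_edges: List[Triple],
    background_edges: List[Triple],
) -> Tuple[Set[str], List[Triple]]:
    """Append background nodes that are directly connected to a scene node."""
    enriched_nodes = set(scene_nodes)
    enriched_edges = list(scene_edges)
    for subject, relation, obj in background_edges:
        # Check both directions: the scene node may be the subject or the object.
        for matched, neighbor in ((subject, obj), (obj, subject)):
            # Only neighbors absent from the non-enriched scene graph are appended.
            if matched in scene_nodes and neighbor not in scene_nodes:
                enriched_nodes.add(neighbor)
                if (subject, relation, obj) not in enriched_edges:
                    enriched_edges.append((subject, relation, obj))
    return enriched_nodes, enriched_edges


nodes, edges = enrich_scene_graph(
    {"dog", "grass"},
    [("dog", "on", "grass")],
    [("dog", "is_a", "animal"), ("cat", "is_a", "animal"), ("grass", "is_a", "plant")],
)
```

In this example "animal" and "plant" are appended because they neighbor scene nodes, while the edge involving "cat" is ignored because "cat" does not appear in the scene graph.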

7. The method of claim 1, further comprising: generating a graphical user interface, the graphical user interface comprising the input image and a text field; determining that the natural language query was inserted into the text field; and updating the graphical user interface to include the natural language response.

8. A system for visual question inference, the system comprising: a processor, the processor configured to: receive an input image and a natural language query; generate a scene graph for the input image, the scene graph comprising content classifications of image data for the input image, the content classifications being arranged in a graph data structure comprising nodes and edges, the nodes respectively representative of the content classifications for the input image and the edges representative of relationships between the content classifications; determine symbolic embeddings for the input image, the symbolic embeddings representative of nodes of the scene graph, edges of the scene graph, or any combination thereof; determine a sub-symbolic embeddings for the input image, the sub-symbolic embeddings comprising respective image feature vectors for the input image; aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings in a multi-modal embedding space; identify at least one node and at least one edge in the scene graph based on natural language included in the natural language query, the natural language text being indicative of at least one of the content classifications; select, from the multi-modal embeddings, particular multi-modal embeddings associated with the at least one of the content classifications; determine an inference statement based on a distance measurement between the selected multi-modal embeddings; generate a natural language response based on the inference statement; and display the natural language in response on a graphical user interface.

9. The system of claim 8, wherein to determine a sub-symbolic embeddings for the input image, the processor is further configured to: determine at least a portion of the input image that corresponds to the content classifications; generate an initial image feature vector for the input image; identify separate portions of the initial image feature vector, the separate portions of the initial image feature vector being representative of the at least the portion of the input image; apply weight values to the separate portions of the image feature vector; extract the separate weighted portions of the image feature vector; and generate the respective image feature vectors of the sub-symbolic embeddings, wherein each of the respective image feature vectors comprise a corresponding one of the separate weighted portions of the image feature vector.
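
The weighting and extraction steps of claim 9 can be illustrated with a flat feature vector. The sketch assumes that each content classification maps to a contiguous slice of the initial image feature vector and that each weight is a single scalar per classification. Neither detail is stated in the claim, and the function and parameter names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def weighted_sub_symbolic_embeddings(
    image_features: Vector,
    portions: Dict[str, Tuple[int, int]],
    weights: Dict[str, float],
) -> Dict[str, Vector]:
    """Weight and extract the slice of the feature vector for each classification.

    portions maps a content classification to a (start, end) slice of
    image_features; weights maps a classification to a scalar weight
    (defaulting to 1.0 when no weight is given).
    """
    embeddings: Dict[str, Vector] = {}
    for label, (start, end) in portions.items():
        weight = weights.get(label, 1.0)
        embeddings[label] = [weight * value for value in image_features[start:end]]
    return embeddings


region_vectors = weighted_sub_symbolic_embeddings(
    [1.0, 2.0, 3.0, 4.0],
    {"dog": (0, 2), "grass": (2, 4)},
    {"dog": 0.5},
)
```

Each resulting vector could then be aggregated with the symbolic embedding of the same classification to form a multi-modal embedding.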

10. The system of claim 8, wherein to aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings in a multi-modal embedding space, the processor is further configured to: concatenate a first feature vector from the symbolic embeddings with a second feature vector from the sub-symbolic embeddings to form a combined vector.

11. The system of claim 8, to determine an inference statement based on a distance measurement between the selected multi-modal embeddings, the processor is further configured to: generate a plurality of candidate statements, each of the candidate statements referencing at least one node and at least one edge of the scene graph; select, from the multi-modal embeddings, groups of multi-modal embeddings based on the at least one node and the at least one edge of the scene graph; determine respective scores for the plurality of candidate statements based on distance measurements between multi-modal embeddings in each of the groups of multi-modal embeddings; select, based on the respective scores, at least one of the candidate statements; and generate the natural language response based on the selected at least one of the candidate statements.

12. The system of claim 8, wherein the processor is further configured to: enrich the scene graph by appending additional nodes to the scene graph with nodes being sourced from a background knowledge graph, wherein to enrich the scene graph, the processor is further configured to: identify, in the background knowledge graph, a first node of the scene graph that corresponds to a second node of the background knowledge graph; select further nodes of the background knowledge graph that are connected with the second node of the background knowledge graph, wherein the selected further nodes are not included in the non-enriched scene graph; and append the selected further nodes to the scene graph.

13. The system of claim 8, wherein the scene graph is generated based on symbolic features in the input image.

14. The system of claim 8, further wherein the weight value to be applied is determined based on a region of the input image.

15. A non-transitory computer readable storage medium comprising computer executable instructions, the instructions executable by a processor to: receive an input image and a natural language query; generate a scene graph for the input image, the scene graph comprising content classifications of image data for the input image, the content classifications being arranged in a graph data structure comprising nodes and edges, the nodes respectively representative of the content classifications for the input image and the edges representative of relationships between the content classifications; determine symbolic embeddings for the input image, the symbolic embeddings representative of nodes of the scene graph, edges of the scene graph, or any combination thereof; determine a sub-symbolic embeddings for the input image, the sub-symbolic embeddings comprising respective image feature vectors for the input image; aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings in a multi-modal embedding space; identify at least one node and at least one edge in the scene graph based on natural language included in the natural language query, the natural language text being indicative of at least one of the content classifications; select, from the multi-modal embeddings, particular multi-modal embeddings associated with the at least one of the content classifications; determine an inference statement based on a distance measurement between the selected multi-modal embeddings; generate a natural language response based on the inference statement; and display the natural language in response on a graphical user interface.

16. The non-transitory computer readable storage medium of claim 15, wherein the instructions executable by the processor to determine a sub-symbolic embeddings for the input image further comprise instructions executable by the processor to: determine portions of the input image that correspond to the content classifications; generate an initial image feature vector for the input image; identify separate portions of the initial image feature vector that are representative of the portions of the input image; apply weight values to the separate portions of the image feature vector; extract the separate weighted portions of the image feature vector; and generate the respective image feature vectors of the sub-symbolic embeddings, wherein each of the respective image feature vectors comprise a corresponding one of the separate weighted portions of the image feature vector.

17. The non-transitory computer readable storage medium of claim 15, wherein the instructions executable by the processor to aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings further comprise instructions executable by the processor to: concatenate a first feature vector from the symbolic embeddings with a second feature vector from the sub-symbolic embeddings to form a combined vector.

18. The non-transitory computer readable storage medium of claim 15, wherein the instructions executable by the processor to determine an inference statement based on a distance measurement between the selected multi-modal embeddings further comprise instructions executable by the processor to: generate a plurality of candidate statements, each of the candidate statements referencing at least one node and at least one edge of the scene graph; select, from the multi-modal embeddings, groups of multi-modal embeddings based on the at least one node and the at least one edge of the scene graph; determine respective scores for the plurality of candidate statements based on distance measurements between multi-modal embeddings in each of the groups of multi-modal embeddings; select, based on the respective scores, at least one of the candidate statements; and generate the natural language response based on the selected at least one of the candidate statements.

19. The non-transitory computer readable storage medium of claim 15, further comprising instructions executable by the processor to: enrich the scene graph by appending additional nodes to the scene graph with nodes being sourced from a background knowledge graph, wherein the multi-modal embeddings are based on the input image and the enriched scene graph.

20. The non-transitory computer readable storage medium of claim 19, wherein the instructions executable by the processor to enrich the scene graph by appending nodes to the scene graph with nodes from a background knowledge graph further comprise instructions executable by the processor to: identify, in the background knowledge graph, a first node of the scene graph that corresponds to a second node of the background knowledge graph; select further nodes of the background knowledge graph that are connected with the second node of the background knowledge graph; and append the selected further nodes to the scene graph.

Patent Metadata

Filing Date

May 8, 2019

Publication Date

March 16, 2021

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Multi-modal visual question answering system” (US-10949718). https://patentable.app/patents/US-10949718
