11244167

Generating a Response to a User Query Utilizing Visual Features of a Video Segment and a Query-Response-Neural Network

Published: February 8, 2022
Assignee: not available in the USPTO data
Patent Claims
20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: extract a query vector from a question corresponding to a video segment; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment; generate a query-context vector by combining the query vector, the visual-context vectors, and the textual-context vectors; generate candidate-response vectors representing candidate responses to the question; and select a response from the candidate responses by comparing the query-context vector to the candidate-response vectors.
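
The last steps of claim 1 (combine the vectors, then compare against candidates) can be illustrated with a minimal sketch. The claim does not say how the vectors are combined or compared, so the averaging in `combine` and the dot-product score in `select_response` are assumptions chosen for brevity, and every function name and sample value here is hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def combine(query: Vector, visual: Sequence[Vector], textual: Sequence[Vector]) -> Vector:
    """Assumed combination: average the query with the pooled context vectors."""
    return mean_pool([query, mean_pool(visual), mean_pool(textual)])


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_response(query_context: Vector, candidates: Sequence[Vector]) -> int:
    """Return the index of the candidate with the highest dot-product score."""
    scores = [dot(query_context, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy two-dimensional vectors standing in for learned embeddings.
query = [1.0, 0.0]
visual = [[1.0, 0.0], [0.8, 0.2]]
textual = [[0.9, 0.1]]
candidates = [[0.0, 1.0], [1.0, 0.0]]

query_context = combine(query, visual, textual)
best = select_response(query_context, candidates)  # index of the chosen candidate
```

In a trained system the vectors would come from the network layers named in the dependent claims rather than from hand-written lists.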

2. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate the visual-context vectors representing the visual features corresponding to the video segment by utilizing visual-feature layers from a query-response-neural network; and generate the textual-context vectors representing the transcript text corresponding to the video segment by utilizing transcript layers from the query-response-neural network.

3. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing a recurrent neural network and one or more attention mechanisms from posterior layers of a query-response-neural network.

4. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate a visual-context vector of the visual-context vectors by: extracting textual-feature embeddings from inner objects that correspond to detected outer objects; generating training-sample-textual-feature embeddings representing visual-feature categories for training-sample objects visible within videos; comparing the textual-feature embeddings with the training-sample-textual-feature embeddings; and based on comparing the textual-feature embeddings with the training-sample-textual-feature embeddings, generating the visual-context vector indicating a visual-feature category from among the visual-feature categories for a textual-feature embedding from the textual-feature embeddings.

5. The non-transitory computer-readable medium of claim 4, further comprising instructions that, when executed by the at least one processor, cause the computing device to: compare the textual-feature embeddings with the training-sample-textual-feature embeddings by generating similarity scores indicating a similarity between particular textual-feature embeddings and particular training-sample-textual-feature embeddings; and identify the visual-feature category from among the visual-feature categories for the textual-feature embedding based on the similarity scores.
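
The comparison described in claims 4 and 5 can be sketched as a nearest-category lookup. The claims do not name a similarity measure, so cosine similarity is an assumption here, and the category names and embedding values are invented purely for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def classify(
    embedding: Vector, category_embeddings: Dict[str, Vector]
) -> Tuple[str, Dict[str, float]]:
    """Score one textual-feature embedding against every category embedding
    and return the best-scoring category together with all scores."""
    scores = {name: cosine(embedding, ref) for name, ref in category_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical training-sample embeddings, one per visual-feature category.
categories = {"button": [1.0, 0.0, 0.0], "menu": [0.0, 1.0, 0.0]}
# Hypothetical embedding of text read from an object detected in a frame.
detected = [0.9, 0.1, 0.0]

category, similarity_scores = classify(detected, categories)
```

The chosen category could then be encoded, for example as a one-hot vector, to serve as the visual-context vector; that encoding is likewise an assumption rather than something the claims specify.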

6. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing posterior layers from a query-response-neural network to: generate a hidden-feature vector based on the textual-context vectors utilizing a recurrent neural network; generate a precursor query-context vector based on the query vector and the hidden-feature vector utilizing a temporal-attention mechanism; and generate the query-context vector based on the precursor query-context vector and the visual-context vectors utilizing a spatial-attention mechanism.
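
The ordering in claim 6 (recurrent pass, then temporal attention, then spatial attention) can be traced in a small untrained sketch. Everything below is an assumption made for illustration: the weight-free recurrence in `simple_rnn`, the dot-product attention in `attend`, the additive fusion, and the requirement that all vectors share one dimension. None of it is taken from the patent.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: Sequence[Vector]) -> Vector:
    """Dot-product attention: softmax-weighted sum of the key vectors."""
    weights = softmax([dot(query, k) for k in keys])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(len(keys[0]))]


def simple_rnn(inputs: Sequence[Vector], decay: float = 0.5) -> List[Vector]:
    """Stand-in recurrence without learned weights: h_t = tanh(decay * h_{t-1} + x_t)."""
    hidden = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        hidden = [math.tanh(decay * h + xi) for h, xi in zip(hidden, x)]
        states.append(hidden)
    return states


def build_query_context(
    query: Vector, textual: Sequence[Vector], visual: Sequence[Vector]
) -> Vector:
    hidden_states = simple_rnn(textual)                    # hidden features from transcript
    temporal = attend(query, hidden_states)                # attention over time steps
    precursor = [q + t for q, t in zip(query, temporal)]   # precursor query-context vector
    spatial = attend(precursor, visual)                    # attention over visual features
    return [p + s for p, s in zip(precursor, spatial)]     # final query-context vector


out = build_query_context(
    query=[1.0, 0.0],
    textual=[[0.5, 0.5], [1.0, 0.0]],
    visual=[[1.0, 2.0], [1.0, 2.0]],
)
```

Claim 7 describes the reverse ordering, with spatial attention applied per frame before the recurrent network; the same helpers could be rearranged to sketch that variant.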

7. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing posterior layers from a query-response-neural network to: generate a precursor query-context vector based on a subset of the visual-context vectors for a video frame of the video segment and the query vector by utilizing a spatial-attention mechanism; and generate the query-context vector based on the precursor query-context vector and a textual-context vector for the video frame from the textual-context vectors utilizing a recurrent neural network.

8. The non-transitory computer-readable medium of claim 7, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing the recurrent neural network comprising one or more gated recurrent units.
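
For readers unfamiliar with the gated recurrent units named in claim 8, the following is a minimal sketch of one GRU update step in plain Python. It shows the general technique only: biases are omitted, the parameter names (`Wz`, `Uz`, and so on) are hypothetical, and the interpolation convention used for the new state is one of two in common use. How the patent's network feeds the precursor query-context vector and the per-frame textual-context vector into the unit is not specified here; concatenating them into the input `x` would be one assumption.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def gru_step(x: Vector, h: Vector, p: Dict[str, Matrix]) -> Vector:
    """One GRU update. W* matrices act on the input, U* on the hidden state."""
    # Update gate: how much of the candidate state to let in.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    # Reset gate: how much of the old state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    # Candidate state.
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h))]
    # Interpolate between the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


# With all-zero weights each gate equals 0.5 and the candidate equals 0,
# so a single step simply halves the hidden state.
zero = [[0.0]]
params = {name: zero for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
new_h = gru_step(x=[1.0], h=[0.8], p=params)
```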

9. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to detect objects from the video segment by utilizing a detection neural network.

10. A system comprising: one or more memory devices comprising a video and a query-response-neural network; and at least one server configured to cause the system to: extract a query vector from a question corresponding to a video segment of the video utilizing question-network layers from the query-response-neural network; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment by utilizing visual-feature layers from the query-response-neural network; generating textual-context vectors representing transcript text corresponding to the video segment by utilizing transcript layers from the query-response-neural network; generate a query-context vector based on the query vector, the visual-context vectors, and the textual-context vectors by utilizing posterior layers from the query-response-neural network; generate candidate-response vectors representing candidate responses to the question utilizing response-network layers from the query-response-neural network; and select a response from the candidate responses based on a comparison of the query-context vector to the candidate-response vectors.

11. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector by utilizing a recurrent neural network and one or more attention mechanisms from the posterior layers.

12. The system of claim 10, wherein the at least one server is further configured to cause the system to generate a visual-context vector of the visual-context vectors by: extracting textual-feature embeddings from inner objects that correspond to detected outer objects; generating training-sample-textual-feature embeddings representing visual-feature categories for training-sample objects visible within videos; comparing the textual-feature embeddings with the training-sample-textual-feature embeddings; and based on comparing the textual-feature embeddings with the training-sample-textual-feature embeddings, generating the visual-context vector indicating a visual-feature category from among the visual-feature categories for a textual-feature embedding from the textual-feature embeddings.

13. The system of claim 12, wherein the at least one server is further configured to cause the system to: compare the textual-feature embeddings with the training-sample-textual-feature embeddings by generating similarity scores indicating a similarity between particular textual-feature embeddings and particular training-sample-textual-feature embeddings; and identify the visual-feature category from among the visual-feature categories for the textual-feature embedding based on the similarity scores.

14. The system of claim 13, wherein the at least one server is further configured to cause the system to identify the visual-feature category from among the visual-feature categories for the textual-feature embedding by identifying that the textual-feature embedding is associated with a similarity score satisfying a threshold similarity.

15. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector utilizing the posterior layers from the query-response-neural network by: generating a hidden-feature vector based on the textual-context vectors utilizing a recurrent neural network; generating a precursor query-context vector based on the query vector and the hidden-feature vector utilizing a temporal-attention mechanism; and generating the query-context vector based on the precursor query-context vector and the visual-context vectors utilizing a spatial-attention mechanism.

16. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector utilizing the posterior layers from the query-response-neural network by: generating a precursor query-context vector based on a subset of visual-context vectors for a video frame of the video segment and the query vector by utilizing a spatial-attention mechanism; and generating the query-context vector based on the precursor query-context vector and a textual-context vector for the video frame from the textual-context vectors utilizing a recurrent neural network.

17. The system of claim 10, wherein the at least one server is further configured to cause the system to detect objects from the video segment by utilizing a detection neural network from the visual-feature layers.

18. A computer-implemented method comprising: extracting a query vector from a question corresponding to a video segment by utilizing question-network layers from a query-response-neural network; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment; performing a step for combining the query vector, the visual-context vectors, and the textual-context vectors from the video segment to form a query-context vector; generating candidate-response vectors representing candidate responses to the question utilizing response-network layers from the query-response-neural network; and selecting a response from the candidate responses based on a comparison of the query-context vector to the candidate-response vectors.

19. The computer-implemented method of claim 18, wherein selecting the response from the candidate responses comprises: generating a matching score for each query-response pairing between the query-context vector and a respective candidate-response vector from the candidate-response vectors; and selecting the response based on a particular query-response pairing having a particular matching score satisfying a threshold matching score.
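
The threshold test in claim 19 can be sketched as follows. The scoring function is not given in the claim, so this example assumes a sigmoid of the dot product, which yields a score between 0 and 1; the candidate texts, vectors, and threshold values are all hypothetical.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def matching_score(query_context: Vector, candidate: Vector) -> float:
    """Assumed score: sigmoid of the dot product, in the open interval (0, 1)."""
    raw = sum(x * y for x, y in zip(query_context, candidate))
    return 1.0 / (1.0 + math.exp(-raw))


def select_by_threshold(
    query_context: Vector, candidates: Dict[str, Vector], threshold: float
) -> Optional[str]:
    """Return the best-scoring response if its score meets the threshold, else None."""
    scores = {text: matching_score(query_context, vec) for text, vec in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


query_context = [2.0, 0.0]
candidates = {
    "Click the Layers panel.": [1.5, 0.0],
    "Open the File menu.": [-1.0, 0.5],
}

accepted = select_by_threshold(query_context, candidates, threshold=0.5)
rejected = select_by_threshold(query_context, candidates, threshold=0.99)
```

Returning `None` when no pairing meets the threshold is a design choice of this sketch; the claim only requires that the selected pairing satisfy the threshold.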

20. The computer-implemented method of claim 18, wherein: generating the visual-context vectors comprises utilizing visual-feature layers from the query-response-neural network; and generating the textual-context vectors comprises utilizing transcript layers from the query-response-neural network.

Patent Metadata

Filing Date

Unknown

Publication Date

February 8, 2022

Inventors

Wentian Zhao
Seokhwan Kim
Ning Xu
Hailin Jin

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “GENERATING A RESPONSE TO A USER QUERY UTILIZING VISUAL FEATURES OF A VIDEO SEGMENT AND A QUERY-RESPONSE-NEURAL NETWORK” (11244167). https://patentable.app/patents/11244167
