Generating a Response to a User Query Utilizing Visual Features of a Video Segment and a Query-Response-Neural Network

Published: February 8, 2022

Assignee: not available in the USPTO data

Inventors: Wentian Zhao, Seokhwan Kim, Ning Xu, Hailin Jin

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: extract a query vector from a question corresponding to a video segment; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment; generate a query-context vector by combining the query vector, the visual-context vectors, and the textual-context vectors; generate candidate-response vectors representing candidate responses to the question; and select a response from the candidate responses by comparing the query-context vector to the candidate-response vectors.
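As an illustration only, the flow recited in claim 1 can be sketched in a few lines of Python. The claim does not say how the vectors are combined or compared; the averaging step, the cosine comparison, the function names, and the sample vectors below are all assumptions made for this sketch.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    # Element-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_query_context(query_vec, visual_vecs, textual_vecs):
    # Assumed stand-in for "combining": average the query vector with the
    # pooled visual-context and textual-context vectors.
    return mean_vector([query_vec, mean_vector(visual_vecs), mean_vector(textual_vecs)])

def select_response(query_context, candidates):
    # candidates maps response text to its candidate-response vector.
    return max(candidates, key=lambda r: cosine(query_context, candidates[r]))

# Hypothetical toy data, not taken from the patent.
query_vec = [1.0, 0.0, 0.0]
visual_vecs = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]]
textual_vecs = [[0.9, 0.0, 0.1]]
candidates = {
    "Click the crop tool.": [1.0, 0.1, 0.0],
    "Open the layers panel.": [0.0, 1.0, 0.0],
}

query_context = build_query_context(query_vec, visual_vecs, textual_vecs)
best = select_response(query_context, candidates)
```

In the claimed system the vectors would come from trained network layers rather than hand-written lists; the sketch only shows the data flow from question and context to a selected response.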

2. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate the visual-context vectors representing the visual features corresponding to the video segment by utilizing visual-feature layers from a query-response-neural network; and generate the textual-context vectors representing the transcript text corresponding to the video segment by utilizing transcript layers from the query-response-neural network.

3. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing a recurrent neural network and one or more attention mechanisms from posterior layers of a query-response-neural network.

4. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate a visual-context vector of the visual-context vectors by: extracting textual-feature embeddings from inner objects that correspond to detected outer objects; generating training-sample-textual-feature embeddings representing visual-feature categories for training-sample objects visible within videos; comparing the textual-feature embeddings with the training-sample-textual-feature embeddings; and based on comparing the textual-feature embeddings with the training-sample-textual-feature embeddings, generating the visual-context vector indicating a visual-feature category from among the visual-feature categories for a textual-feature embedding from the textual-feature embeddings.

5. The non-transitory computer-readable medium of claim 4, further comprising instructions that, when executed by the at least one processor, cause the computing device to: compare the textual-feature embeddings with the training-sample-textual-feature embeddings by generating similarity scores indicating a similarity between particular textual-feature embeddings and particular training-sample-textual-feature embeddings; and identify the visual-feature category from among the visual-feature categories for the textual-feature embedding based on the similarity scores.
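The comparison described in claims 4 and 5 can be pictured as a nearest-category lookup. The sketch below is a rough illustration: cosine similarity as the similarity score, the function name `categorize`, and the category names and embeddings are all assumptions, since the claims do not fix any of them.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def categorize(embedding, category_embeddings):
    # Score the embedding against every training-sample embedding and
    # return the best-scoring visual-feature category with all scores.
    scores = {name: cosine(embedding, vec) for name, vec in category_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical training-sample embeddings, one per visual-feature category.
category_embeddings = {"button": [1.0, 0.0], "menu": [0.0, 1.0]}

# Hypothetical textual-feature embedding extracted from a detected object.
category, scores = categorize([0.9, 0.2], category_embeddings)
```

The returned category would then be encoded into the visual-context vector; how that encoding is done is left open here.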

6. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing posterior layers from a query-response-neural network to: generate a hidden-feature vector based on the textual-context vectors utilizing a recurrent neural network; generate a precursor query-context vector based on the query vector and the hidden-feature vector utilizing a temporal-attention mechanism; and generate the query-context vector based on the precursor query-context vector and the visual-context vectors utilizing a spatial-attention mechanism.
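One possible reading of the ordering in claim 6 (recurrent network, then temporal attention, then spatial attention) is sketched below. Everything concrete is an assumption made for illustration: the untrained element-wise recurrent update, dot-product attention with a softmax, the additive fusion at the end, and all names and sample values.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(key, values):
    # Dot-product attention: weight each value by its similarity to the key.
    weights = softmax([dot(key, v) for v in values])
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights

def simple_rnn(inputs):
    # Placeholder recurrent update with fixed 0.5 weights (not trained).
    h = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        h = [math.tanh(0.5 * xi + 0.5 * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states

def query_context(query_vec, textual_vecs, visual_vecs):
    hidden = simple_rnn(textual_vecs)                             # hidden features over time
    precursor, temporal_w = attend(query_vec, hidden)             # temporal attention
    attended_visual, spatial_w = attend(precursor, visual_vecs)   # spatial attention
    context = [p + v for p, v in zip(precursor, attended_visual)] # assumed fusion
    return context, temporal_w, spatial_w

# Hypothetical toy inputs.
textual_vecs = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8]]
visual_vecs = [[1.0, 0.0], [0.0, 1.0]]
context, temporal_w, spatial_w = query_context([1.0, 0.5], textual_vecs, visual_vecs)
```

The temporal weights range over transcript time steps and the spatial weights over visual-context vectors, which is the distinction the claim draws between the two attention mechanisms.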

7. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing posterior layers from a query-response-neural network to: generate a precursor query-context vector based on a subset of the visual-context vectors for a video frame of the video segment and the query vector by utilizing a spatial-attention mechanism; and generate the query-context vector based on the precursor query-context vector and a textual-context vector for the video frame from the textual-context vectors utilizing a recurrent neural network.

8. The non-transitory computer-readable medium of claim 7, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing the recurrent neural network comprising one or more gated recurrent units.
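Claims 7 and 8 reverse the order: spatial attention is applied per frame first, and a gated recurrent unit then carries state across frames. The sketch below is a simplified illustration under stated assumptions: the GRU uses scalar placeholder weights applied element-wise rather than trained matrices, the per-frame precursor and textual vector are fused by addition, and all names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(query_vec, region_vecs):
    # Weight each visual-context vector of one frame by its dot product
    # with the query, then pool into a precursor query-context vector.
    weights = softmax([sum(q * r for q, r in zip(query_vec, reg)) for reg in region_vecs])
    return [sum(w * reg[i] for w, reg in zip(weights, region_vecs))
            for i in range(len(query_vec))]

def gru_step(x, h, wz=1.0, uz=1.0, wr=1.0, ur=1.0, wh=1.0, uh=1.0):
    # Element-wise GRU update with scalar placeholder weights (not trained).
    new_h = []
    for xi, hi in zip(x, h):
        z = sigmoid(wz * xi + uz * hi)           # update gate
        r = sigmoid(wr * xi + ur * hi)           # reset gate
        cand = math.tanh(wh * xi + uh * r * hi)  # candidate state
        new_h.append((1.0 - z) * hi + z * cand)
    return new_h

def query_context_over_frames(query_vec, frames):
    # frames: list of (region_vecs, textual_vec) pairs, one per video frame.
    h = [0.0] * len(query_vec)
    for region_vecs, textual_vec in frames:
        precursor = spatial_attention(query_vec, region_vecs)
        x = [p + t for p, t in zip(precursor, textual_vec)]  # assumed fusion
        h = gru_step(x, h)
    return h

# Hypothetical two-frame segment.
frames = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]),
    ([[0.2, 0.8], [0.9, 0.1]], [0.1, 0.3]),
]
context = query_context_over_frames([1.0, 0.0], frames)
```

The final hidden state plays the role of the query-context vector in this sketch; a trained network would learn the gate weights instead of using constants.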

9. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to detect objects from the video segment by utilizing a detection neural network.

10. A system comprising: one or more memory devices comprising a video and a query-response-neural network; and at least one server configured to cause the system to: extract a query vector from a question corresponding to a video segment of the video utilizing question-network layers from the query-response-neural network; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment by utilizing visual-feature layers from the query-response-neural network; generating textual-context vectors representing transcript text corresponding to the video segment by utilizing transcript layers from the query-response-neural network; generate a query-context vector based on the query vector, the visual-context vectors, and the textual-context vectors by utilizing posterior layers from the query-response-neural network; generate candidate-response vectors representing candidate responses to the question utilizing response-network layers from the query-response-neural network; and select a response from the candidate responses based on a comparison of the query-context vector to the candidate-response vectors.

11. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector by utilizing a recurrent neural network and one or more attention mechanisms from the posterior layers.

12. The system of claim 10, wherein the at least one server is further configured to cause the system to generate a visual-context vector of the visual-context vectors by: extracting textual-feature embeddings from inner objects that correspond to detected outer objects; generating training-sample-textual-feature embeddings representing visual-feature categories for training-sample objects visible within videos; comparing the textual-feature embeddings with the training-sample-textual-feature embeddings; and based on comparing the textual-feature embeddings with the training-sample-textual-feature embeddings, generating the visual-context vector indicating a visual-feature category from among the visual-feature categories for a textual-feature embedding from the textual-feature embeddings.

13. The system of claim 12, wherein the at least one server is further configured to cause the system to: compare the textual-feature embeddings with the training-sample-textual-feature embeddings by generating similarity scores indicating a similarity between particular textual-feature embeddings and particular training-sample-textual-feature embeddings; and identify the visual-feature category from among the visual-feature categories for the textual-feature embedding based on the similarity scores.

14. The system of claim 13, wherein the at least one server is further configured to cause the system to identify the visual-feature category from among the visual-feature categories for the textual-feature embedding by identifying that the textual-feature embedding is associated with a similarity score satisfying a threshold similarity.

15. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector utilizing the posterior layers from the query-response-neural network by: generating a hidden-feature vector based on the textual-context vectors utilizing a recurrent neural network; generating a precursor query-context vector based on the query vector and the hidden-feature vector utilizing a temporal-attention mechanism; and generating the query-context vector based on the precursor query-context vector and the visual-context vectors utilizing a spatial-attention mechanism.

16. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector utilizing the posterior layers from the query-response-neural network by: generating a precursor query-context vector based on a subset of visual-context vectors for a video frame of the video segment and the query vector by utilizing a spatial-attention mechanism; and generating the query-context vector based on the precursor query-context vector and a textual-context vector for the video frame from the textual-context vectors utilizing a recurrent neural network.

17. The system of claim 10, wherein the at least one server is further configured to cause the system to detect objects from the video segment by utilizing a detection neural network from the visual-feature layers.

18. A computer-implemented method comprising: extracting a query vector from a question corresponding to a video segment by utilizing question-network layers from a query-response-neural network; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment; performing a step for combining the query vector, the visual-context vectors, and the textual-context vectors from the video segment to form a query-context vector; generating candidate-response vectors representing candidate responses to the question utilizing response-network layers from the query-response-neural network; and selecting a response from the candidate responses based on a comparison of the query-context vector to the candidate-response vectors.

19. The computer-implemented method of claim 18, wherein selecting the response from the candidate responses comprises: generating a matching score for each query-response pairing between the query-context vector and a respective candidate-response vector from the candidate-response vectors; and selecting the response based on a particular query-response pairing having a particular matching score satisfying a threshold matching score.
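The thresholded selection in claim 19 might look like the following sketch. The scoring function (a sigmoid over a dot product), the default threshold of 0.5, and the candidate data are assumptions chosen for illustration; the claim only requires some matching score per pairing and some threshold.

```python
import math

def matching_score(query_context, candidate_vec):
    # Assumed score: sigmoid of the dot product, giving a value in (0, 1).
    d = sum(a * b for a, b in zip(query_context, candidate_vec))
    return 1.0 / (1.0 + math.exp(-d))

def select_response(query_context, candidates, threshold=0.5):
    # Score every query-response pairing, then return the top response
    # only if its score satisfies the threshold; otherwise return None.
    scored = [(matching_score(query_context, vec), text) for text, vec in candidates.items()]
    best_score, best_text = max(scored)
    return best_text if best_score >= threshold else None

# Hypothetical query-context vector and candidate-response vectors.
query_context = [1.0, -0.5]
candidates = {
    "Use the lasso tool.": [2.0, 0.0],
    "Export as PNG.": [-1.0, 1.0],
}

chosen = select_response(query_context, candidates)
strict = select_response(query_context, candidates, threshold=0.95)
```

Returning `None` when no pairing clears the threshold is a design choice of this sketch; the claim does not state what happens in that case.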

20. The computer-implemented method of claim 18, wherein: generating the visual-context vectors comprises utilizing visual-feature layers from the query-response-neural network; and generating the textual-context vectors comprises utilizing transcript layers from the query-response-neural network.
