Legal claims defining the scope of protection, as filed with the USPTO.
1. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: extract a query vector from a question corresponding to a video segment; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment; generate a query-context vector by combining the query vector, the visual-context vectors, and the textual-context vectors; generate candidate-response vectors representing candidate responses to the question; and select a response from the candidate responses by comparing the query-context vector to the candidate-response vectors.
2. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate the visual-context vectors representing the visual features corresponding to the video segment by utilizing visual-feature layers from a query-response-neural network; and generate the textual-context vectors representing the transcript text corresponding to the video segment by utilizing transcript layers from the query-response-neural network.
3. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing a recurrent neural network and one or more attention mechanisms from posterior layers of a query-response-neural network.
4. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate a visual-context vector of the visual-context vectors by: extracting textual-feature embeddings from inner objects that correspond to detected outer objects; generating training-sample-textual-feature embeddings representing visual-feature categories for training-sample objects visible within videos; comparing the textual-feature embeddings with the training-sample-textual-feature embeddings; and based on comparing the textual-feature embeddings with the training-sample-textual-feature embeddings, generating the visual-context vector indicating a visual-feature category from among the visual-feature categories for a textual-feature embedding from the textual-feature embeddings.
5. The non-transitory computer-readable medium of claim 4, further comprising instructions that, when executed by the at least one processor, cause the computing device to: compare the textual-feature embeddings with the training-sample-textual-feature embeddings by generating similarity scores indicating a similarity between particular textual-feature embeddings and particular training-sample-textual-feature embeddings; and identify the visual-feature category from among the visual-feature categories for the textual-feature embedding based on the similarity scores.
6. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing posterior layers from a query-response-neural network to: generate a hidden-feature vector based on the textual-context vectors utilizing a recurrent neural network; generate a precursor query-context vector based on the query vector and the hidden-feature vector utilizing a temporal-attention mechanism; and generate the query-context vector based on the precursor query-context vector and the visual-context vectors utilizing a spatial-attention mechanism.
7. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing posterior layers from a query-response-neural network to: generate a precursor query-context vector based on a subset of the visual-context vectors for a video frame of the video segment and the query vector by utilizing a spatial-attention mechanism; and generate the query-context vector based on the precursor query-context vector and a textual-context vector for the video frame from the textual-context vectors utilizing a recurrent neural network.
8. The non-transitory computer-readable medium of claim 7, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate the query-context vector by utilizing the recurrent neural network comprising one or more gated recurrent units.
9. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computing device to detect objects from the video segment by utilizing a detection neural network.
10. A system comprising: one or more memory devices comprising a video and a query-response-neural network; and at least one server configured to cause the system to: extract a query vector from a question corresponding to a video segment of the video utilizing question-network layers from the query-response-neural network; extract multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment by utilizing visual-feature layers from the query-response-neural network; generating textual-context vectors representing transcript text corresponding to the video segment by utilizing transcript layers from the query-response-neural network; generate a query-context vector based on the query vector, the visual-context vectors, and the textual-context vectors by utilizing posterior layers from the query-response-neural network; generate candidate-response vectors representing candidate responses to the question utilizing response-network layers from the query-response-neural network; and select a response from the candidate responses based on a comparison of the query-context vector to the candidate-response vectors.
11. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector by utilizing a recurrent neural network and one or more attention mechanisms from the posterior layers.
12. The system of claim 10, wherein the at least one server is further configured to cause the system to generate a visual-context vector of the visual-context vectors by: extracting textual-feature embeddings from inner objects that correspond to detected outer objects; generating training-sample-textual-feature embeddings representing visual-feature categories for training-sample objects visible within videos; comparing the textual-feature embeddings with the training-sample-textual-feature embeddings; and based on comparing the textual-feature embeddings with the training-sample-textual-feature embeddings, generating the visual-context vector indicating a visual-feature category from among the visual-feature categories for a textual-feature embedding from the textual-feature embeddings.
13. The system of claim 12, wherein the at least one server is further configured to cause the system to: compare the textual-feature embeddings with the training-sample-textual-feature embeddings by generating similarity scores indicating a similarity between particular textual-feature embeddings and particular training-sample-textual-feature embeddings; and identify the visual-feature category from among the visual-feature categories for the textual-feature embedding based on the similarity scores.
14. The system of claim 13, wherein the at least one server is further configured to cause the system to identify the visual-feature category from among the visual-feature categories for the textual-feature embedding by identifying that the textual-feature embedding is associated with a similarity score satisfying a threshold similarity.
15. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector utilizing the posterior layers from the query-response-neural network by: generating a hidden-feature vector based on the textual-context vectors utilizing a recurrent neural network; generating a precursor query-context vector based on the query vector and the hidden-feature vector utilizing a temporal-attention mechanism; and generating the query-context vector based on the precursor query-context vector and the visual-context vectors utilizing a spatial-attention mechanism.
16. The system of claim 10, wherein the at least one server is further configured to cause the system to generate the query-context vector utilizing the posterior layers from the query-response-neural network by: generating a precursor query-context vector based on a subset of visual-context vectors for a video frame of the video segment and the query vector by utilizing a spatial-attention mechanism; and generating the query-context vector based on the precursor query-context vector and a textual-context vector for the video frame from the textual-context vectors utilizing a recurrent neural network.
17. The system of claim 10, wherein the at least one server is further configured to cause the system to detect objects from the video segment by utilizing a detection neural network from the visual-feature layers.
18. A computer-implemented method comprising: extracting a query vector from a question corresponding to a video segment by utilizing question-network layers from a query-response-neural network; extracting multiple contextual modalities from the video segment by: generating visual-context vectors representing visual features corresponding to the video segment; and generating textual-context vectors representing transcript text corresponding to the video segment; performing a step for combining the query vector, the visual-context vectors, and the textual-context vectors from the video segment to form a query-context vector; generating candidate-response vectors representing candidate responses to the question utilizing response-network layers from the query-response-neural network; and selecting a response from the candidate responses based on a comparison of the query-context vector to the candidate-response vectors.
19. The computer-implemented method of claim 18, wherein selecting the response from the candidate responses comprises: generating a matching score for each query-response pairing between the query-context vector and a respective candidate-response vector from the candidate-response vectors; and selecting the response based on a particular query-response pairing having a particular matching score satisfying a threshold matching score.
20. The computer-implemented method of claim 18, wherein: generating the visual-context vectors comprises utilizing visual-feature layers from the query-response-neural network; and generating the textual-context vectors comprises utilizing transcript layers from the query-response-neural network.
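As an informal illustration only, and not part of the claims, the flow recited in claims 1, 18, and 19 (combine a query vector with visual-context and textual-context vectors, score each candidate-response vector against the result, and select a response whose matching score satisfies a threshold) might be sketched as follows. Everything in this sketch is an assumption made for readability: the claims recite a recurrent neural network and attention mechanisms for the combining step, whereas this sketch substitutes mean pooling and concatenation; cosine similarity stands in for the unspecified matching score; and the names `combine`, `select_response`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as an assumed matching score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def combine(query, visual_context, textual_context):
    """Hypothetical stand-in for the combining step.

    The claims recite recurrent and attention layers; this sketch simply
    concatenates the query vector with mean-pooled context vectors.
    """
    return list(query) + mean_pool(visual_context) + mean_pool(textual_context)


def select_response(query_context, candidates, threshold=0.5):
    """Score every candidate-response vector and pick the best one.

    `candidates` maps a response string to its candidate-response vector.
    Returns (response or None, scores); None means that no matching score
    satisfied the assumed threshold.
    """
    scores = {text: cosine(query_context, vec) for text, vec in candidates.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores


# Toy vectors, chosen by hand purely for illustration.
query_context = combine(
    query=[1.0, 0.0],
    visual_context=[[0.0, 1.0], [0.0, 1.0]],
    textual_context=[[1.0, 0.0]],
)
candidates = {
    "a red car": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    "a blue boat": [0.0, 1.0, 1.0, 0.0, 0.0, 1.0],
}
selected, scores = select_response(query_context, candidates)
```

In this toy setup the first candidate points in the same direction as the query-context vector and is selected, while a threshold that no score can reach yields no selection.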
February 8, 2022