The present disclosure relates to speech recognition systems and methods that enable personalized vocal user interfaces. More specifically, the present disclosure relates to combining a self-learning speech recognition system based on semantics with a speech-to-text system optionally integrated with a natural language processing system. The combined system has the advantage of automatically and continually training the semantics-based speech recognition system and increasing recognition accuracy.
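By way of illustration only, the combination described above can be sketched as a small decision routine. The sketch assumes that each system returns a prediction with a confidence score in [0, 1]. The names `Prediction`, `fuse`, `text_to_intent`, and the threshold value are hypothetical and do not appear in the disclosure, and the selection rule shown is one simple possibility rather than the disclosed fusion method.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Prediction:
    value: str         # predicted intent (STI) or predicted text (ASR)
    confidence: float  # assumed to lie in [0, 1]


def fuse(
    sti: Prediction,
    asr: Prediction,
    text_to_intent: Callable[[str], Optional[str]],
    threshold: float = 0.6,  # hypothetical cut-off, not from the disclosure
) -> Tuple[Optional[str], Optional[str]]:
    """Return (chosen_intent, feedback_label).

    feedback_label is an intent derived from the ASR text that could be
    used as a training label for the STI system; None means no feedback.
    """
    # Map the transcribed text to an intent, if the text is understood.
    asr_intent = text_to_intent(asr.value)

    # Both systems agree: accept the intent, no corrective feedback needed.
    if asr_intent is not None and asr_intent == sti.value:
        return sti.value, None

    # The STI system is confident and at least as confident as the ASR system.
    if sti.confidence >= threshold and sti.confidence >= asr.confidence:
        return sti.value, None

    # Otherwise trust a confident ASR-derived intent and feed it back.
    if asr_intent is not None and asr.confidence >= threshold:
        return asr_intent, asr_intent

    # Neither path yields a usable intent.
    return None, None


# Hypothetical text-to-intent lookup standing in for an NLP component.
rules = {"turn on the light": "light_on"}

agree = fuse(Prediction("light_on", 0.9), Prediction("turn on the light", 0.8), rules.get)
corrected = fuse(Prediction("light_off", 0.3), Prediction("turn on the light", 0.9), rules.get)
unknown = fuse(Prediction("light_off", 0.2), Prediction("hello there", 0.9), rules.get)
```

In the second call the ASR-derived intent overrides a low-confidence STI prediction and is also returned as a feedback label, which mirrors the idea of using the speech-to-text path to keep training the semantics-based recognizer.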
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for performing speech recognition, the method comprising: obtaining an input acoustic signal; providing the input acoustic signal to each of: a text-independent speech-to-intent (STI) system to determine a predicted intent; and a speech-to-text automatic speech recognition (ASR) system to determine predicted text; and using the predicted intent, the predicted text, and prediction confidence scores corresponding to each of the text-independent STI and ASR systems to map the acoustic signal to a desired user intent or action.
2. The method of claim 1, further comprising generating a semantic representation and providing the semantic representation as feedback for subsequent training of the text-independent STI system.
3. The method of claim 2, further comprising providing a text output when no semantic representation can be derived.
4. The method of claim 1, further comprising determining and using a context of the system when the acoustic signal is received from a user.
5. The method of claim 1, further comprising performing the desired user intent or action when the mapping is successful.
6. The method of claim 1, further comprising providing a transcription to another application when the transcription is intended by the user.
7. The method of claim 1, further comprising using a decision fusion matrix to integrate the predicted intent and the predicted text, to output a most likely semantic output.
8. The method of claim 7, wherein the context includes any one or more of: an identity of the speaker, a previous conversation history, a state of the system, a time of day, a state and history of one or more connected devices or applications, background noise, a state and history of one or more connected sensors, a speed of a vehicle.
9. The method of claim 7, further comprising using a feedback loop to enroll one or more new commands into the text-independent STI system based on the output of the ASR system and the decision matrix.
10. The method of claim 2, wherein the semantic representations are generated from automatically analyzing outputs of the ASR system, associated with the input acoustic signal.
11. The method of claim 1, wherein the input acoustic signal is a voice signal.
12. The method of claim 1, wherein the text-independent STI system is configured for decoding the input acoustic signal into useful semantic representations using one or more of non-negative matrix factorization (NMF), deep neural networks (DNN), recurrent neural networks (RNN) including long-short term memory (LSTM) or gated recurrent units (GRU), convolutional neural networks (CNN), hidden Markov models (HMM), histogram of acoustic co-occurrences (HAC), or auto-encoders (AE).
13. The method of claim 1, wherein the ASR module is configured for decoding the input acoustic signal into useful text representations using one or more of non-negative matrix factorization (NMF), deep neural networks (DNN), recurrent neural networks (RNN) including long-short term memory (LSTM) or gated recurrent units (GRU), convolutional neural networks (CNN), hidden Markov models (HMM), natural language processing (NLP), natural language understanding (NLU), and auto-encoders (AE).
14. The method of claim 1, further comprising using semantic concepts corresponding to relevant semantics that a user refers to when controlling or addressing a device or object by voice using a vocal user interface (VUI).
15. The method of claim 1, further comprising learning new synonyms referring to same actions, or new acoustic words corresponding to new actions or intents, and using the new synonyms or new acoustic words to adapt a model, a library, or both the model and the library.
16. The method of claim 2, wherein the semantic representations are generated from user actions performed on an alternate non-vocal user interface.
17. The method of claim 16, wherein the alternative non-vocal user interface includes any one or more of buttons, a touchscreen, a keyboard, a mouse with associated graphical user interface (GUI).
18. The method of claim 2, wherein the semantic representations are predefined and a vector is composed in which entries represent a presence or absence in the input acoustic signal referring to one of the predefined semantic representations.
19. The method of claim 18, wherein the vector is a fixed length vector.
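As a non-limiting illustration of the fixed-length vector recited in claims 18 and 19, the sketch below composes a binary vector over a predefined set of semantic representations. The concept list and the function name `semantic_vector` are hypothetical and chosen only for the example; the claims do not specify the concepts or the encoding values.

```python
# Hypothetical predefined semantic representations; their order fixes the vector layout.
SEMANTIC_CONCEPTS = ["light", "on", "off", "kitchen", "bedroom"]


def semantic_vector(present, concepts=SEMANTIC_CONCEPTS):
    """Return a fixed-length list with 1 where a concept is present, else 0."""
    present = set(present)
    unknown = present - set(concepts)
    if unknown:
        # Only predefined concepts can be represented in the vector.
        raise ValueError(f"not predefined: {sorted(unknown)}")
    return [1 if concept in present else 0 for concept in concepts]


# An utterance such as "switch on the kitchen light" might refer to these concepts.
vec = semantic_vector({"light", "on", "kitchen"})
```

The vector length equals the number of predefined concepts regardless of the utterance, so every input yields a vector of the same size.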
December 1, 2015
December 29, 2020