Methods and Apparatus for Reducing Latency in Speech Recognition Applications

Published May 30, 2017

Assignee: not available in the USPTO data we have

Technical Abstract

Patent Claims

19 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computing device including a speech-enabled application installed thereon, the computing device comprising: an input interface configured to receive first audio comprising speech from a user of the computing device; an automatic speech recognition (ASR) engine configured to: detect based, at least in part, on a threshold time for endpointing, an end of speech in the first audio; and generate a first ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech; and at least one processor programmed to: determine whether a valid action can be performed by the speech-enabled application using the first ASR result; instruct the ASR engine to process second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the first ASR result; create a first hint based, at least in part, on the first ASR result, wherein the first hint prompts the user for speech input corresponding to a valid action that can be performed by the speech-enabled application; and present the first hint via a user interface of the computing device.

2. The computing device of claim 1, wherein determining whether a valid action can be performed by the speech-enabled application using the first ASR result is based, at least in part, on a natural language understanding (NLU) result generated using the first ASR result.

3. The computing device of claim 2, wherein the processor is further programmed to submit the NLU result to the speech-enabled application, and wherein determining whether a valid action can be performed by the speech-enabled application using the first ASR result comprises receiving an indication from the speech-enabled application that a valid action cannot be performed in response to submitting the NLU result to the speech-enabled application.

4. The computing device of claim 1, wherein the input interface is further configured to receive the second audio, wherein the second audio includes audio recorded after the detected end of speech in the first audio, and wherein the ASR engine is further configured to process the second audio.

5. The computing device of claim 4, wherein processing the second audio comprises: determining whether the second audio includes speech; and generating a second ASR result based, at least in part, on at least a portion of the second audio in response to determining that the second audio comprises speech.

6. The computing device of claim 5, wherein generating the second ASR result comprises generating the second ASR result based, at least in part, on at least a portion of the first audio and at least a portion of the second audio.

7. The computing device of claim 5, wherein the at least one processor is further programmed to: determine whether a valid action can be performed by the speech-enabled application using a natural language understanding (NLU) result generated based, at least in part, on at least a portion of the first ASR result and at least a portion of the second ASR result; and instruct the speech-enabled application to perform the valid action in response to determining that the valid action can be performed using the NLU result.

8. The computing device of claim 1, further comprising: at least one storage device configured to store one or more prefixes, each of which is associated with a corresponding threshold time for endpointing; and wherein determining whether a valid action can be performed by the speech-enabled application comprises determining whether the speech in the first audio includes a prefix of the one or more prefixes stored on the at least one storage device.

9. The computing device of claim 8, wherein the ASR engine is further configured to: process a plurality of time segments of the first audio prior to detecting the end of speech in the first audio, and wherein determining whether the speech in the first audio includes a prefix stored on the at least one storage device comprises comparing output of the ASR engine determined based on the processed plurality of time segments to the one or more prefixes stored on the at least one storage device.

10. The computing device of claim 8, wherein the at least one processor is further programmed to: update the threshold time used by the ASR engine for endpointing in response to determining that the speech in the first audio includes a prefix stored on the at least one storage device, wherein updating the threshold time comprises instructing the ASR engine to use the threshold time for endpointing associated with the prefix stored on the at least one storage device identified in the speech in the first audio to detect an end of speech in the first audio.

11. The computing device of claim 1, wherein the input interface is further configured to receive the second audio, wherein the ASR engine is further configured to process the second audio to generate a second ASR result, and wherein the at least one processor is further programmed to: create a second hint based, at least in part, on the first ASR result, the second ASR result, or the first ASR result and the second ASR result, wherein the second hint prompts the user for speech input corresponding to a valid action that can be performed by the speech-enabled application; and present the second hint via the user interface of the computing device.

12. The computing device of claim 1, wherein presenting the first hint via the user interface comprises visually displaying the first hint on the user interface, and wherein the first hint hints of additional information to supplement the first audio to perform the valid action.

13. The computing device of claim 1, wherein the input interface is further configured to receive the second audio, wherein the ASR engine is further configured to process the second audio, wherein processing the second audio comprises performing ASR processing on the second audio based, at least in part, on information included in the first hint.
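
As an informal illustration of claims 8 to 10 (not part of the patent text), the following Python sketch shows one way a stored prefix could select the endpointing threshold. The class name `Endpointer`, the example prefixes, and every duration are hypothetical; the claims specify none of them. The sketch also assumes that a prefix is matched at the start of the partial ASR output, which is narrower than the claim wording "includes a prefix".

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical default: the claims do not give any threshold value.
DEFAULT_THRESHOLD_S = 0.5


@dataclass
class Endpointer:
    """Toy endpointer whose silence threshold depends on a recognized prefix."""

    # Stored prefixes, each with its own endpointing threshold in seconds.
    prefix_thresholds: Dict[str, float]
    threshold_s: float = DEFAULT_THRESHOLD_S

    def update_for_partial(self, partial_text: str) -> float:
        """Compare partial ASR output to the stored prefixes (claim 9) and,
        on a match, switch to that prefix's threshold (claim 10).
        Assumption: the longest prefix matching the start of the text wins."""
        text = partial_text.lower().strip()
        matches = [p for p in self.prefix_thresholds if text.startswith(p)]
        if matches:
            self.threshold_s = self.prefix_thresholds[max(matches, key=len)]
        return self.threshold_s

    def is_end_of_speech(self, trailing_silence_s: float) -> bool:
        """Declare end of speech once silence reaches the current threshold."""
        return trailing_silence_s >= self.threshold_s


# Usage: after "call" is heard, the endpointer waits longer for a name.
ep = Endpointer({"call": 1.5, "send a message to": 2.0})
ep.update_for_partial("Call")
print(ep.is_end_of_speech(0.6))  # False: 0.6 s is below the 1.5 s threshold
```

The longer threshold after a recognized prefix gives the user time to finish a command that would otherwise be cut off, while utterances without a stored prefix keep the shorter default.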

14. A computing device including a speech-enabled application installed thereon, the computing device comprising: at least one storage device configured to store at least one data structure including information describing a plurality of natural language understanding (NLU) results and corresponding ASR output used to generate the plurality of NLU results; an input interface configured to receive first audio comprising speech from a user of the computing device; an automatic speech recognition (ASR) engine configured to: detect based, at least in part, on a threshold time for endpointing, an end of speech in the first audio; and generate a first ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech; and at least one processor programmed to: determine whether a valid action can be performed by the speech-enabled application using the first ASR result; instruct the ASR engine to process second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the first ASR result; determine whether to add the first ASR result and a corresponding NLU result generated using the first ASR result to the at least one data structure stored on the at least one storage device; and add the first ASR result and the corresponding NLU result generated using the first ASR result to the at least one data structure stored on the at least one storage device in response to determining that the first ASR result and the corresponding NLU result should be added.

15. The computing device of claim 14, wherein determining whether to add the first ASR result and the corresponding NLU result generated using the first ASR result to the at least one data structure comprises: determining a number of times the corresponding NLU result has been received by the computing device from an NLU engine remotely located from the computing device; and determining that the first ASR result and the corresponding NLU result should be added to the at least one data structure when the number of times the corresponding NLU result has been received by the computing device exceeds a threshold value.

16. The computing device of claim 14, wherein the input interface is further configured to receive third audio including speech from the user of the computing device, wherein the ASR engine is further configured to generate a second ASR result based, at least in part, on at least a portion of the third audio, and wherein the processor is further programmed to: identify an ASR output stored in the at least one data structure corresponding to the second ASR result; and submit the NLU result corresponding to the identified ASR output stored in the at least one data structure to the speech-enabled application to enable the speech-enabled application to perform an action based on the submitted NLU result.

17. The computing device of claim 16, wherein the at least one processor is programmed to submit the NLU result corresponding to the identified ASR output stored in the at least one data structure to the at least one data structure without sending a request for remote NLU processing of the third audio to an NLU engine remotely located from the computing device.
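
As an informal illustration of claims 14 to 17 (not part of the patent text), the sketch below models the stored data structure as a local cache that maps ASR output to an NLU result. An entry is added only after the same NLU result has come back from the remote engine more than a threshold number of times, and later matching utterances are answered from the cache without a remote request. The names `LocalNluCache`, `interpret`, and `min_count`, and the use of plain strings for NLU results, are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, Optional


class LocalNluCache:
    """Toy local store of ASR output and the NLU result it produced."""

    def __init__(self, min_count: int = 2) -> None:
        self.min_count = min_count          # hypothetical threshold value
        self._seen: Counter = Counter()     # times each NLU result was received
        self._cache: Dict[str, str] = {}    # ASR text -> NLU result

    def lookup(self, asr_text: str) -> Optional[str]:
        """Return the cached NLU result for this ASR output, if any."""
        return self._cache.get(asr_text)

    def record_remote(self, asr_text: str, nlu_result: str) -> None:
        """Count a remotely produced NLU result and cache the pair once the
        count exceeds the threshold (the rule described in claim 15)."""
        self._seen[nlu_result] += 1
        if self._seen[nlu_result] > self.min_count:
            self._cache[asr_text] = nlu_result


def interpret(asr_text: str, cache: LocalNluCache,
              remote_nlu: Callable[[str], str]) -> str:
    """Use the local cache when possible; otherwise ask the remote engine."""
    cached = cache.lookup(asr_text)
    if cached is not None:
        return cached                       # no remote round trip
    result = remote_nlu(asr_text)
    cache.record_remote(asr_text, result)
    return result
```

With `min_count=2`, the first three requests for a given utterance go to the remote engine, and the fourth is served locally, which is where the latency saving would come from.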

18. A method, comprising: receiving, by an input interface of a computing device, first audio comprising speech from a user of the computing device; detecting, by an automatic speech recognition (ASR) engine of the computing device, an end of speech in the first audio; generating, by the ASR engine, an ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech; determining whether a valid action can be performed by a speech-enabled application installed on the computing device using the ASR result; instructing the ASR engine to process second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the ASR result; creating a first hint based, at least in part, on the first ASR result, wherein the first hint prompts the user for speech input corresponding to a valid action that can be performed by the speech-enabled application; and presenting the first hint via a user interface of the computing device.

19. A non-transitory computer-readable storage medium encoded with a plurality of instructions that, when executed by a computing device, performs a method, the method comprising: receiving first audio comprising speech from a user of the computing device; detecting an end of speech in the first audio; generating an ASR result based, at least in part, on a portion of the first audio prior to the detected end of speech; determining whether a valid action can be performed by a speech-enabled application installed on the computing device using the ASR result; processing second audio when it is determined that a valid action cannot be performed by the speech-enabled application using the ASR result; creating a first hint based, at least in part, on the first ASR result, wherein the first hint prompts the user for speech input corresponding to a valid action that can be performed by the speech-enabled application; and presenting the first hint via a user interface of the computing device.
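
As an informal illustration of the flow shared by claims 1, 18, and 19 (not part of the patent text), the sketch below checks whether the first ASR result supports a valid action; if it does not, it presents a hint and processes second audio. Combining the first and second results before re-checking follows claim 7. The function names, the hint wording, and the callback signatures are hypothetical.

```python
from typing import Callable, Optional


def make_hint(asr_text: str) -> str:
    """Build a hint from the first ASR result. The wording is hypothetical."""
    return f'Say more to complete "{asr_text} ..."'


def handle_speech(first_asr_result: str,
                  can_perform: Callable[[str], bool],
                  listen_again: Callable[[], Optional[str]],
                  present: Callable[[str], None]) -> Optional[str]:
    """Return the text that supports a valid action, or None if there is none.

    can_perform  -- stands in for the application's validity check
    listen_again -- stands in for ASR on second audio; None means no speech
    present      -- stands in for the user interface that shows the hint
    """
    if can_perform(first_asr_result):
        return first_asr_result
    # No valid action yet: prompt the user and keep listening.
    present(make_hint(first_asr_result))
    second_asr_result = listen_again()
    if second_asr_result is None:
        return None
    combined = f"{first_asr_result} {second_asr_result}"
    return combined if can_perform(combined) else None


# Usage: "call" alone is incomplete, so a hint is shown and "mom" completes it.
shown = []
result = handle_speech("call", lambda t: t == "call mom",
                       lambda: "mom", shown.append)
print(result)  # call mom
```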

Patent Metadata

Filing Date

Unknown

Publication Date

May 30, 2017

Inventors

Mark Fanty
