Techniques for implementing multiple wakeword detectors on a single device are described. A digital signal processor (DSP) of the device may implement a wakeword detection component to detect when captured speech includes a wakeword. A companion application installed on the device may implement a wakeword detection component trained using speech of a user of the device. If the DSP's wakeword detection component detects a wakeword in speech, the companion application's wakeword detection component may be used to determine whether the wakeword was spoken by the user of the device. If the companion application's wakeword detection component determines the user spoke the wakeword, audio data representing the speech may be sent to at least one server(s) for processing.
Legal claims defining the scope of protection, as filed with the USPTO.
. A computer-implemented method, comprising:
. The computer-implemented method of, wherein receiving the input data comprises receiving audio data representing user speech.
. The computer-implemented method of, wherein implementing the first machine learning model occurs prior to receiving the input data.
. The computer-implemented method of, wherein the first machine learning model comprises a wakeword detection model.
. The computer-implemented method of, further comprising:
. The computer-implemented method of, wherein the first data corresponds to a device associated with the first user.
. The computer-implemented method of, wherein device corresponds to a key.
. The computer-implemented method of, wherein the first machine learning model is associated with a first speech processing component of a plurality of speech processing components.
. The computer-implemented method of, wherein the first machine learning model is associated with a first companion application of a plurality of applications.
. The computer-implemented method of, further comprising:
. A system comprising:
. The system of, wherein receiving the input data comprises receiving audio data representing user speech.
. The system of, wherein implementing the first machine learning model occurs prior to receiving the input data.
. The system of, the first machine learning model comprises a wakeword detection model.
. The system of, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to:
. The system of, wherein the first data corresponds to a device associated with the first user.
. The system of, wherein device corresponds to a key.
. The system of, wherein the first machine learning model is associated with a first speech processing component of a plurality of speech processing components.
. The system of, wherein the first machine learning model is associated with a first companion application of a plurality of applications.
. The system of, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to:
Complete technical specification and implementation details from the patent document.
This application is a continuation of, and claims priority to, U.S. patent application Ser. No. 18/323,625, entitled “WAKEWORD DETECTION,” filed on May 25, 2023, which is a continuation of U.S. patent application Ser. No. 17/493,188, entitled “WAKEWORD DETECTION,” filed on Oct. 4, 2021, now issued as U.S. Pat. No. 11,694,679, which is a continuation of U.S. patent application Ser. No. 16/778,057, entitled “WAKEWORD DETECTION,” filed on Jan. 31, 2020, now issued as U.S. Pat. No. 11,151,988, which is a continuation of U.S. patent application Ser. No. 16/017,160, entitled “WAKEWORD DETECTION,” filed on Jun. 25, 2018, now issued as U.S. Pat. No. 10,762,896. The contents of the above applications are expressly incorporated herein by reference in their entirety.
Speech recognition systems have progressed to the point where humans can interact with computing devices using their voices. Such systems employ techniques to identify the words spoken by a human user based on the various qualities of a received audio input. Speech recognition processing combined with natural language understanding processing enables speech-based user control of a computing device to perform tasks based on the user's spoken commands. The combination of speech recognition processing and natural language understanding processing techniques is referred to herein as speech processing. Speech processing may also involve converting a user's speech into text data which may then be provided to speechlets.
Speech processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.
Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data representing speech into text data representative of that speech. Natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from text data containing natural language. Text-to-speech (TTS) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to output synthesized speech. ASR, NLU, and TTS may be used together as part of a speech processing system.
Certain systems may be configured to perform actions responsive to user inputs. For example, a system may be configured to output weather information in response to a user input corresponding to “Alexa, what is the weather.” For further example, a system may be configured to output music performed by Adele in response to a user input corresponding to “Alexa, play Adele music.”
A device may be configured to receive a spoken user input, detect a wakeword (such as a keyword) in the user input, and perform an action in response to detecting the wakeword. For example, in response to the device detecting the wakeword, the device may send audio data, representing the user input, to a server(s) for processing (e.g., speech processing and command processing).
Certain devices may be configured with one or more wakeword detectors that need to be trained with respect to a particular user of the device. The user may be requested to speak one or more particular utterances, words, or the like to the device. The device uses the requested speech to train the wakeword detector(s).
The present disclosure improves present devices by implementing multiple wakeword detectors on a single device and selectively using the wakeword detectors at different times. Doing so results in more accurate wakeword detection and provides greater control over battery usage of the device.
When a device is manufactured, the device may include a digital signal processor (DSP) configured with an untrained wakeword detector. At some point in time, the user may download a companion application to the device.
A companion application enables a device, not previously in communication with a server(s), to communicate with the server(s) via the companion application. An example companion application is the Amazon Alexa application that may be installed on various smart phones, tablets, etc.
After the companion application is downloaded, the wakeword detector of the DSP may be trained using the user's speech. The user may be requested to speak particular utterances to the device and the device may use the utterances to train the wakeword detector of the DSP. The trained wakeword detector may be wholly implemented by the DSP or may be distributed across the DSP and a software level of the device.
At some point in time, the companion application may be configured to implement its own wakeword detector. The companion application's wakeword detector may be pushed to the device in the form of a software update. The companion application's wakeword detector may need to be trained with respect to the user of the device prior to it being used at runtime.
The wakeword detector of the DSP (and optionally implemented in the application software level of the device) may be used at runtime until the companion application's wakeword detector is trained using speech of the user. When the wakeword detector of the DSP (and optionally application software level) detects a wakeword, the wakeword detector makes audio data representing the spoken wakeword accessible to the companion application's wakeword detector for training purposes.
The companion application's wakeword detector may require audio data representing numerous spoken wakewords in order for the companion application's wakeword detector to be trained. Once the companion application's wakeword detector is trained, the device may deactivate the user trained model(s) of the DSP's (and optionally application software level's) wakeword detector. The companion application's wakeword detector may be considered better than the wakeword detector implemented in the DSP (and optionally application software level) because the companion application's wakeword detector may have access to additional input signals.
A system implementing the present disclosure may require user permission to perform the teachings herein. That is, a system may require a user opt in, with informed consent, prior to the system being able to implement the teachings herein with respect to the user.
The figures illustrate a system configured to detect user inputs using different wakeword detectors. Although the figures and discussion of the present disclosure illustrate certain operational steps of the system in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the intent of the disclosure. A device local to a user may communicate with one or more servers across one or more networks.
At some point in time, the device may detect a wakeword in received audio. An example wakeword is “Alexa.” One or more different wakewords may also be used. After detecting the wakeword, the device may determine whether a companion application is installed on the device. Such companion application may enable the device to communicate with the server(s) via the network(s). If the device determines the companion application is already installed, the device may provide the user with a prompt to provide the user's login information. The requested login information may be associated with a user profile stored by the server(s). Once the device receives the login information, the device may send the login information (or an indication of same) to the server(s). The server(s) may then update a user profile (associated with the login information in a profile storage discussed in detail below) to include updated endpoint information.
If the device determines the companion application is not installed on the device, the device may prompt the user to indicate whether the user wants the companion application downloaded on the device. Thereafter, the device may receive a user input to install the companion application on the device. The user input may correspond to speech of the user or the user input may be a tactile input (e.g., selection of a virtual button on a touch sensitive interface of the device). The device may download the companion application from an application catalog and may install the companion application in an application software level of the device.
After the companion application is installed, the device may receive user input requesting the companion application be associated with a particular user account. For example, the user input may correspond to a login request or a request to create a new user account.
The device may initially be configured with a first wakeword detection component that is configured to detect a wakeword, generally. That is, the first wakeword detection component may not be trained with respect to any particular user.
After the device receives the user input requesting the companion application be associated with the particular user account, the device may output content requesting training of the first wakeword detection component using the user's speech. The content may be output as audio and/or may be displayed on a display associated with the device.
The device receives user input representing permission to train the first wakeword detection component using the user's speech. The permission may be spoken by the user (in which case the speech may be captured by the device as audio) or may be provided via a touch-sensitive surface of the device.
The device thereafter outputs content requesting the user provide particular speech a threshold number of times. The content may be output as audio and/or may be displayed on a display associated with the device. For example, the content may request the user speak a particular wakeword a number of times needed to train the first wakeword detection component with respect to how the user speaks the wakeword.
A wakeword is a word spoken by a user to elicit a certain function from a device or system. A keyword is an example of a specialized wakeword. For a wakeword, the associated function is typically to “wake” the device so that the device may capture audio following (or surrounding) the wakeword and send audio data to the server(s) for speech processing. For speech processing enabled systems, a single wakeword may be the only one recognized by the system, with all other words being processed using typical speech processing. In systems where other wakewords may be enabled, each respective wakeword may only be associated with a single respective function that is executed regardless of the operating context of the device. For example, saying “Alexa” (a wakeword) may activate speech processing components regardless of whatever else the system is doing. In another example, “shutdown” may be a configured keyword to shut off the system, also regardless of whatever else the system is doing.
The device receives audio representing the requested speech and trains the first wakeword detection component using audio data representing the received audio. The first wakeword detection component may be implemented in a DSP of the device. The first wakeword detection component may also be distributed across the DSP and an executable application (that may be related to the DSP). For example, the DSP may be configured to implement certain functionality using hardware while an executable application may be configured to implement certain functionality using software/firmware. Depending on the configuration of the device, the executable application may include an Android Package (APK) of the device. An APK is a package file format used by the Android operating system for distribution and installation of mobile applications and middleware. The executable application may also include an iPhone application (for use on an iPhone), a Windows mobile executable (for use on a Windows device), or other application. In one configuration, the executable application may be installed on the device by an original equipment manufacturer. Accordingly, one skilled in the art will appreciate that a device according to the present disclosure may implement a non-Android operating system and/or other software, firmware, or the like to perform the operations described herein.
While it is described that training of the first wakeword detection component may occur after the companion application is installed on the device, one skilled in the art will appreciate that the first wakeword detection component of the DSP (and optionally the executable application) may be trained prior to download and install of the companion application on the device. For example, the training steps described above may be performed during an initial setup of the device.
After the first wakeword detection component is trained, the user may speak a user input. The device receives audio representing the spoken user input and generates a vector representing the audio. The device may use the trained first wakeword detection component (implemented in the DSP and optionally the application software level) to determine the vector corresponds to a first wakeword audio signature specific to the trained first wakeword detection component. In some examples, the vector may be determined to correspond to the first wakeword audio signature if a similarity between the vector and the first wakeword audio signature satisfies a similarity threshold.
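As a minimal sketch of the similarity test described above, the following example compares a vector with a stored wakeword audio signature. The disclosure does not specify the similarity measure; cosine similarity, the function names, and the threshold value of 0.8 are assumptions chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def matches_signature(vector, signature, threshold=0.8):
    """Return True if the vector satisfies the similarity threshold.

    The threshold value is hypothetical; a real detector would tune it.
    """
    return cosine_similarity(vector, signature) >= threshold
```

A vector pointing in the same direction as the signature satisfies the threshold, while an orthogonal vector does not.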
At some point, the companion application may become configured to implement its own wakeword detection component. The companion application's wakeword detection component may be trained using speech of the user. If the companion application is configured with its own wakeword detection component, the device may use audio data, representing the spoken user input, or the vector to at least partially train the second wakeword detection component implemented by the companion application with respect to how the user speaks the wakeword.
The device sends, via the companion application, the audio data (representing the spoken user input) to the server(s). The server(s) performs speech processing on the audio data and determines first content responsive to the user input represented in the audio data. The device receives the first content from the server(s), for example via the companion application. The device thereafter outputs the first content to the user.
The second wakeword detection component (implemented by the companion application) may require various sample audio data and/or vectors representing the wakeword, as spoken by the user, to become trained. Thus, the foregoing steps may be performed with respect to multiple user inputs until the second wakeword detection component is trained.
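The accumulation of samples until the second wakeword detection component is trained can be sketched as follows. This is an illustrative assumption rather than the disclosed training procedure: the class name, the required sample count, and the use of an element-wise mean as the resulting audio signature are all hypothetical.

```python
class CompanionDetectorTrainer:
    """Collects wakeword vectors until enough exist to form a signature."""

    def __init__(self, required_samples=5):
        # Hypothetical number of detected wakewords needed for training.
        self.required_samples = required_samples
        self.samples = []

    @property
    def is_trained(self):
        return len(self.samples) >= self.required_samples

    def add_sample(self, vector):
        """Store a vector passed along by the first detector.

        Returns True once enough samples have been collected.
        """
        if not self.is_trained:
            self.samples.append(list(vector))
        return self.is_trained

    def signature(self):
        """Element-wise mean of the samples (an assumed training rule)."""
        if not self.is_trained:
            raise ValueError("not enough samples to build a signature")
        dims = len(self.samples[0])
        count = len(self.samples)
        return [sum(s[d] for s in self.samples) / count for d in range(dims)]
```

Each time the first wakeword detection component detects a wakeword, the device would call `add_sample`; once it returns True, the user trained model(s) of the first component become candidates for deactivation.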
After the second wakeword detection component (implemented by the companion application of the device) is trained, the device may deactivate the model(s) of the first wakeword detection component trained with respect to the user's voice (or may simply adjust the first wakeword detection component's accuracy). Such deactivation results in the first wakeword detection component functioning as it did prior to training using the user's voice. That is, upon deactivation of the user trained model(s), the first wakeword detection component may detect spoken wakewords without specific regard to how the user speaks the wakeword.
By deactivating the user trained model(s) of the first wakeword detection component, the device is able to save on battery usage. In some instances, leaving the user trained model(s) of the first wakeword detection component activated but adjusting its accuracy may save on battery usage. By deactivating the user trained model(s) of the first wakeword detection component while still implementing the first wakeword detection component, the first wakeword detection component may act as a gatekeeper to invocation of the second wakeword detection component. The first wakeword detection component may use less computing resources than the second wakeword detection component. Thus, invoking the first wakeword detection component each time speech is detected and only invoking the second wakeword detection component if the first wakeword detection component detects the wakeword in the speech may result in less battery usage than invoking the second wakeword detection component each time speech is detected.
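The gatekeeper arrangement described above can be sketched as a two-stage cascade. The class and attribute names are hypothetical, and the two detectors are represented by plain callables that stand in for the DSP-level and companion-application components.

```python
class CascadedDetector:
    """Runs a cheap first-stage detector before a costlier second stage."""

    def __init__(self, first_stage, second_stage):
        self.first_stage = first_stage    # generic detector (e.g., on the DSP)
        self.second_stage = second_stage  # user-specific detector (companion app)
        self.second_stage_calls = 0       # tracks how often the costly stage runs

    def detect(self, vector):
        # The second stage is skipped entirely when the first stage rejects.
        if not self.first_stage(vector):
            return False
        self.second_stage_calls += 1
        return self.second_stage(vector)


# Toy stand-ins: element 0 models "is a wakeword", element 1 models "is this user".
detector = CascadedDetector(
    first_stage=lambda v: v[0] > 0.5,
    second_stage=lambda v: v[1] > 0.5,
)
```

Because the second stage runs only when the first stage fires, `second_stage_calls` stays well below the number of speech events whenever most speech does not contain the wakeword, which is the source of the battery saving described above.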
After the second wakeword detection component is trained, the user may speak a further user input. The device receives the audio representing the further user input and generates a vector representing the audio. The device may use the first wakeword detection component to determine the vector corresponds to a wakeword. Thereafter, the device may use the trained second wakeword detection component to determine the vector corresponds to a second wakeword audio signature specific to the trained second wakeword detection component (e.g., may determine the vector corresponds to a voice of the user whose speech was used to train the second wakeword detection component). In some examples, the vector may be determined to correspond to the second wakeword audio signature if a similarity between the vector and the second wakeword audio signature satisfies a similarity threshold. Since the training data used to train the second wakeword detection component is different from the training data used to train the first wakeword detection component, the second wakeword audio signature used by the second wakeword detection component may be different from the first wakeword audio signature used by the first wakeword detection component.
The device sends, via the companion application, audio data (representing the further spoken user input) to the server(s). The server(s) performs speech processing on the audio data and determines second content responsive to the user input represented in the audio data. The device receives the second content from the server(s), for example via the companion application. The device thereafter outputs the second content to the user. As can be appreciated, in certain embodiments the DSP, executable application, and companion application may be different components while in other embodiments certain of their functionality may be combined into fewer components.
The system may operate using various components as described below. The various components may be located on the same or different physical devices. Communication between various components may occur directly or across a network(s).
An audio capture component(s), such as a microphone or array of microphones of a device, captures audio. The device processes audio data, representing the audio, to determine whether speech is detected. The device may use various techniques to determine whether audio data includes speech. In some examples, the device may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in audio data based on various quantitative aspects of the audio data, such as the spectral slope between one or more frames of the audio data; the energy levels of the audio data in one or more spectral bands; the signal-to-noise ratios of the audio data in one or more spectral bands; or other quantitative aspects. In other examples, the device may implement a limited classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other examples, the device may apply Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques to compare the audio data to one or more acoustic models in storage, which acoustic models may include models corresponding to speech, noise (e.g., environmental noise or background noise), or silence. Still other techniques may be used to determine whether speech is present in audio data.
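One of the quantitative aspects mentioned above, the energy level of the audio data, can be illustrated with a simple frame-energy VAD. This is a simplified sketch that uses full-band energy rather than per-band energy; the frame length, decibel threshold, and minimum frame count are hypothetical values, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame, in decibels relative to full scale."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)  # epsilon avoids log(0)


def detect_speech(samples, frame_len=160, threshold_db=-30.0, min_speech_frames=3):
    """Return True if enough frames exceed the energy threshold.

    All parameter defaults are illustrative assumptions.
    """
    speech_frames = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_energy_db(frame) > threshold_db:
            speech_frames += 1
    return speech_frames >= min_speech_frames
```

Energy alone cannot separate speech from loud non-speech noise, which is one reason the classifier-based and HMM/GMM-based techniques described above may be used instead of, or in addition to, a threshold of this kind.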
Once speech is detected in audio data representing the audio, the device may use a wakeword detection component to perform wakeword detection to determine when a user intends to speak an input to the device. An example wakeword is “Alexa.”
Wakeword detection is typically performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, the audio data representing the audio is analyzed to determine if specific characteristics of the audio data match preconfigured acoustic waveforms, audio signatures, or other data to determine if the audio data “matches” stored audio data corresponding to a wakeword.
Thus, the wakeword detection component may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode audio signals, with wakeword searching being conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword detection builds HMMs for each wakeword and non-wakeword speech signals, respectively. The non-wakeword speech includes other spoken words, background noise, etc. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search the best path in the decoding graph, and the decoding output is further processed to make the decision on wakeword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another example, the wakeword detection component may be built on deep neural network (DNN)/recurrent neural network (RNN) structures directly, without HMM being involved. Such an architecture may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for DNN, or using RNN. Follow-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
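The follow-on posterior smoothing and thresholding mentioned above can be sketched as below. The per-frame posteriors are assumed to come from a DNN or RNN that is not shown; the trailing moving-average smoother, the window size, and the threshold are assumptions made for illustration.

```python
def smooth_posteriors(posteriors, window=3):
    """Trailing moving average over per-frame wakeword posteriors."""
    smoothed = []
    for i in range(len(posteriors)):
        start = max(0, i - window + 1)
        chunk = posteriors[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def wakeword_detected(posteriors, threshold=0.7, window=3):
    """Return True if any smoothed posterior reaches the threshold.

    The threshold and window values are hypothetical tuning parameters.
    """
    return any(p >= threshold for p in smooth_posteriors(posteriors, window))
```

Smoothing suppresses isolated single-frame spikes, so a sustained run of high posteriors is needed before the detector fires.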
In an example, the wakeword detection component may implement a Hidden Markov Model (HMM) with a foreground wakeword path and a background speech/nonspeech path. The foreground wakeword path represents states corresponding to phonemes of the wakeword. Although other wakewords may be used, in this example the wakeword is “Alexa” represented by phonemes of <AX> for the initial “A” sound of “Alexa,” <L> for the “L” sound, <EH> for the “E” sound, <K> and <S> for the combined “X” sound, followed by <AX> for the final “A” sound. Viterbi decoding may be performed for the competing foreground wakeword path and background speech/nonspeech path, and a wakeword hypothesis may be triggered when a log-likelihood ratio of the foreground path versus the background path exceeds a predetermined threshold. Once the ratio exceeds the predetermined threshold, features may be extracted from the audio data and fed into one or more second stage classifiers, which could be a support vector machine (SVM) or deep neural network (DNN). The second stage classifier(s) may determine if the features correspond to a wakeword or not.
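The first-stage decision described above can be sketched with a small log-domain Viterbi scorer. This is a simplified illustration under stated assumptions: each path is modeled as a left-to-right HMM with uniform stay/advance transition probabilities, the per-frame emission log-likelihoods are supplied by an acoustic model that is not shown, and all names and values are hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi_log_score(emission_logps, stay_logp, advance_logp):
    """Best-path log score through a left-to-right HMM.

    emission_logps[t][s] is log P(frame t | state s). The path must start
    in state 0 and end in the last state.
    """
    n_states = len(emission_logps[0])
    scores = [NEG_INF] * n_states
    scores[0] = emission_logps[0][0]
    for frame in emission_logps[1:]:
        new_scores = []
        for s in range(n_states):
            stay = scores[s] + stay_logp
            advance = scores[s - 1] + advance_logp if s > 0 else NEG_INF
            new_scores.append(max(stay, advance) + frame[s])
        scores = new_scores
    return scores[-1]


def log_likelihood_ratio(fg_emissions, bg_emissions,
                         stay_logp=math.log(0.5), advance_logp=math.log(0.5)):
    """Foreground (wakeword) path score minus background path score."""
    fg = viterbi_log_score(fg_emissions, stay_logp, advance_logp)
    bg = viterbi_log_score(bg_emissions, stay_logp, advance_logp)
    return fg - bg


def wakeword_hypothesis(fg_emissions, bg_emissions, threshold):
    """Trigger a hypothesis when the ratio exceeds the threshold."""
    return log_likelihood_ratio(fg_emissions, bg_emissions) > threshold
```

For the wakeword “Alexa,” the foreground model would have one state per phoneme (<AX>, <L>, <EH>, <K>, <S>, <AX>), and a triggered hypothesis would then be passed to the second stage classifier(s) for confirmation.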
Once the wakeword is detected, the device may “wake” and begin transmitting audio data, representing the audio, to the server(s). The audio data may include data corresponding to the wakeword, or the portion of the audio corresponding to the wakeword may be removed by the device prior to sending the audio data to the server(s).
Upon receipt by the server(s), the audio data may be sent to an orchestrator component. The orchestrator component may include memory and logic that enables the orchestrator component to transmit various pieces and forms of data to various components of the system, as well as perform other operations as described herein.
The orchestrator component sends the audio data to an ASR component. The ASR component transcribes the audio data into text data. The text data output by the ASR component represents one or more than one (e.g., in the form of an N-best list) ASR hypotheses representing speech represented in the audio data. The ASR component interprets the speech in the audio data based on a similarity between the audio data and pre-established language models. For example, the ASR component may compare the audio data with models for sounds (e.g., subword units, such as phonemes, etc.) and sequences of sounds to identify words that match the sequence of sounds of the speech represented in the audio data. The ASR component sends the text data generated thereby to an NLU component, for example via the orchestrator component. The text data sent from the ASR component to the NLU component may include a top scoring ASR hypothesis or may include an N-best list including multiple ASR hypotheses. An N-best list may additionally include a respective score associated with each ASR hypothesis represented therein. Each score may indicate a confidence of ASR processing performed to generate the ASR hypothesis with which the score is associated.
The NLU component attempts to make a semantic interpretation of the phrase(s) or statement(s) represented in the text data input therein. That is, the NLU component determines one or more meanings associated with the phrase(s) or statement(s) represented in the text data based on words represented in the text data. The NLU component determines an intent representing an action that a user desires be performed as well as pieces of the text data that allow a device (e.g., the device, the server(s), a speechlet component, a skill server(s), etc.) to execute the intent. For example, if the text data corresponds to “play Adele music,” the NLU component may determine an intent that the system output music and may identify “Adele” as an artist. For further example, if the text data corresponds to “what is the weather,” the NLU component may determine an intent that the system output weather information associated with a geographic location of the device. In another example, if the text data corresponds to “turn off the lights,” the NLU component may determine an intent that the system turn off lights associated with the device or the user.
The NLU component may send NLU results data (which may include tagged text data, indicators of intent, etc.) to a speechlet component(s). If the NLU results data includes a single NLU hypothesis, the NLU component may send the NLU results data to the speechlet component(s) associated with the NLU hypothesis. If the NLU results data includes an N-best list of NLU hypotheses, the NLU component may send the top scoring NLU hypothesis to a speechlet component(s) associated with the top scoring NLU hypothesis.
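The routing of a top scoring NLU hypothesis to its associated speechlet component can be sketched as follows. The data structure, intent names, and handler functions are hypothetical and serve only to illustrate selection from an N-best list.

```python
from dataclasses import dataclass, field


@dataclass
class NluHypothesis:
    intent: str                                 # e.g., "PlayMusic" (hypothetical name)
    slots: dict = field(default_factory=dict)   # tagged text data, e.g., artist
    score: float = 0.0                          # confidence of NLU processing


def route(n_best, speechlets):
    """Send the top scoring hypothesis to the speechlet registered for its intent."""
    top = max(n_best, key=lambda h: h.score)
    handler = speechlets[top.intent]
    return handler(top)


# Toy speechlet registry keyed by intent name (all names are assumptions).
speechlets = {
    "PlayMusic": lambda h: f"playing {h.slots['artist']}",
    "GetWeather": lambda h: "reporting weather",
}

n_best = [
    NluHypothesis("GetWeather", {}, 0.2),
    NluHypothesis("PlayMusic", {"artist": "Adele"}, 0.9),
]
result = route(n_best, speechlets)
```

A single-hypothesis result is simply the case where the N-best list has one entry.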
A “speechlet component” may be software running on the server(s) that is akin to a software application running on a traditional computing device. That is, a speechlet component may enable the server(s) to execute specific functionality in order to provide data or produce some other requested output. The server(s) may be configured with more than one speechlet component. For example, a weather service speechlet component may enable the server(s) to provide weather information, a car service speechlet component may enable the server(s) to book a trip with respect to a taxi or ride sharing service, a restaurant speechlet component may enable the server(s) to order a pizza with respect to the restaurant's online ordering system, etc. A speechlet component may operate in conjunction between the server(s) and other devices, such as the device, in order to complete certain functions. Inputs to a speechlet component may come from speech processing interactions or through other interactions or input sources. A speechlet component may include hardware, software, firmware, or the like that may be dedicated to a particular speechlet component or shared among different speechlet components.
A skill server(s) may communicate with a speechlet component(s) within the server(s) and/or directly with the orchestrator component or with other components. A skill server(s) may be configured to perform one or more actions. An ability to perform such action(s) may sometimes be referred to as a “skill.” That is, a skill may enable a skill server(s) to execute specific functionality in order to provide data or perform some other action requested by a user. For example, a weather service skill may enable a skill server(s) to provide weather information to the server(s), a car service skill may enable a skill server(s) to book a trip with respect to a taxi or ride sharing service, an order pizza skill may enable a skill server(s) to order a pizza with respect to a restaurant's online ordering system, etc. Additional types of skills include home automation skills (e.g., skills that enable a user to control home devices such as lights, door locks, cameras, thermostats, etc.), entertainment device skills (e.g., skills that enable a user to control entertainment devices such as smart televisions), video skills, flash briefing skills, as well as custom skills that are not associated with any pre-configured type of skill.
The server(s) may be configured with a speechlet component dedicated to interacting with the skill server(s).
Unless expressly stated otherwise, reference to a speechlet, speechlet device, or speechlet component may include a speechlet component operated by the server(s) and/or skill operated by the skill server(s). Moreover, the functionality described herein as a speechlet or skill may be referred to using many different terms, such as an action, bot, app, or the like.
The server(s) may include a TTS component that generates audio data (e.g., synthesized speech) from text data using one or more different methods. Text data input to the TTS component may come from a speechlet component, the orchestrator component, or another component of the system.
March 31, 2026