A method may involve receiving output signals from each microphone of a plurality of microphones in the environment, each of the plurality of microphones residing in a microphone location of the environment, the output signals corresponding to an utterance of a person. The method may involve determining, based at least in part on the output signals, a zone within the environment that has at least a threshold probability of including the person's location and generating a plurality of spatially-varying attentiveness signals within the zone. Each attentiveness signal may be generated by a device located within the zone. Each attentiveness signal may indicate that a corresponding device is in an operating mode in which the corresponding device is awaiting a command and may indicate a relevance metric of the corresponding device.
Legal claims defining the scope of protection, as filed with the USPTO.
. A method of controlling a system of devices in an environment, the method comprising:
. The method of, wherein an attentiveness signal generated by a first device indicates a relevance metric of a second device, the second device being the corresponding device.
. The method of, wherein the relevance metric is based, at least in part, on an estimated visibility of the corresponding device.
. The method of, wherein the utterance comprises a wakeword.
. The method of, wherein at least one of the plurality of spatially-varying attentiveness signals comprises a modulation of at least one previous signal generated by the device located within the zone prior to a time of the utterance.
. The method of, wherein the at least one previous signal comprises a light signal and wherein the modulation comprises at least one of a color modulation, a color saturation modulation or a light intensity modulation.
. The method of, wherein at least one microphone of the plurality of microphones is included in or configured for communication with a smart audio device.
. The method of, further comprising an automated process of determining whether the device is in a device group.
. The method of, wherein the automated process is based, at least in part, on sensor data corresponding to at least one of light or sound emitted by the device, or, wherein the automated process is based, at least in part, on communications between at least one of a source and an orchestrating hub device or a receiver and the orchestrating hub device, or, wherein the automated process is based, at least in part, on a light source or a sound source being switched on and off for a duration of time.
. The method of, further comprising automatically updating the automated process according to implicit feedback based on one or more of a success of beamforming based on an estimated zone, a success of microphone selection based on the estimated zone, a determination that the person has terminated a response of a voice assistant abnormally, a command recognizer returning a low-confidence result or a second-pass retrospective wakeword detector returning low confidence that a wakeword was spoken.
. The method of, further comprising selecting at least one speaker of the device located within the zone and controlling the at least one speaker to provide sound to the person.
. The method of, further comprising selecting at least one microphone of the device located within the zone and providing signals output by the at least one microphone to a smart audio device.
. The method of, wherein a first microphone of the plurality of microphones samples audio data according to a first sample clock and a second microphone of the plurality of microphones samples the audio data according to a second sample clock.
. An apparatus configured to perform the method of.
. A system of devices configured to perform the method of.
. One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to perform the method of.
Complete technical specification and implementation details from the patent document.
This application claims priority to U.S. Provisional Patent Application No. 62/880,110 filed 30 Jul. 2019; U.S. Provisional Patent Application No. 62/880,112 filed 30 Jul. 2019; U.S. Provisional Patent Application No. 62/964,018 filed 21 Jan. 2020; and U.S. Provisional Patent Application No. 63/003,788 filed 1 Apr. 2020, which are incorporated herein by reference.
This disclosure pertains to systems and methods for automatically controlling a plurality of smart audio devices in an environment.
Audio devices, including but not limited to smart audio devices, have been widely deployed and are becoming common features of many homes. Although existing systems and methods for controlling audio devices provide benefits, improved systems and methods would be desirable.
Herein, we use the expression “smart audio device” to denote a smart device which is either a single purpose audio device or a virtual assistant (e.g., a connected virtual assistant). A single purpose audio device is a device (e.g., a smart speaker, a television (TV) or a mobile phone) including or coupled to at least one microphone (and which may in some examples also include or be coupled to at least one speaker) and which is designed largely or primarily to achieve a single purpose. Although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications run locally, including the application of watching television. Similarly, the audio input and output in a mobile phone may do many things, but these are serviced by the applications running on the phone. In this sense, a single purpose audio device having speaker(s) and microphone(s) is often configured to run a local application and/or service to use the speaker(s) and microphone(s) directly. Some single purpose audio devices may be configured to group together to achieve playing of audio over a zone or user-configured area.
Herein, a “virtual assistant” (e.g., a connected virtual assistant) is a device (e.g., a smart speaker, a smart display or a voice assistant integrated device) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker) and which may provide an ability to utilize multiple devices (distinct from the virtual assistant) for applications that are in a sense cloud enabled or otherwise not implemented in or on the virtual assistant itself. Virtual assistants may sometimes work together, e.g., in a very discrete and conditionally defined way. For example, two or more virtual assistants may work together in the sense that one of them, i.e., the one which is most confident that it has heard a wakeword, responds to the word. Connected devices may form a sort of constellation, which may be managed by one main application which may be (or include or implement) a virtual assistant.
Herein, “wakeword” is used in a broad sense to denote any sound (e.g., a word uttered by a human, or some other sound), where a smart audio device is configured to awake in response to detection of (“hearing”) the sound (using at least one microphone included in or coupled to the smart audio device, or at least one other microphone). In this context, to “awake” denotes that the device enters a state in which it awaits (i.e., is listening for) a sound command. In some instances, what may be referred to herein as a “wakeword” may include more than one word, e.g., a phrase.
Herein, the expression “wakeword detector” denotes a device configured (or software that includes instructions for configuring a device) to search continuously for alignment between real-time sound (e.g., speech) features and a trained model. Typically, a wakeword event is triggered whenever it is determined by a wakeword detector that the probability that a wakeword has been detected exceeds a predefined threshold. For example, the threshold may be a predetermined threshold which is tuned to give a good compromise between rates of false acceptance and false rejection. Following a wakeword event, a device might enter a state (which may be referred to as an “awakened” state or a state of “attentiveness”) in which it listens for a command and passes on a received command to a larger, more computationally-intensive recognizer.
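By way of illustration only, the thresholding behavior described above may be sketched as follows. The class name `WakewordGate`, the threshold value of 0.8 and the probability inputs are assumptions introduced for this example; the disclosure does not specify them, and the trained model that would produce the probabilities is not shown.

```python
from dataclasses import dataclass


@dataclass
class WakewordGate:
    """Hypothetical wrapper around the probability output of a wakeword detector."""

    threshold: float = 0.8   # illustrative value only; tuned in practice
    attentive: bool = False  # True once the device is awaiting a command

    def update(self, wakeword_probability: float) -> bool:
        """Return True if this probability triggers a wakeword event."""
        if not 0.0 <= wakeword_probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        fired = wakeword_probability > self.threshold
        if fired:
            # The device enters the "awakened" (attentive) state.
            self.attentive = True
        return fired


gate = WakewordGate(threshold=0.8)
events = [gate.update(p) for p in (0.10, 0.35, 0.92)]
```

In this sketch only the third probability exceeds the threshold, after which the gate remains in the attentive state. Raising the threshold would trade fewer false acceptances for more false rejections.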
Throughout this disclosure, including in the claims, “speaker” and “loudspeaker” are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), all driven by a single, common speaker feed. The speaker feed may, in some instances, undergo different processing in different circuitry branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
At least some aspects of the present disclosure may be implemented via methods, such as methods of controlling a system of devices in an environment. In some instances, the methods may be implemented, at least in part, by a control system such as those disclosed herein. Some such methods may involve receiving output signals from each microphone of a plurality of microphones in the environment. Each of the plurality of microphones may reside in a microphone location of the environment. The output signals may, in some examples, correspond to an utterance of a person. According to some examples, at least one of the microphones may be included in or configured for communication with a smart audio device. In some instances, a first microphone of the plurality of microphones may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock.
Some such methods may involve determining, based at least in part on the output signals, a zone within the environment that has at least a threshold probability of including the person's location. Some such methods may involve generating a plurality of spatially-varying attentiveness signals within the zone. In some instances, each attentiveness signal of the plurality of attentiveness signals may be generated by a device located within the zone. Each attentiveness signal may, for example, indicate that a corresponding device is in an operating mode in which the corresponding device is awaiting a command. In some examples, each attentiveness signal may indicate a relevance metric of the corresponding device.
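As a non-limiting sketch of the zone determination described above, the following example assumes that some upstream process has already reduced the microphone output signals to one score per candidate zone. The softmax normalization, the zone names, the scores and the 0.5 threshold are all assumptions made for illustration and are not taken from the disclosure.

```python
import math


def zone_posteriors(scores):
    """Normalize hypothetical per-zone scores into probabilities (softmax)."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {zone: math.exp(s - peak) for zone, s in scores.items()}
    total = sum(exps.values())
    return {zone: e / total for zone, e in exps.items()}


def select_zone(scores, threshold=0.5):
    """Return (zone, probability) for the most probable zone, or None
    if no zone reaches the threshold probability."""
    posteriors = zone_posteriors(scores)
    zone = max(posteriors, key=posteriors.get)
    if posteriors[zone] >= threshold:
        return zone, posteriors[zone]
    return None


scores = {"kitchen": 2.0, "couch": 0.5, "hallway": -1.0}
result = select_zone(scores)
```

Here the "kitchen" zone receives most of the probability mass and is selected. When two zones are equally likely and the threshold is above 0.5, the sketch returns `None` rather than forcing a hard decision.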
In some implementations, an attentiveness signal generated by a first device may indicate a relevance metric of a second device. The second device may, in some examples, be a device corresponding to the first device. In some instances, the utterance may be, or may include, a wakeword. According to some such examples, the attentiveness signals vary, at least in part, according to estimations of wakeword confidence.
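One possible way in which attentiveness signals could vary with wakeword confidence is sketched below. The normalization against the most confident device, the function name `attentiveness_levels` and the device names are assumptions chosen for this example; the disclosure does not prescribe a particular mapping.

```python
def attentiveness_levels(confidences, floor=0.0):
    """Scale each device's signal level by its wakeword confidence
    relative to the most confident device (an assumed mapping)."""
    top = max(confidences.values())
    if top <= 0.0:
        # No device heard anything resembling a wakeword.
        return {device: floor for device in confidences}
    return {device: max(floor, c / top) for device, c in confidences.items()}


levels = attentiveness_levels({"lamp": 0.9, "tv": 0.45, "speaker": 0.0})
```

In this example the lamp would express full attentiveness, the TV roughly half as much, and the speaker none.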
According to some examples, at least one of the attentiveness signals may be a modulation of at least one previous signal generated by a device within the zone prior to a time of the utterance. In some instances, the at least one previous signal may be, or may include, a light signal. According to some such examples, the modulation may be a color modulation, a color saturation modulation and/or a light intensity modulation.
In some instances, the at least one previous signal may be, or may include, a sound signal. According to some such examples, the modulation may be a level modulation. Alternatively, or additionally, the modulation may be a change in one or more of a fan speed, a flame size, a motor speed or an air flow rate.
According to some examples, the modulation may be what is referred to herein as a “swell.” A swell may be, or may include, a predetermined sequence of signal modulations. In some examples, the swell may include a first time interval corresponding to a signal level increase from a baseline level. According to some such examples, the swell may include a second time interval corresponding to a signal level decrease to the baseline level. In some instances, the swell may include a hold time interval after the first time interval and before the second time interval. The hold time interval may, in some instances, correspond to a constant signal level. In some examples, the swell may include a first time interval corresponding to a signal level decrease from a baseline level.
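The swell described above may, for example, be sketched as a piecewise-linear envelope. The linear ramps, the parameter names (`attack`, `hold`, `release`) and the numeric values are assumptions made for illustration; the disclosure specifies only the sequence of intervals, not their shape or duration.

```python
def swell_level(t, baseline, peak, attack, hold, release):
    """Signal level at time t (seconds) after a swell begins.

    attack:  duration of the first interval (baseline toward peak)
    hold:    duration of the constant-level interval (may be zero)
    release: duration of the second interval (peak back to baseline)
    A peak below the baseline gives a swell that first decreases.
    """
    if attack <= 0 or release <= 0 or hold < 0:
        raise ValueError("attack and release must be positive, hold non-negative")
    if t <= 0:
        return baseline
    if t < attack:
        return baseline + (peak - baseline) * (t / attack)
    if t < attack + hold:
        return peak
    if t < attack + hold + release:
        remaining = 1.0 - (t - attack - hold) / release
        return baseline + (peak - baseline) * remaining
    return baseline


# Sample a swell from a baseline of 0.2 up to 0.8 and back.
samples = [swell_level(t, 0.2, 0.8, attack=0.5, hold=1.0, release=1.0)
           for t in (0.0, 0.25, 1.0, 2.0, 3.0)]
```

Because the envelope is defined relative to a baseline, the same function could be applied to a light intensity, a sound level or another previously generated signal without overriding the device's current setting.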
According to some examples, the relevance metric may be based, at least in part, on an estimated distance from a location. In some instances, the location may be an estimated location of the person. In some examples, the estimated distance may be an estimated distance from the location to an acoustic centroid of a plurality of microphones within the zone. According to some implementations, the relevance metric may be based, at least in part, on an estimated visibility of the corresponding device.
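A minimal sketch of such a relevance metric follows. It assumes that the acoustic centroid is the arithmetic mean of the microphone positions, that relevance decays exponentially with distance, and that visibility is a factor between 0 and 1; none of these choices, nor the `scale` parameter, is specified in the disclosure.

```python
import math


def acoustic_centroid(mic_positions):
    """Mean of the microphone positions (an assumed, simple definition)."""
    n = len(mic_positions)
    dims = len(mic_positions[0])
    return tuple(sum(p[i] for p in mic_positions) / n for i in range(dims))


def relevance(location, mic_positions, visibility=1.0, scale=2.0):
    """Relevance in [0, 1]: decays with the distance from `location` to the
    acoustic centroid and is weighted by an estimated visibility factor."""
    distance = math.dist(location, acoustic_centroid(mic_positions))
    return visibility * math.exp(-distance / scale)


mics = [(0.0, 0.0), (2.0, 0.0)]          # hypothetical microphone positions (meters)
at_centroid = relevance((1.0, 0.0), mics)
two_meters_away = relevance((1.0, 2.0), mics)
half_hidden = relevance((1.0, 2.0), mics, visibility=0.5)
```

A device whose microphones are centered on the estimated location of the person receives a relevance of 1, and the value falls off smoothly with distance or reduced visibility rather than switching between devices abruptly.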
Some such methods may involve an automated process of determining whether a device is in a device group. According to some such examples, the automated process may be based, at least in part, on sensor data corresponding to light and/or sound emitted by the device. In some instances, the automated process may be based, at least in part, on communications between a source and a receiver. The source may, for example, be a light source and/or a sound source. According to some examples, the automated process may be based, at least in part, on communications between a source and an orchestrating hub device and/or a receiver and the orchestrating hub device. In some instances, the automated process may be based, at least in part, on a light source and/or a sound source being switched on and off for a duration of time.
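One way such an automated process could use a source being switched on and off is sketched below: a known on/off pattern is compared with readings from a sensor, and a strong correlation suggests that the source and the sensor share a space. The use of Pearson correlation, the 0.8 decision threshold and the sample data are assumptions for illustration only.

```python
def toggle_correlation(pattern, readings):
    """Pearson correlation between an on/off pattern (1 = on, 0 = off)
    and the sensor readings taken at the same instants."""
    n = len(pattern)
    if n != len(readings) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 samples")
    mean_p = sum(pattern) / n
    mean_r = sum(readings) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pattern, readings))
    var_p = sum((p - mean_p) ** 2 for p in pattern)
    var_r = sum((r - mean_r) ** 2 for r in readings)
    if var_p == 0 or var_r == 0:
        return 0.0  # a constant sequence carries no evidence either way
    return cov / (var_p * var_r) ** 0.5


def in_same_group(pattern, readings, min_correlation=0.8):
    """Assumed decision rule: group the source with the sensor's device
    when the readings track the on/off pattern closely."""
    return toggle_correlation(pattern, readings) >= min_correlation


pattern = [1, 0, 1, 1, 0, 0, 1, 0]                        # light switched on/off
near_sensor = [0.9, 0.2, 0.8, 0.9, 0.1, 0.2, 0.9, 0.1]    # tracks the pattern
far_sensor = [0.3, 0.4, 0.3, 0.4, 0.3, 0.4, 0.3, 0.4]     # unrelated flicker
```

With these hypothetical readings, the nearby sensor would be grouped with the light source and the distant sensor would not.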
Some such methods may involve automatically updating the automated process according to explicit feedback from the person. Alternatively, or additionally, some methods may involve automatically updating the automated process according to implicit feedback. The implicit feedback may, for example, be based on a success of beamforming based on an estimated zone, a success of microphone selection based on the estimated zone, a determination that the person has terminated the response of a voice assistant abnormally, a command recognizer returning a low-confidence result and/or a second-pass retrospective wakeword detector returning low confidence that a wakeword was spoken.
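As a hedged illustration of updating according to implicit feedback, the sketch below maintains a running reliability score that is nudged up or down by the kinds of events listed above. The event names, the exponential moving average and the update rate are assumptions introduced for this example; the disclosure does not state how the feedback is incorporated.

```python
# Hypothetical labels for the implicit feedback events named in the text.
POSITIVE_EVENTS = {"beamforming_success", "microphone_selection_success"}
NEGATIVE_EVENTS = {
    "abnormal_termination",           # person cut the assistant's response short
    "low_confidence_command",         # command recognizer was unsure
    "low_confidence_second_pass",     # retrospective wakeword check was unsure
}


def update_reliability(reliability, event, rate=0.1):
    """Move a reliability score in [0, 1] toward 1 on positive feedback
    and toward 0 on negative feedback (exponential moving average)."""
    if event in POSITIVE_EVENTS:
        target = 1.0
    elif event in NEGATIVE_EVENTS:
        target = 0.0
    else:
        raise ValueError(f"unknown feedback event: {event}")
    return (1.0 - rate) * reliability + rate * target


score = 0.5
score = update_reliability(score, "beamforming_success")
score = update_reliability(score, "abnormal_termination")
```

A small update rate keeps any single event from dominating, so the score reflects the trend of recent interactions rather than one misrecognition.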
Some methods may involve selecting at least one speaker of a device located within the zone and controlling the at least one speaker to provide sound to the person. Alternatively, or additionally, some methods may involve selecting at least one microphone of a device located within the zone. Some such methods may involve providing signals output by the at least one microphone to a smart audio device.
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented in a non-transitory medium having software stored thereon.
For example, the software may include instructions for controlling one or more devices to perform a method that involves controlling a system of devices in an environment. Some such methods may involve receiving output signals from each microphone of a plurality of microphones in the environment. Each of the plurality of microphones may reside in a microphone location of the environment. The output signals may, in some examples, correspond to an utterance of a person. According to some examples, at least one of the microphones may be included in or configured for communication with a smart audio device. In some instances, a first microphone of the plurality of microphones may sample audio data according to a first sample clock and a second microphone of the plurality of microphones may sample audio data according to a second sample clock.
Some such methods may involve determining, based at least in part on the output signals, a zone within the environment that has at least a threshold probability of including the person's location. Some such methods may involve generating a plurality of spatially-varying attentiveness signals within the zone. In some instances, each attentiveness signal of the plurality of attentiveness signals may be generated by a device located within the zone. Each attentiveness signal may, for example, indicate that a corresponding device is in an operating mode in which the corresponding device is awaiting a command. In some examples, each attentiveness signal may indicate a relevance metric of the corresponding device.
In some implementations, an attentiveness signal generated by a first device may indicate a relevance metric of a second device. The second device may, in some examples, be a device corresponding to the first device. In some instances, the utterance may be, or may include, a wakeword. According to some such examples, the attentiveness signals vary, at least in part, according to estimations of wakeword confidence.
According to some examples, at least one of the attentiveness signals may be a modulation of at least one previous signal generated by a device within the zone prior to a time of the utterance. In some instances, the at least one previous signal may be, or may include, a light signal. According to some such examples, the modulation may be a color modulation, a color saturation modulation and/or a light intensity modulation.
In some instances, the at least one previous signal may be, or may include, a sound signal. According to some such examples, the modulation may be a level modulation. Alternatively, or additionally, the modulation may be a change in one or more of a fan speed, a flame size, a motor speed or an air flow rate.
According to some examples, the modulation may be what is referred to herein as a “swell.” A swell may be, or may include, a predetermined sequence of signal modulations. In some examples, the swell may include a first time interval corresponding to a signal level increase from a baseline level. According to some such examples, the swell may include a second time interval corresponding to a signal level decrease to the baseline level. In some instances, the swell may include a hold time interval after the first time interval and before the second time interval. The hold time interval may, in some instances, correspond to a constant signal level. In some examples, the swell may include a first time interval corresponding to a signal level decrease from a baseline level.
According to some examples, the relevance metric may be based, at least in part, on an estimated distance from a location. In some instances, the location may be an estimated location of the person. In some examples, the estimated distance may be an estimated distance from the location to an acoustic centroid of a plurality of microphones within the zone. According to some implementations, the relevance metric may be based, at least in part, on an estimated visibility of the corresponding device.
Some such methods may involve an automated process of determining whether a device is in a device group. According to some such examples, the automated process may be based, at least in part, on sensor data corresponding to light and/or sound emitted by the device. In some instances, the automated process may be based, at least in part, on communications between a source and a receiver. The source may, for example, be a light source and/or a sound source. According to some examples, the automated process may be based, at least in part, on communications between a source and an orchestrating hub device and/or a receiver and the orchestrating hub device. In some instances, the automated process may be based, at least in part, on a light source and/or a sound source being switched on and off for a duration of time.
Some such methods may involve automatically updating the automated process according to explicit feedback from the person. Alternatively, or additionally, some methods may involve automatically updating the automated process according to implicit feedback. The implicit feedback may, for example, be based on a success of beamforming based on an estimated zone, a success of microphone selection based on the estimated zone, a determination that the person has terminated the response of a voice assistant abnormally, a command recognizer returning a low-confidence result and/or a second-pass retrospective wakeword detector returning low confidence that a wakeword was spoken.
Some methods may involve selecting at least one speaker of a device located within the zone and controlling the at least one speaker to provide sound to the person. Alternatively, or additionally, some methods may involve selecting at least one microphone of a device located within the zone. Some such methods may involve providing signals output by the at least one microphone to a smart audio device.
At least some aspects of the present disclosure may be implemented via apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus may include an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Like reference numbers and designations in the various drawings indicate like elements.
Some embodiments involve a system of orchestrated smart audio devices, in which each of the devices may be configured to indicate (to a user) when it has heard a “wakeword” and is listening for a sound command (i.e., a command indicated by sound) from the user.
A class of embodiments involves the use of voice based interfaces in various environments (e.g., relatively large living environments) where there is no single point of attention for the user interaction or user interface. As technology progresses towards extensive Internet of Things (IoT) automation and connected devices, there are many things around and on us that represent the ability to take sensory input and to deliver information through the change or transduction of signals into the environment. In the case of automation for our living or work spaces, intelligence (e.g., provided at least in part by automated assistant(s)) may be embodied in a very pervasive or ubiquitous sense in the environment in which we are living or working. There may be a sense that an assistant is a bit omnipresent and also non-intrusive, which may in itself create a certain paradoxical aspect of the user interface.
Home automation and assistants in our personal and living spaces may no longer reside in, control or embody a single device. There may be a design goal that collectively many devices try to present a pervasive service or presence. However, to be natural, we need to engage and trigger a normal sense of interaction and acknowledgement through interaction with such personal assistants.
It is natural that we engage such interfaces primarily with voice. In accordance with some embodiments it is envisaged that there is use of voice for both initiating an interaction (e.g., with at least one smart audio device), and also for engaging with at least one smart audio device (e.g., an assistant). In some applications, speech may be the most frictionless and high bandwidth approach to specifying more detail in a request, and/or providing ongoing interaction and acknowledgement.
However, the process of human communication, while anchored on language, is actually built on the first stages of signaling for and acknowledging attention. We typically do not issue commands or voice information without first having some sense that the recipient is available, ready and interested. The ways we can command attention are numerous, though at present in current system design and user interface, the way a system shows a response of attentiveness is more mirrored in the computing single interface text space than it is in interaction efficiency and naturalness. With most systems involving simple visual indicators (lights) primarily at the point of the device being the nearest microphone or user console, this is not well suited to foreseeable future living environments with more pervasive system integration and ambient computing.
Signaling and attention expression are key parts of the transaction where a user indicates a desire to interact with at least one smart audio device (e.g., virtual assistant), and each device shows awareness of the user and initial and ongoing attention towards comprehension and support. In conventional designs there are several jarring aspects of interactions where an assistant is seen as more of a discrete device interface. These aspects include:
Accordingly, we envision that interaction between a user and one or more smart audio devices will typically start with a call (originated by the user) to attention (e.g., a wakeword uttered by the user), and continue with at least one indication (or signal or expression) of “attentiveness” from the smart audio device(s), or from devices associated with the smart audio devices. We also envision that in some embodiments, at least one smart audio device (e.g., a suggestive assistant) may be constantly listening for sound signals (e.g., of a type indicating activity by a user), or may be continuously sensitive to other activity (not necessarily sound signals), and that the smart audio device will enter a state or operating mode in which it awaits a command (e.g., a voice command) from a user upon detecting sound (or activity) of a predetermined type. Upon entering this latter state or operating mode, each such device expresses attentiveness (e.g., in any of the ways described herein).
It is known to configure a smart audio device in a discrete physical zone to detect a user (who has uttered a wakeword that has been detected by the device), and to respond to the wakeword by transmitting a visual signal and/or an auditory signal which can be seen or heard by a user in the zone. Some disclosed embodiments implement a departure from this known approach by configuring one or more smart audio devices (of a system) to consider a user's position as uncertain (within some volume, or area, of uncertainty), and by using all available smart audio devices within the volume (or area) of uncertainty to provide a spatially-varying expression of “attentiveness” of the system through one or more (e.g., all) states or operating modes of the devices. In some embodiments, the goal is not to pick the single closest device to the user and override its current setting, but to modulate behavior of all the devices according to a relevance metric, which may in some examples be based at least in part on a device's estimated proximity to the user. This gives the sense of a system which is focusing its attention on a localized area, eliminating the jarring experience of a distant device indicating that the system is listening when the user is attempting to get the attention of a closer one of the devices.
Some embodiments provide (or are configured to provide) a coordinated utilization of all the smart audio devices in an environment or in a zone of the environment, by defining and implementing the ability of each device to generate an attentiveness signal (e.g., in response to a wakeword). In some implementations, some or all of the devices may be configured to “mix in” the attentiveness signal into a current configuration (and/or to generate the attentiveness signal to be at least partially determined by the current configurations of all the devices). In some implementations, each device may be configured to determine a probabilistic estimate of a distance from a location, such as the device's distance from the user's position. Some such implementations may provide a cohesive, orchestrated expression of the system's behavior in a way that is perceptually relevant to the user.
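A probabilistic estimate of a device's distance from the user may, for instance, be formed by averaging over several candidate user positions. The sketch below assumes that the uncertainty in the user's position is represented by a list of position samples with optional weights; this representation and the function name `expected_distance` are assumptions, not features of the disclosure.

```python
import math


def expected_distance(device_position, position_samples, weights=None):
    """Weighted mean distance from a device to a set of candidate user
    positions that together represent the uncertainty in the user's location."""
    if not position_samples:
        raise ValueError("at least one position sample is required")
    if weights is None:
        weights = [1.0] * len(position_samples)
    total = sum(weights)
    weighted = sum(w * math.dist(device_position, s)
                   for w, s in zip(weights, position_samples))
    return weighted / total


samples = [(3.0, 4.0), (6.0, 8.0)]   # hypothetical candidate user positions (meters)
uniform = expected_distance((0.0, 0.0), samples)
weighted = expected_distance((0.0, 0.0), samples, weights=[3.0, 1.0])
```

Because every device evaluates the same samples, their attentiveness signals can vary smoothly across the region of uncertainty instead of depending on a single hard location estimate.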
For a smart audio device which includes (or is coupled to) at least one speaker, the attentiveness signal may be sound emitted from at least one such speaker. Alternatively, or additionally, the attentiveness signal may be of some other type (e.g., light). In some examples, the attentiveness signal may be or include two or more components (e.g., emitted sound and light).
Herein, we sometimes use the phrase “attentiveness indication” or “attentiveness expression” interchangeably with the phrase “attentiveness signal.”
In a class of embodiments, a plurality of smart audio devices may be coordinated (orchestrated), and each of the devices may be configured to generate an attentiveness signal in response to a wakeword. In some implementations, a first device may provide an attentiveness signal corresponding to a second device. In some examples, the attentiveness signals corresponding to all the devices are coordinated. Aspects of some embodiments pertain to implementing smart audio devices, and/or to coordinating smart audio devices.
In accordance with some embodiments, in a system, multiple smart audio devices may respond (e.g., by emitting light signals) in coordinated fashion (e.g., to indicate a degree of attentiveness or availability) to determination by the system of a common operating point (or operating state). For example, the operating point may be a state of attentiveness, entered in response to a wakeword from a user, with all the devices having an estimate (e.g., with at least one degree of uncertainty) of the user's position, and in which the devices emit light of different colors depending on their estimated distances from the user.
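The distance-dependent coloring in the example above could be sketched as a mapping from estimated distance to hue. The choice of green for nearby devices and red for distant ones, the 5-meter range and the linear interpolation are arbitrary assumptions for illustration; the conversion uses the standard-library `colorsys` module.

```python
import colorsys


def attention_color(distance, max_distance=5.0, near_hue=1 / 3, far_hue=0.0):
    """Map an estimated distance to an (R, G, B) color with 0-255 channels.

    Hue is interpolated linearly from near_hue (green by default) at
    distance 0 to far_hue (red by default) at max_distance and beyond.
    """
    frac = min(max(distance / max_distance, 0.0), 1.0)
    hue = near_hue + (far_hue - near_hue) * frac
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return tuple(round(255 * c) for c in (r, g, b))


# Hypothetical estimated distances (meters) for three devices.
colors = {name: attention_color(d)
          for name, d in {"lamp": 0.0, "tv": 2.5, "hall_light": 10.0}.items()}
```

Devices nearer the user's estimated position would glow green, while those at or beyond the assumed maximum distance would glow red, with intermediate hues in between.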
Following on from the study of users and experiments with interactions, the inventors have recognized some particular rules or guidelines which may apply to wide area life assistants expressing attention and which underpin some disclosed embodiments. These include the following:
It is well known that some things quickly anthropomorphize, and subtle aspects of timing and continuity have a large impact. Some disclosed embodiments implement continuous control of output devices in an environment to register some sensory effect on a user, and control the devices in a way to naturally swell and return to express attention and release, while avoiding jarring hard decisions around location and binary decisions of interaction threshold.
March 3, 2026