US-20250322825-A1

Processing Continued Conversations Over Multiple Devices

Published: October 16, 2025

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

Implementations relate to facilitating continued conversations of a user with an automated assistant when the user changes locations relative to one or more devices in an ecosystem of linked assistant devices. The user initially invokes a first device and provides a request, which is processed by the first device. The first device provides a notification to one or more other devices in the ecosystem to indicate that the user is likely to issue a further assistant request. The first device processes subsequent audio data to determine whether the subsequent audio data includes a further assistant request. The one or more other notified devices process device-specific sensor data to determine whether the user is co-present with one of the other devices. If user presence is detected, an indication is provided to the first device, causing the first device to cease processing subsequent audio data. Further, the co-present device starts to process subsequent audio data.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A method, implemented by one or more processors of an assistant device in an ecosystem of linked assistant devices, the method comprising:

. The method of, further comprising:

. The method of, wherein determining whether the user is co-present with the assistant device based on determining that the sensor data corresponds to the identifier of the user comprises:

. The method of, wherein the sensor data that is processed to generate the embedding is audio data.

. The method of, further comprising:

. The method of, wherein the contextual information includes the one or more previous requests of the user while the invoked additional automated assistant was invoked.

. The method of, wherein the contextual information includes the one or more responses to the user while the invoked additional automated assistant was invoked.

. The method of, wherein the contextual information further includes the one or more previous requests of the user while the invoked additional automated assistant was invoked.

. An assistant device in an ecosystem of linked assistant devices, the assistant device comprising:

. The assistant device of, wherein one or more of the processors are further operable to execute the instructions to:

. The assistant device of, wherein in determining whether the user is co-present with the assistant device based on determining that the sensor data corresponds to the identifier of the user one or more of the processors are to:

. The assistant device of, wherein the sensor data that is processed to generate the embedding is audio data.

. The assistant device of, wherein one or more of the processors are further operable to execute the instructions to:

. The assistant device of, wherein one or more of the processors are further operable to execute the instructions to:

. The assistant device of, wherein the contextual information includes the one or more previous requests of the user while the invoked additional automated assistant was invoked.

. The assistant device of, wherein the contextual information includes the one or more responses to the user while the invoked additional automated assistant was invoked.

. The assistant device of, wherein the contextual information further includes the one or more previous requests of the user while the invoked additional automated assistant was invoked.

Detailed Description

Complete technical specification and implementation details from the patent document.

Humans can engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “chat bots,” “interactive personal assistants,” “intelligent personal assistants,” “personal voice assistants,” “conversational agents,” etc.). For example, a human (which when interacting with an automated assistant may be referred to as a “user”) may provide an explicit input (e.g., commands, queries, and/or requests) to the automated assistant that can cause the automated assistant to generate and provide responsive output, to control one or more Internet of things (IoT) devices, and/or to perform one or more other functionalities (e.g., assistant actions). This explicit input provided by the user can be, for example, spoken natural language input (i.e., spoken utterances) which may in some cases be converted into text (or other semantic representation) and then further processed, and/or typed natural language input.

In some cases, automated assistants may include automated assistant clients that are executed locally by assistant devices and that are engaged directly by users, as well as cloud-based counterpart(s) that leverage the virtually limitless resources of the cloud to help automated assistant clients respond to users' inputs. For example, an automated assistant client can provide, to the cloud-based counterpart(s), audio data of a spoken utterance of a user (or a text conversion thereof), and optionally data indicative of the user's identity (e.g., credentials). The cloud-based counterpart may perform various processing on the explicit input to return result(s) to the automated assistant client, which may then provide corresponding output to the user. In other cases, automated assistants may be executed exclusively locally by assistant devices that are engaged directly by users, which can reduce latency.

Many users may engage automated assistants in performing routine day-to-day tasks via assistant actions. For example, a user may routinely provide one or more explicit user inputs that cause an automated assistant to check the weather, check for traffic along a route to work, start a vehicle, and/or other explicit user input that causes the automated assistant to perform other assistant actions while the user is eating breakfast. As another example, a user may routinely provide one or more explicit user inputs that cause an automated assistant to play a particular playlist, track a workout, and/or other explicit user input that causes an automated assistant to perform other assistant actions in preparation for the user to go on a run. However, in some instances, multiple devices may be in the vicinity of the user and may be configured to process requests of the user. Thus, determining which device is best suited to process audio data can improve the user experience when interacting with multiple devices that are configured in an ecosystem of connected devices.

Some implementations disclosed herein relate to selecting a second device to process audio data to continue a conversation with a user that was initiated with an automated assistant executing on a first device. The user can invoke an automated assistant on a first device and the first device can process a request that is provided with the invocation (e.g., before the invocation and/or following the invocation). In addition to processing the request to determine one or more actions to be performed, the automated assistant can determine that a follow-up request is likely. In response, the automated assistant can provide a notification, to one or more other devices that are linked to the first device via an ecosystem of linked devices, that a follow-up request is likely. Each of the notified devices (and/or automated assistants executing at least in part on each of the notified devices) can process sensor data from sensors of the corresponding device and determine whether the user is present in the vicinity of one of the other devices. If the user is detected to be present near one of the other devices, that device can send an indication to the first device that the user has changed position and that subsequent audio data can be processed by that device. In response, the first device can cease to process subsequent audio data and the device that is nearest to the user can commence processing audio data in anticipation of a follow-up request.
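The notify, detect, and hand-off sequence described above can be illustrated with a minimal Python sketch. The `Device` class, its method names, and the boolean `user_present` flag are hypothetical stand-ins introduced only for illustration; in practice, presence would be derived from device-specific sensor data and the messages would travel over a network.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for an assistant device in a linked ecosystem."""
    name: str
    user_present: bool = False  # assumed to be derived from the device's own sensors
    listening: bool = False
    peers: list = field(default_factory=list)

    def handle_request(self, follow_up_likely: bool) -> None:
        # The invoked device keeps processing audio and notifies linked devices.
        if follow_up_likely:
            self.listening = True
            for peer in self.peers:
                peer.on_follow_up_notification(origin=self)

    def on_follow_up_notification(self, origin: "Device") -> None:
        # A notified device checks whether the user is co-present with it.
        if self.user_present:
            origin.on_presence_indication()
            self.listening = True

    def on_presence_indication(self) -> None:
        # The user has moved, so this device stops processing subsequent audio.
        self.listening = False


kitchen = Device("kitchen speaker")
bedroom = Device("bedroom speaker", user_present=True)
kitchen.peers.append(bedroom)
kitchen.handle_request(follow_up_likely=True)
print(kitchen.listening, bedroom.listening)  # False True
```

If no notified device reports presence, the first device simply remains the one that is listening for the follow-up request.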

As an example, a user can invoke a first smart device (e.g., “OK Smart Speaker” to invoke a smart speaker) and issue a query of “What is the weather like here today?” The automated assistant executing on the smart speaker can determine and provide a response (“It is going to be [...] and sunny today”) and further determine that the user may follow up the query with another related query (e.g., “What about in Miami”). In response, the automated assistant executing on the smart speaker (or another application that is executing on the smart speaker) can provide a notification to one or more other linked devices that are part of an ecosystem of linked devices that includes the smart speaker (e.g., a second smart speaker in another room, a smart television, a smartphone in another location). The notification can indicate that the user may be issuing a follow-up query that may be captured by a microphone of one of the other devices. Each of the other devices can then process sensor data from sensors of the corresponding device (i.e., each automated assistant processes the sensor data from the sensors of the device that is executing the automated assistant) to determine whether the user has moved and is now present near the device. For example, a smartphone can utilize accelerometer data to determine whether the user has picked up and/or moved the smartphone, a device with a camera can determine whether the user has entered a field of view of the camera, etc. Once user presence near one of the other devices is determined, that device can provide an indication to the first device (i.e., the device that processed the first request) to indicate that it will handle subsequent follow-up requests. In response, the first device can cease to process subsequent audio data and the device nearest to the user can commence processing audio data to capture any follow-up request that is provided by the user.

In some implementations, processing of audio data can be delayed such that an automated assistant does not start processing subsequent audio data immediately after performing an action but instead can delay until a later time. For example, a user can request that a song be played via one or more speakers of devices and audio data can be processed when the song ends and/or towards the end of the song. In a similar manner as previously described, the presence of the user can be determined based on sensor data from sensors of the devices of the ecosystem to determine whether the user has moved while the song plays. If the user is detected to be near a device other than the device that first received the invocation, the automated assistant of the first device can cease processing audio data and a device that is nearer to the user when the song is ending can start to process audio data.
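As a rough illustration of delayed processing, the following Python sketch decides whether a device should begin processing audio based on playback position. The function name and the `lead_in_s` window are assumptions made for the example; the description above only states that processing can begin when the song ends and/or towards its end.

```python
def should_start_listening(elapsed_s: float, song_length_s: float,
                           lead_in_s: float = 10.0) -> bool:
    """Return True once playback is within lead_in_s seconds of the song's end.

    lead_in_s is a hypothetical tuning parameter, not a value from the source.
    """
    return elapsed_s >= song_length_s - lead_in_s


print(should_start_listening(60.0, 180.0))   # False
print(should_start_listening(172.0, 180.0))  # True
```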

In some implementations, the same automated assistant can be executing on multiple devices. For example, an automated assistant can be executing, in part, on a first device, a second device, and on a cloud-based device such that the cloud-based device performs some or all of the processing of the audio data. In those instances, all of the devices may have access to the context of the conversation that was initiated with the first device and thus can be utilized in processing of subsequent requests. However, in some instances, the device that is processing subsequent audio data may require the context of the previous conversation in order to process a follow-up request. For example, a first automated assistant, executing on a first device, can process the initial request, determine that a follow-up request is likely to occur, and send notifications to other linked devices to process sensor data to determine if the user has changed locations. If one of the devices provides a notification indicating presence of the user near that device, the first device can provide context for the prior request(s) such that the second device can resolve any contextual ambiguities in the follow-up request(s). For example, a user may invoke an automated assistant on a first device with “OK, Speaker” and continue with the query “what's the weather today.” This query can be followed up with “How about tomorrow,” which requires context from the previous query in order to resolve the intent of the request. Thus, the first device that is executing the first automated assistant can provide the second automated assistant, executing on a second device, with the context of the previous conversation such that the second automated assistant can resolve the intent of “how about tomorrow.”
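The role of shared context in resolving a follow-up such as “How about tomorrow” can be sketched in Python as follows. The `ConversationContext` structure, the intent and slot names, and the toy “how about” rule are all hypothetical; a real assistant would rely on its natural language processing components rather than string matching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationContext:
    """Hypothetical context that the first device hands to the second device."""
    intent: str
    slots: dict


def resolve_follow_up(utterance: str,
                      context: Optional[ConversationContext]) -> Optional[dict]:
    """Fill in the missing intent of an elliptical follow-up from prior context."""
    text = utterance.lower().strip(" ?.!")
    prefix = "how about "
    if context is None or not text.startswith(prefix):
        return None  # cannot be resolved without the earlier request
    slots = dict(context.slots)
    slots["date"] = text[len(prefix):]  # toy rule: the remainder replaces the date
    return {"intent": context.intent, "slots": slots}


context = ConversationContext(intent="get_weather", slots={"date": "today"})
print(resolve_follow_up("How about tomorrow", context))
print(resolve_follow_up("How about tomorrow", None))  # None
```

Without the transferred context, the second device has no intent to attach “tomorrow” to, which is why the sketch returns `None` in that case.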

In some implementations, the first automated assistant can determine a user profile that is associated with the user that invoked the first automated assistant. For example, the first automated assistant can utilize text dependent (TD) and/or text independent (TI) speaker verification to identify a user profile that is associated with the speaker. In some implementations, the first automated assistant can provide, along with the notification, an indication of the speaker of the first request (e.g., an indication of a user profile that is associated with the speaker, a vector embedding of the speaker) such that the one or more other devices can determine whether the same speaker is co-present with one or more of the other devices.

For example, one of the other devices can determine that a person is co-present with the device utilizing sensor data from one or more sensors of the device (e.g., accelerometer data) and can process some audio data and/or visual data to determine whether the co-present person is the user utilizing, for example, TI by comparing captured audio data to a vector representing the user speaking one or more phrases. Thus, in some implementations, the other devices can determine whether the same speaker as the user that uttered the initial request (and invocation) is co-present and only further process audio data to identify a follow-up request if it determines that the user is co-present with the device.

In some implementations, a user may utter an invocation phrase that invokes multiple automated assistants, either executing on the same device or executing on separate devices. For example, a user may utter an invocation phrase of “OK Assistant A and Assistant B, how tall is Barack Obama?,” which can invoke both a first automated assistant (i.e., “OK Assistant A”) and a second automated assistant (i.e., “OK Assistant B”). In some implementations, both automated assistants can process the audio data and determine that a follow-up query is likely. Further, each of the automated assistants can determine a likelihood of whether the follow-up query will be directed to it or to the other automated assistant. For example, the first automated assistant may be configured to handle queries in audio data whereas the second automated assistant may be configured to handle playback of media. If an invoked automated assistant determines that a follow-up query is not likely to be directed to it (or is more likely to be directed to another automated assistant), it can cease to process subsequent audio data and instead allow the more likely automated assistant to handle the follow-up query.
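One way to picture this yielding behavior is the Python sketch below, which keeps only the assistant(s) with the highest estimated likelihood of receiving the follow-up. The function name and the numeric scores are assumptions for illustration; the source does not specify how the likelihood is computed or compared.

```python
def assistants_to_keep_listening(follow_up_scores: dict) -> set:
    """Return the assistants that should keep processing audio.

    follow_up_scores maps a (hypothetical) assistant name to its estimated
    likelihood of being the target of the follow-up query. Assistants that
    are less likely than another invoked assistant yield.
    """
    if not follow_up_scores:
        return set()
    best = max(follow_up_scores.values())
    return {name for name, score in follow_up_scores.items() if score == best}


print(assistants_to_keep_listening({"Assistant A": 0.8, "Assistant B": 0.2}))
# {'Assistant A'}
```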

As another example, the user can invoke automated assistants executing on different devices by indicating, in the invocation, the devices that are executing the separate automated assistants. For example, the user may utter “OK Kitchen Devices, show me a weather map of Miami.” In response, automated assistants that are executing on multiple devices that are “Kitchen Devices,” (e.g., a smart speaker and a smart screen with a graphical interface) may be invoked and start to process the request. The smart speaker may determine that, if there is a follow-up query, the follow-up query may likely require a graphical interface. In response, the smart speaker can cease to process subsequent audio data and instead allow the smart screen that includes a graphical interface to handle subsequent follow-up queries (e.g., “Now show me Los Angeles”).

The above description is provided as an overview of only some implementations disclosed herein. Those implementations, and other implementations, are described in additional detail herein.

It should be appreciated that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the subject matter disclosed herein.

Turning now to the figures, an example environment in which techniques disclosed herein may be implemented is illustrated. The example environment includes a plurality of assistant input devices and one or more cloud-based automated assistant components. One or more (e.g., all) of the assistant input devices can execute a respective instance of a respective automated assistant client. However, in some implementations one or more of the assistant input devices can optionally lack an instance of the respective automated assistant client, and still include engine(s) and hardware components for receiving and processing user input directed to an automated assistant (e.g., microphone(s), speaker(s), speech recognition engine(s), natural language processing engine(s), speech synthesis engine(s), and so on). An instance of the automated assistant client can be an application that is separate from an operating system of the respective assistant input devices (e.g., installed “on top” of the operating system), or can alternatively be implemented directly by the operating system of the respective assistant input devices. As described further below, each instance of the automated assistant client can optionally interact with one or more cloud-based automated assistant components in responding to various requests provided by respective user interface components of any one of the respective assistant input devices. Further, and as also described below, other engine(s) of the assistant input devices can optionally interact with one or more of the cloud-based automated assistant components.

One or more of the cloud-based automated assistant components can be implemented on one or more computing systems (e.g., server(s) collectively referred to as a “cloud” or a “remote” computing system) that are communicatively coupled to respective assistant input devices via one or more local area networks (“LANs,” including Wi-Fi LANs, Bluetooth networks, near-field communication networks, mesh networks, etc.), wide area networks (“WANs,” including the Internet, etc.), and/or other networks. The communicative coupling of the cloud-based automated assistant components with the assistant input devices is indicated generally by the network(s) in the figures. Also, in some implementations, the assistant input devices may be communicatively coupled with each other via one or more networks (e.g., LANs and/or WANs).

An instance of an automated assistant client, by way of its interactions with one or more of the cloud-based automated assistant components, may form what appears to be, from a user's perspective, a logical instance of an automated assistant with which the user may engage in a human-to-computer dialog. For example, a first automated assistant can be encompassed by a first automated assistant client of a first assistant input device and one or more cloud-based automated assistant components. A second automated assistant can be encompassed by a second automated assistant client of a second assistant input device and one or more cloud-based automated assistant components. The first automated assistant and the second automated assistant may also be referred to herein simply as “the automated assistant”. It thus should be understood that each user that engages with an automated assistant client executing on one or more of the assistant input devices may, in effect, engage with his or her own logical instance of an automated assistant (or a logical instance of automated assistant that is shared amongst a household or other group of users and/or shared amongst multiple automated assistant clients). Although only a plurality of assistant input devices are illustrated, it is understood that cloud-based automated assistant component(s) can additionally serve many additional groups of assistant input devices. Moreover, although various engines of the cloud-based automated assistant components are described herein as being implemented separate from the automated assistant clients (e.g., at server(s)), it should be understood that it is for the sake of example and is not meant to be limiting. For instance, one or more (e.g., all) of the engines described with respect to the cloud-based automated assistant components can be implemented locally by one or more of the assistant input devices.

The assistant input devices may include, for example, one or more of: a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a computing device of a vehicle of the user (e.g., an in-vehicle communications system, an in-vehicle entertainment system, an in-vehicle navigation system), an interactive standalone speaker (e.g., with or without a display), a smart appliance such as a smart television or smart washer/dryer, a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device), and/or any IoT device capable of receiving user input directed to the automated assistant. Additional and/or alternative assistant input devices may be provided. In some implementations, the plurality of assistant input devices can be associated with each other in various ways in order to facilitate performance of techniques described herein. For example, in some implementations, the plurality of assistant input devices may be associated with each other by virtue of being communicatively coupled via one or more networks (e.g., via the network(s)). This may be the case, for instance, where the plurality of assistant input devices are deployed across a particular area or environment, such as a home, a building, and so forth. Additionally, or alternatively, in some implementations, the plurality of assistant input devices may be associated with each other by virtue of them being members of a coordinated ecosystem that are at least selectively accessible by one or more users (e.g., an individual, a family, employees of an organization, other predefined groups, etc.). In some of those implementations, the ecosystem of the plurality of assistant input devices can be manually and/or automatically associated with each other in a device topology representation of the ecosystem.

In various implementations, one or more of the assistant input devices may include one or more respective sensors that are configured to provide, with approval from corresponding user(s), sensor data indicative of one or more environmental conditions present in the environment of the device. In some of those implementations, the automated assistant can identify one or more of the assistant input devices to satisfy a spoken utterance from a user that is associated with the ecosystem. The spoken utterance can be satisfied by rendering responsive content (e.g., audibly and/or visually) at one or more of the assistant input devices, by causing one or more of the assistant input devices to be controlled based on the spoken utterance, and/or by causing one or more of the assistant input devices to perform any other action to satisfy the spoken utterance.

The respective sensors may come in various forms. Some assistant input devices may be equipped with one or more digital cameras that are configured to capture and provide signal(s) indicative of movement detected in their fields of view. Additionally, or alternatively, some assistant input devices may be equipped with other types of light-based sensors, such as passive infrared (“PIR”) sensors that measure infrared (“IR”) light radiating from objects within their fields of view. Additionally, or alternatively, some assistant input devices may be equipped with sensors that detect acoustic (or pressure) waves, such as one or more microphones.

Additionally, or alternatively, in some implementations, the sensors may be configured to detect other phenomena associated with the environment that includes at least a part of the ecosystem. For example, in some embodiments, a given one of the assistant devices may be equipped with a sensor that detects various types of wireless signals (e.g., waves such as radio, ultrasonic, electromagnetic, etc.) emitted by, for instance, other assistant devices carried/operated by a particular user (e.g., a mobile device, a wearable computing device, etc.) and/or other assistant devices in the ecosystem. For example, some of the assistant devices may be configured to emit waves that are imperceptible to humans, such as ultrasonic waves or infrared waves, that may be detected by one or more of the assistant input devices (e.g., via ultrasonic/infrared receivers such as ultrasonic-capable microphones). Also, for example, in some embodiments, a given one of the assistant devices may be equipped with a sensor to detect movement of the device (e.g., accelerometer), temperature in the vicinity of the device, and/or other environmental conditions that can be detected near the device (e.g., a heart monitor that can detect the current heart rate of the user).

Additionally, or alternatively, various assistant devices may emit other types of human-imperceptible waves, such as radio waves (e.g., Wi-Fi, Bluetooth, cellular, etc.) that may be detected by other assistant devices carried/operated by a particular user (e.g., a mobile device, a wearable computing device, etc.) and used to determine an operating user's particular location. In some implementations, GPS and/or Wi-Fi triangulation may be used to detect a person's location, e.g., based on GPS and/or Wi-Fi signals to/from the assistant device. In other implementations, other wireless signal characteristics, such as time-of-flight, signal strength, etc., may be used by various assistant devices, alone or collectively, to determine a particular person's location based on signals emitted by the other assistant devices carried/operated by the particular user.
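As a simple illustration of using signal strength to estimate which device a user is nearest to, the Python sketch below picks the device reporting the strongest received signal from a device the user carries. The device names and RSSI readings are made up for the example, and treating the strongest reading as “nearest” is a simplifying assumption that ignores walls, interference, and antenna differences.

```python
def nearest_device(rssi_dbm_by_device: dict) -> str:
    """Return the device reporting the strongest signal.

    Values are received signal strengths in dBm, where a higher (less
    negative) value indicates a stronger signal.
    """
    if not rssi_dbm_by_device:
        raise ValueError("no signal readings available")
    return max(rssi_dbm_by_device, key=rssi_dbm_by_device.get)


readings = {"kitchen speaker": -71.0, "bedroom speaker": -48.0, "smart tv": -63.0}
print(nearest_device(readings))  # bedroom speaker
```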

Additionally, or alternatively, in some implementations, one or more of the assistant input devices may perform speaker recognition to recognize a user from their voice. For example, some instances of the automated assistant may be configured to match a voice to a user's profile, e.g., for purposes of providing/restricting access to various resources. Various techniques for user identification and/or authorization for automated assistants have been utilized. For example, in identifying a user, some automated assistants utilize text-dependent (TD) techniques that are constrained to invocation phrase(s) for the assistant (e.g., “OK Assistant” and/or “Hey Assistant”). With such techniques, an enrollment procedure is performed in which the user is explicitly prompted to provide one or more instances of a spoken utterance of the invocation phrase(s) to which the TD features are constrained. Speaker features (e.g., a speaker embedding) for a user can then be generated through processing of the instances of audio data, where each of the instances captures a respective one of the spoken utterances. For example, the speaker features can be generated by processing each of the instances of audio data using a TD machine learning model to generate a corresponding speaker embedding for each of the utterances. The speaker features can then be generated as a function of the speaker embeddings, and stored (e.g., on device) for use in TD techniques. For example, the speaker features can be a cumulative speaker embedding that is a function of (e.g., an average of) the speaker embeddings. Text-independent (TI) techniques have also been proposed for utilization in addition to or instead of TD techniques. TI features are not constrained to a subset of phrase(s) as in TD. Like TD, TI can also utilize speaker features for a user and can generate those based on user utterances obtained through an enrollment procedure and/or other spoken interactions, although many more instances of user utterances may be required for generating useful TI speaker features.
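The “cumulative speaker embedding” mentioned above, taken as an average of per-utterance embeddings, can be sketched in a few lines of Python. The function name and the plain-list representation are assumptions for illustration; the embeddings themselves would come from a speaker recognition model that is not shown here.

```python
def cumulative_embedding(embeddings: list) -> list:
    """Element-wise mean of per-utterance speaker embeddings."""
    if not embeddings:
        raise ValueError("at least one enrollment embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimensionality")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


# Two toy enrollment embeddings standing in for model outputs.
print(cumulative_embedding([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```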

After the speaker features are generated, the speaker features can be used in identifying the user that spoke a spoken utterance. For example, when another spoken utterance is spoken by the user, audio data that captures the spoken utterance can be processed to generate utterance features, those utterance features compared to the speaker features, and, based on the comparison, a profile can be identified that is associated with the speaker features. As one particular example, the audio data can be processed, using the speaker recognition model, to generate an utterance embedding, and that utterance embedding compared with the previously generated speaker embedding for the user in identifying a profile of the user. For instance, if a distance metric between the generated utterance embedding and the speaker embedding for the user satisfies a threshold, the user can be identified as the user that spoke the spoken utterance.
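The comparison of an utterance embedding against stored speaker embeddings can be sketched as follows. Cosine distance and the 0.3 threshold are assumptions chosen for the example; the source only says that a distance metric must satisfy a threshold, without naming the metric or its value.

```python
import math
from typing import Optional


def cosine_distance(a: list, b: list) -> float:
    """1 minus cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identify_profile(utterance_embedding: list, enrolled: dict,
                     threshold: float = 0.3) -> Optional[str]:
    """Return the closest enrolled profile, or None if none satisfies the threshold."""
    best_name, best_dist = None, float("inf")
    for name, speaker_embedding in enrolled.items():
        dist = cosine_distance(utterance_embedding, speaker_embedding)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


# Hypothetical profiles with toy three-dimensional embeddings.
enrolled = {"profile_a": [1.0, 0.0, 0.0], "profile_b": [0.0, 1.0, 0.0]}
print(identify_profile([0.9, 0.1, 0.0], enrolled))  # profile_a
print(identify_profile([0.0, 0.0, 1.0], enrolled))  # None
```

Returning `None` when no profile is close enough mirrors the case where a notified device detects a person but cannot confirm that the person is the user who made the initial request.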

Each of the assistant input devices further includes respective user interface component(s), which can each include one or more user interface input devices (e.g., microphone, touchscreen, keyboard, and/or other input devices) and/or one or more user interface output devices (e.g., display, speaker, projector, and/or other output devices). As one example, user interface components of one assistant input device can include only speaker(s) and microphone(s), whereas user interface components of another assistant input device can include speaker(s), a touchscreen, and microphone(s).

Each of the assistant input devices and/or any other computing device(s) operating one or more of the cloud-based automated assistant components may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network. The operations performed by one or more of the assistant input devices and/or by the automated assistant may be distributed across multiple computer systems. The automated assistant may be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network (e.g., the network(s)).

As noted above, in various implementations, each of the assistant input devices may operate a respective automated assistant client. In various embodiments, each automated assistant client may include a respective speech capture/text-to-speech (TTS)/speech-to-text (STT) module (also referred to herein simply as “speech capture/TTS/STT module”). In other implementations, one or more aspects of the respective speech capture/TTS/STT module may be implemented separately from the respective automated assistant client (e.g., by one or more of the cloud-based automated assistant components).

Each respective speech capture/TTS/STT module may be configured to perform one or more functions including, for example: capture a user's speech (speech capture, e.g., via respective microphone(s)); convert that captured audio to text and/or to other representations or embeddings (STT) using speech recognition model(s) stored in a database; and/or convert text to speech (TTS) using speech synthesis model(s) stored in a database. Instance(s) of these model(s) may be stored locally at each of the respective assistant input devices and/or accessible by the assistant input devices (e.g., over the network(s)). In some implementations, because one or more of the assistant input devices may be relatively constrained in terms of computing resources (e.g., processor cycles, memory, battery, etc.), the respective speech capture/TTS/STT module that is local to each of the assistant input devices may be configured to convert a finite number of different spoken phrases to text (or to other forms, such as lower dimensionality embeddings) using the speech recognition model(s). Other speech input may be sent to one or more of the cloud-based automated assistant components, which may include a cloud-based TTS module and/or a cloud-based STT module.

The cloud-based STT module may be configured to leverage the virtually limitless resources of the cloud to convert audio data captured by the speech capture/TTS/STT module into text (which may then be provided to a natural language processing (NLP) module) using speech recognition model(s). The cloud-based TTS module may be configured to leverage the virtually limitless resources of the cloud to convert textual data (e.g., text formulated by the automated assistant) into computer-generated speech output using speech synthesis model(s). In some implementations, the cloud-based TTS module may provide the computer-generated speech output to one or more of the assistant devices to be output directly, e.g., using respective speaker(s) of the respective assistant devices. In other implementations, textual data (e.g., a client device notification included in a command) generated by the automated assistant using the cloud-based TTS module may be provided to the speech capture/TTS/STT module of the respective assistant devices, which may then locally convert the textual data into computer-generated speech using the speech synthesis model(s), and cause the computer-generated speech to be rendered via local speaker(s) of the respective assistant devices.

The NLP module processes natural language input generated by users via the assistant input devices and may generate annotated output for use by one or more other components of the automated assistant and/or the assistant input devices. For example, the NLP module may process natural language free-form input that is generated by a user via one or more respective user interface input devices of the assistant input devices. The annotated output generated based on processing the natural language free-form input may include one or more annotations of the natural language input and optionally one or more (e.g., all) of the terms of the natural language input.

In some implementations, the NLP module is configured to identify and annotate various types of grammatical information in natural language input. For example, the NLP module may include a part of speech tagger configured to annotate terms with their grammatical roles. In some implementations, the NLP module may additionally and/or alternatively include an entity tagger (not depicted) configured to annotate entity references in one or more segments such as references to people (including, for instance, literary characters, celebrities, public figures, etc.), organizations, locations (real and imaginary), and so forth. In some implementations, data about entities may be stored in one or more databases, such as in a knowledge graph (not depicted). In some implementations, the knowledge graph may include nodes that represent known entities (and in some cases, entity attributes), as well as edges that connect the nodes and represent relationships between the entities.
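The node-and-edge structure described for the knowledge graph can be sketched as follows. This is a minimal in-memory illustration under assumed names (`KnowledgeGraph`, `add_entity`, `relate`, `related`); the specification does not describe a storage format or API.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Nodes are entities with optional attributes; edges are labeled relationships."""

    def __init__(self):
        self.attributes = {}            # entity name -> attribute dict
        self.edges = defaultdict(list)  # entity name -> [(relation, target), ...]

    def add_entity(self, name, **attributes):
        self.attributes[name] = attributes

    def relate(self, source, relation, target):
        self.edges[source].append((relation, target))

    def related(self, source, relation):
        """Return every entity reachable from `source` over `relation` edges."""
        return [t for r, t in self.edges.get(source, []) if r == relation]


# Hypothetical entries used only to exercise the structure.
graph = KnowledgeGraph()
graph.add_entity("Miami", kind="location")
graph.add_entity("Florida", kind="location")
graph.relate("Miami", "located_in", "Florida")
```

An entity tagger could consult such a structure to decide whether a phrase in the input names a known entity and, if so, of what kind.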

The entity tagger of the NLP module may annotate references to an entity at a high level of granularity (e.g., to enable identification of all references to an entity class such as people) and/or a lower level of granularity (e.g., to enable identification of all references to a particular entity such as a particular person). The entity tagger may rely on content of the natural language input to resolve a particular entity and/or may optionally communicate with a knowledge graph or other entity database to resolve a particular entity.

In some implementations, the NLP module may additionally and/or alternatively include a coreference resolver (not depicted) configured to group, or “cluster,” references to the same entity based on one or more contextual cues. For example, the coreference resolver may be utilized to resolve the term “it” to “front door lock” in the natural language input “lock it”, based on “front door lock” being mentioned in a client device notification rendered immediately prior to receiving the natural language input “lock it”.
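The “lock it” example can be sketched with a deliberately simple rule: replace “it” with the known entity mentioned last in the preceding notification. The entity list and the `resolve_it` function are hypothetical, and a real coreference resolver would rely on richer contextual cues than string matching.

```python
import re

# Hypothetical entities the resolver knows how to recognize in a notification.
KNOWN_ENTITIES = ["front door lock", "kitchen light"]


def resolve_it(user_input: str, prior_notification: str) -> str:
    """Replace the pronoun "it" with the entity mentioned last in the notification."""
    context = prior_notification.lower()
    mentioned = [entity for entity in KNOWN_ENTITIES if entity in context]
    if not mentioned:
        return user_input  # nothing to resolve against; leave the input unchanged
    # Prefer the entity whose last mention is closest to the end of the notification.
    antecedent = max(mentioned, key=context.rfind)
    return re.sub(r"\bit\b", "the " + antecedent, user_input, flags=re.IGNORECASE)
```

For instance, calling `resolve_it("lock it", "The front door lock is unlocked")` rewrites the input so that a downstream component sees the device name rather than the pronoun.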

In some implementations, one or more components of the NLP module may rely on annotations from one or more other components of the NLP module. For example, in some implementations the named entity tagger may rely on annotations from the coreference resolver and/or dependency parser in annotating all mentions of a particular entity. Also, for example, in some implementations the coreference resolver may rely on annotations from the dependency parser in clustering references to the same entity. In some implementations, in processing a particular natural language input, one or more components of the NLP module may use related data outside of the particular natural language input to determine one or more annotations, such as an assistant input device notification rendered immediately prior to receiving the natural language input on which the assistant input device notification is based.

In some implementations, one or more of the assistant input devices can include a second automated assistant that includes one or more components that share characteristics with components that are described herein with respect to the automated assistant client and/or the cloud-based components. For example, in addition to or in lieu of including the automated assistant client, one or more assistant input devices can include a second automated assistant that can include a speech capture component, TTS, STT, NLP, and/or one or more fulfillment engines for processing requests that are received from the user. In some implementations, the second automated assistant can include one or more cloud-based components that are distinct from the cloud-based components. For example, the second automated assistant can be a standalone automated assistant with capabilities to process audio data, recognize one or more wakewords and/or invocation phrases, process additional audio data to identify a request that is included in the audio data, and cause one or more actions to be performed in response to the request.

An illustration of an ecosystem of connected devices is provided for example purposes. In some implementations, an ecosystem can include additional components, fewer components, and/or different components than those illustrated. However, for purposes of the examples described further herein, the ecosystem includes a kitchen speaker, a bedroom speaker, and a living room speaker, each including one or more components of an assistant input device. For example, the kitchen speaker can be executing an automated assistant client that further includes one or more of the components of the illustrated automated assistant client (and further, one or more sensors and/or user interface components). The ecosystem can include devices that are connected to one another and that are in the environment of the user but in different locations. For example, a user may have a kitchen speaker sitting on a counter in the kitchen, a bedroom speaker in a bedroom, and a living room speaker in a living room. Thus, for explanation purposes only, speakers of the three devices can be heard by the user (and thus can be used to respond to requests) and/or microphones of the three devices can capture audio data of a user uttering an invocation phrase and/or request while the user is near (i.e., co-present with) the device.

Further, for exemplary purposes only, examples described herein will assume that the kitchen speaker and the bedroom speaker share one or more cloud-based assistant components that perform one or more actions. However, it is to be understood that one or more of the assistant input devices can include one or more of the components illustrated as being components of the cloud-based assistant, and the corresponding operations can be performed by a device of the ecosystem that includes those components. For example, a living room speaker can include one or more of the components of the cloud-based assistant components and can, in some configurations, receive and process requests in the same manner as described herein with regards to the kitchen speaker and the bedroom speaker, which share cloud-based automated assistant components (i.e., the living room speaker may be a device that does not include an automated assistant client but instead only includes a second automated assistant). Further, one or more other components, such as information related to the context of conversations between the automated assistant and the user, may be accessible to one or more of the assistant input devices. Thus, for a conversation that is started with the automated assistant client of the kitchen speaker, the context may be available to the bedroom speaker but not to the living room speaker.

In some implementations, the automated assistant executing on one of the devices of the ecosystem of devices can be invoked by the user performing an action (e.g., touching the device, performing a gesture that is captured by a camera of the device) and/or uttering an invocation phrase that indicates that the user has interest in the automated assistant performing one or more actions. For example, the user can utter “OK Kitchen Assistant,” and the automated assistant of the kitchen speaker can process audio that precedes and/or follows the invocation to determine whether a request is included in the audio data. Audio data that is captured by the microphones of the kitchen speaker can be processed, utilizing STT, NLP, and/or ASR, by the automated assistant client executing on the device and/or, at least in part, by one or more of the cloud-based components.

Upon determining that the audio data includes a request, the action processing engine (shown as a component of the cloud-based automated assistant components, but which can additionally or alternatively be a component of the automated assistant client) can determine one or more actions to perform and cause performance of the action(s). For example, the user may utter “OK Kitchen Assistant,” followed by “how tall is Barack Obama.” In response, the action processing engine can generate a response to the request (or “query”) and provide the response via the speaker(s) of the kitchen speaker.

In some implementations, the user may activate, prior to starting a conversation with one or more of the automated assistants, a continued conversation mode that allows the user to continue with subsequent requests without explicitly re-invoking the automated assistant. For example, in some implementations, after an action is initiated and/or after an action is performed, the invoked automated assistant can continue to process audio data to determine whether a follow-up request and/or query is included in the audio data. For example, a user may invoke an automated assistant with an invocation phrase of “OK Assistant, what is the weather in Miami” and the automated assistant can generate and provide a response. Subsequently, the automated assistant can continue to process audio data to determine whether the user provides a follow-up request. This can include, for example, processing the audio data locally by the automated assistant client without providing for complete processing until it is determined that the audio data includes a follow-up request, processing only a limited amount of audio data (e.g., 5 seconds after a response has been provided to the first query), and/or other limited processing that can determine whether the user has followed up the first request with an additional request. In some implementations, the automated assistant client and/or one or more components of the cloud-based automated assistant can determine whether a query is likely to be followed by a follow-up query. For example, for the initial query of “What is the weather in Miami,” the automated assistant can determine that the user is likely to issue an additional query (e.g., a related query), such as “How about in Orlando,” and process the audio data to determine whether an additional query is submitted by the user.
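The limited listening window in the example above (5 seconds after a response) can be sketched as a small state holder. The class and method names are hypothetical, and timestamps are passed in explicitly rather than read from a clock so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional

# Window length taken from the "5 seconds after a response" example in the text.
FOLLOW_UP_WINDOW_S = 5.0


@dataclass
class ContinuedConversation:
    enabled: bool                                  # continued conversation mode set by the user
    response_finished_at: Optional[float] = None   # time the last response ended

    def mark_response_finished(self, now: float) -> None:
        self.response_finished_at = now

    def should_process(self, now: float) -> bool:
        """Process audio only inside the window that follows the last response."""
        if not self.enabled or self.response_finished_at is None:
            return False
        elapsed = now - self.response_finished_at
        return 0.0 <= elapsed <= FOLLOW_UP_WINDOW_S
```

Audio arriving outside the window would require a fresh invocation, which is what keeps processing bounded.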

In some implementations, the automated assistant can process audio data as an action is being performed and/or subsequent to an action being performed in response to a request by the user. For example, the user can provide a request of “Play the song we will rock you,” and in response, the automated assistant can cause the song “We Will Rock You” to be played via the speakers of the device that is executing the automated assistant. Once the song has completed and/or as the song is completing, the automated assistant can begin to process audio data in anticipation of the user requesting a new song to be played after the current song has completed. Thus, in some implementations, computing resources can be conserved by only processing audio data when it is likely that the user will issue a new request rather than processing continuously and wasting resources when the user is less likely to issue a new request.

Once an invoked automated assistant determines that a follow-up request is likely (e.g., a “continued conversation” mode is active and a request is of a type that may be followed by another request without an intervening invocation phrase), the automated assistant can send a notification to one or more other devices, each executing an automated assistant and/or automated assistant client, that the user may provide a follow-up request. The assistant input devices include a co-presence monitor that can determine, based on sensor data generated by sensors, a camera, and/or a microphone, whether a user is co-present with the device. The co-presence monitor is illustrated as a component of the assistant input device. However, in some implementations, the co-presence monitor can be a component of the automated assistant client, the second automated assistant, the cloud-based automated assistant component(s), and/or shared between one or more of the automated assistant components.
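The decision to notify other devices can be sketched as two small functions. The set of follow-up-prone request types is a hypothetical placeholder; the specification says only that the request is “of a type that may be followed by another request” and does not enumerate such types.

```python
# Hypothetical request types assumed to invite a follow-up request.
FOLLOW_UP_PRONE_TYPES = {"weather", "music", "question"}


def follow_up_likely(continued_conversation_on: bool, request_type: str) -> bool:
    """A follow-up is treated as likely only when both conditions hold."""
    return continued_conversation_on and request_type in FOLLOW_UP_PRONE_TYPES


def devices_to_notify(invoked, ecosystem, continued_conversation_on, request_type):
    """Return every linked device other than the invoked one, or nothing at all."""
    if not follow_up_likely(continued_conversation_on, request_type):
        return []
    return [device for device in ecosystem if device != invoked]
```

With an ecosystem of `["kitchen", "bedroom", "living room"]` and the kitchen speaker invoked, the bedroom and living room speakers would be notified when both conditions hold.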

In some implementations, a notification can be provided to the one or more other devices of the ecosystem of devices when the user is detected to be moving away from the invoked device and/or when the user is currently changing positions. For example, the invoked automated assistant can process sensor data from one or more sensors of the assistant device to determine whether the user is present near the device. If the automated assistant determines, based on the sensor data, that the user is changing position to a position that is farther away from the device, the automated assistant can provide the notification to the other devices of the ecosystem of devices so that those devices can begin processing sensor data to determine whether the user is changing position to a location that is nearer to one of those other devices.
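One way to picture the "moving away" check is as a test over a short series of distance estimates. This is an assumption made for illustration: the specification does not say what form the sensor data takes, and both the distance readings and the threshold below are hypothetical.

```python
def is_moving_away(distances_m, min_total_increase_m=1.0):
    """Return True when successive distance estimates never shrink and grow enough overall.

    `distances_m` is a hypothetical series of user-to-device distance estimates,
    oldest first, derived from whatever sensors the device has.
    """
    if len(distances_m) < 2:
        return False
    never_shrinks = all(later >= earlier
                        for earlier, later in zip(distances_m, distances_m[1:]))
    total_increase = distances_m[-1] - distances_m[0]
    return never_shrinks and total_increase >= min_total_increase_m
```

Requiring a minimum total increase avoids notifying the other devices when the user merely shifts position near the invoked device.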

In some implementations, the invoked automated assistant can provide, with the notification, an indication of the user that invoked the automated assistant (and subsequently provided a request). For example, a user can invoke the kitchen speaker with “OK Kitchen Speaker,” and one or more components of the automated assistant executing on the kitchen speaker and/or one or more shared components can utilize one or more text-dependent (TD) speaker identification models to determine a vector based on the speech. The vector can be utilized to identify a user profile that is associated with the speaker by comparing the vector to one or more stored vectors for the user that were, for example, generated during a registration period by the user (e.g., the user may be prompted to say “OK Kitchen Speaker” one or more times and a profile of the user speaking the phrase can be stored with a user profile). Also, for example, the kitchen speaker can utilize one or more text-independent (TI) models to process the request provided by the user, and a TI vector can be generated that can be provided with the notification to the one or more other devices of the ecosystem. Thus, in some implementations, a user profile indicator and/or a vector representing at least some portion of an utterance of the user can be provided to, for example, the bedroom speaker and/or the living room speaker such that the other devices can determine, based on processing of audio data captured by that device, whether a co-present user is the same user as the one that uttered the initial request (and/or invocation phrase).
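Comparing a speech vector against stored enrollment vectors can be sketched as below. Cosine similarity and the 0.8 threshold are assumptions chosen for illustration; the specification states only that the vector is compared to stored vectors and does not name a metric or threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(vector, enrolled, threshold=0.8):
    """Return the best-matching profile name, or None if no match reaches the threshold.

    `enrolled` maps a hypothetical profile name to the vectors stored at registration.
    """
    best_profile, best_score = None, threshold
    for profile, stored_vectors in enrolled.items():
        for stored in stored_vectors:
            score = cosine_similarity(vector, stored)
            if score >= best_score:
                best_profile, best_score = profile, score
    return best_profile
```

A notified device could run the same comparison on audio it captures, using the vector or profile indicator received with the notification, to check that the co-present user is the one who started the conversation.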

The assistant input device further includes a speaker identification engine that can determine, based on processing audio and/or camera data, whether a user that is associated with a user profile is the same as the user that uttered an invocation phrase and/or request.

A timing diagram illustrates one or more implementations described herein. As illustrated, the kitchen speaker processes the invocation and further processes a request. In response, one or more actions are performed by the kitchen speaker. In addition to processing the request, the kitchen speaker (i.e., the automated assistant executing on the kitchen speaker) can further determine that a follow-up request is likely, which can be determined from the content of the request in addition to a “continued conversation” mode being active. Once the action is performed (or while the action is being performed), subsequent audio data can be processed by the kitchen speaker to identify a follow-up request that may be included in the subsequent audio data.

Notifications are provided, by the kitchen speaker, to the other devices of the ecosystem of linked devices. In some implementations, the notifications can include an indication that a user may become co-present with the device(s) and a request to determine if and when this occurs. In some implementations, the invocation phrase and/or the request can be processed to determine a user and/or a user profile, and an indication of the user or a user speaker profile (e.g., a vector representing the speaking voice of the user) can be provided with the notification.

In response to receiving the notifications, the bedroom speaker and the living room speaker can process sensor data utilizing a component that shares one or more characteristics with the co-presence monitor to determine whether a user is co-present with each device. The co-presence monitor can utilize, for example, sensor data from one or more sensors, microphones, and/or a camera to determine whether a user is present in proximity to the respective device. If a user is detected in proximity to a device, such as the bedroom speaker, additional processing can be performed to determine whether the co-present user is the same user that uttered the invocation phrase and/or provided the request. Thus, as an additional optional step, the user profile that was provided with the notification can be verified. As illustrated, the bedroom speaker processes its sensor data and the living room speaker processes its sensor data. Further, as illustrated, the bedroom speaker determines that a user is co-present and, optionally, verifies that the co-present user is the same as the user indicated, by the kitchen speaker, as uttering the invocation and/or initial request.

In response to determining that the user is co-present, the bedroom speaker can provide one or more indications to the kitchen speaker (and optionally, to the other device(s) of the ecosystem) to indicate that the user is now co-present with the bedroom speaker. As illustrated, upon receiving the indication, the living room speaker ceases to process sensor data. However, in some implementations, the living room speaker can continue to process sensor data to determine if the user has moved and, subsequently, whether the user is co-present with the living room speaker. For example, the user may be walking from the kitchen, to the bedroom, and then to the living room rapidly. Thus, in some implementations, co-presence of the user with the bedroom speaker may only be temporary, and the user may subsequently be co-present with the living room speaker after temporary co-presence with the bedroom speaker was determined by the bedroom speaker.

When the kitchen speaker receives an indication that the user (or a user) is co-present with the bedroom speaker, the kitchen speaker can cease to process subsequent audio data. Further, the bedroom speaker can start to process subsequent audio data. In some implementations, the bedroom speaker can start to process subsequent audio data before it sends the indication to ensure that, at all times, at least one device is processing the subsequent audio data. In some implementations, the method can continue, with the bedroom speaker sending notifications to the other devices indicating that it is processing subsequent audio data and further that a follow-up request is likely. The other devices, including the kitchen device, can process sensor data to determine whether the user has moved again and may be co-present with one of the other devices.
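The handoff sequence can be sketched as a small in-memory simulation. All class and method names are hypothetical, message passing is reduced to direct method calls, and the sketch models the variant in which the newly co-present device starts listening before the previous one stops and the remaining devices then watch for the user moving again.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    processing_audio: bool = False     # listening for a follow-up request
    monitoring_presence: bool = False  # processing sensor data for co-presence


class Ecosystem:
    def __init__(self, names):
        self.devices = {name: Device(name) for name in names}

    def listeners(self):
        return [d.name for d in self.devices.values() if d.processing_audio]

    def start_conversation(self, invoked):
        """The invoked device listens; every other device is notified to watch for the user."""
        for device in self.devices.values():
            device.processing_audio = device.name == invoked
            device.monitoring_presence = device.name != invoked

    def report_co_presence(self, detector):
        """Hand off listening to the device that detected the user."""
        new_listener = self.devices[detector]
        # Start listening first so that at least one device is always processing audio.
        new_listener.processing_audio = True
        new_listener.monitoring_presence = False
        # Only then do the other devices stop listening and resume presence monitoring.
        for device in self.devices.values():
            if device.name != detector:
                device.processing_audio = False
                device.monitoring_presence = True
```

Calling `start_conversation("kitchen")` followed by `report_co_presence("bedroom")` leaves the bedroom speaker as the only listener, with the kitchen and living room speakers monitoring for the user's next move.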

In some implementations, the kitchen device can provide context of previous requests to the device that is co-present with the user before the co-present device starts to process subsequent audio data. In some instances, the co-present device may already have the context, such as when the invoked automated assistant and the co-present automated assistant are both components of the same automated assistant that shares one or more components (e.g., both are clients of the same automated assistant, wherein the context is stored on a cloud-based component shared by both clients). However, in instances wherein the invoked automated assistant and the co-present automated assistant are different automated assistants, context of the previous request(s) may be required by the co-present automated assistant before the co-present automated assistant can process subsequent requests.
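The two cases, shared context versus separate assistants, can be sketched as follows. The `Assistant` type and `hand_off_context` function are hypothetical, and a shared Python list stands in for a cloud-based context store used by clients of the same automated assistant.

```python
from dataclasses import dataclass


@dataclass
class Assistant:
    name: str
    context_store: list  # previous requests; may be the same object for clients that share a backend


def hand_off_context(invoked: Assistant, co_present: Assistant) -> bool:
    """Copy previous requests to the co-present assistant only when it lacks them.

    Returns True if a transfer took place, False if the store was already shared.
    """
    if co_present.context_store is invoked.context_store:
        return False  # both are clients of the same assistant; nothing to send
    for request in invoked.context_store:
        if request not in co_present.context_store:
            co_present.context_store.append(request)
    return True


# Kitchen and bedroom share one store; the living room speaker runs a separate assistant.
shared_context = ["what is the weather in Miami"]
kitchen = Assistant("kitchen", shared_context)
bedroom = Assistant("bedroom", shared_context)
living_room = Assistant("living room", [])
```

With the context in place, a follow-up such as “How about in Orlando” can be interpreted by whichever device the user ends up near.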

Patent Metadata

Filing Date

Unknown

Publication Date

October 16, 2025

Inventors

Unknown
