US-12585821-B2

Voice privacy for far-field voice control devices that use remote voice services

Published: March 24, 2026

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

A system includes a first module and a second module. The first module may be configured to perform operations including generating voice data based on an input audio, anonymizing the voice data by applying a first audio transformation, and transmitting the anonymized voice data to a first remote ASR module for generating speech recognition data. The second module may be configured to perform operations including separating the input audio into a first data and a second data, anonymizing the first data by applying a second audio transformation to the first data, generating an anonymized audio data by combining the anonymized first data and the second data, and transmitting the anonymized audio data to a second remote ASR module for generating speech recognition data.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A system, comprising:

. The system of, further comprising:

. The system of, wherein:

. The system of, wherein the separating of the input audio into the first data and the second data comprises:

. The system of, wherein combining the anonymized first data and the second data comprises:

. An integrated circuit, comprising:

. The integrated circuit of, further comprising:

. The integrated circuit of, wherein:

. The integrated circuit of, wherein the separating of the input audio into the first data and the second data comprises:

. The integrated circuit of, wherein combining the anonymized first data and the second data comprises:

. A method, comprising:

. The method of, wherein the separating of the input audio into the first data and the second data comprises:

. The method of, wherein combining the anonymized first data and the second data comprises:

. A system, comprising:

. The system of, wherein separating the input audio into the first data and the second data comprises:

. The system of, further comprising:

. An integrated circuit, comprising:

. The integrated circuit of, wherein separating the input audio into the first data and the second data comprises:

. The integrated circuit of, further comprising:

. A method, comprising:

. The method of, wherein separating the input audio into the first data and the second data comprises:

. The method of, further comprising:

Detailed Description

Complete technical specification and implementation details from the patent document.

Exemplary implementations of this disclosure may generally relate to systems, integrated circuits, and methods for far-field voice processing and, more particularly, to voice privacy for far-field voice control devices that use remote voice services.

Voice control devices (e.g., smart speakers and voice assistants) often have limited processing power and provide voice data to remote (e.g., cloud) computers for processing, such as speech recognition to interpret user commands. To do so, the voice control devices can include various software development kits (SDKs) for transporting voice data to remote computers. However, using SDKs and offloading voice data to external devices may allow remote computers to perform other processing unknown to the user to gather further information about the user, such as data mining for personally identifiable information.

Voice data can be used to extract various kinds of information, including identity, gender, age, emotional state, location, and accent. In the interest of user privacy, it is desired that only data relevant to the speaker's direct intentions/commands be stored or uploaded to remote computers. Thus, private information should be removed from the voice signal before it is provided to the voice processing SDK. For example, if the intent is a speech recognition task, information on other unrelated recognition activities (e.g., identity, gender, age, emotion, and accent) should be masked/withheld before the voice signal is provided to the voice processing SDK.

Exemplary implementations include a system, including a first module and a second module. The first module may be configured to perform operations including generating voice data based on an input audio, anonymizing the voice data by applying a first audio transformation, and transmitting the anonymized voice data to a first remote automatic speech recognition (ASR) module for generating speech recognition data. The second module may be configured to perform operations including separating the input audio into a first data and a second data, anonymizing the first data by applying a second audio transformation to the first data, generating an anonymized audio data by combining the anonymized first data and the second data, and transmitting the anonymized audio data to a second remote ASR module for generating speech recognition data.

Exemplary implementations also include an integrated circuit, including a first module and a second module. The first module may be configured to perform operations including generating voice data based on an input audio, anonymizing the voice data by applying a first audio transformation, and transmitting the anonymized voice data to a first remote ASR module for generating speech recognition data. The second module may be configured to perform operations including separating the input audio into a first data and a second data, anonymizing the first data by applying a second audio transformation to the first data, generating an anonymized audio data by combining the anonymized first data and the second data, and transmitting the anonymized audio data to a second remote ASR module for generating speech recognition data.

Exemplary implementations further include a method including receiving an input audio and a reference signal and selecting one or more modules of a set of modules. In response to selecting a first module, the method includes generating voice data based on the input audio, anonymizing the voice data by applying a first audio transformation, and transmitting the anonymized voice data to a first remote ASR module for generating speech recognition data. In response to selecting a second module, the method further includes separating the input audio into a first data and a second data, anonymizing the first data by applying a second audio transformation to the first data, generating an anonymized audio data by combining the anonymized first data and the second data, and transmitting the anonymized audio data to a second remote ASR module for generating speech recognition data.
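The module selection described in the method above can be sketched as a simple dispatch. This is only an illustration under stated assumptions: the names `first_module`, `second_module`, `MODULES`, and `run_selected` are hypothetical, the subtraction of the reference signal stands in for real voice extraction, the polarity flip stands in for a real audio transformation, and the transmission to a remote ASR module is omitted.

```python
from typing import Callable, Dict, List, Sequence

Audio = List[float]


def first_module(input_audio: Audio, reference: Audio) -> Audio:
    """Hypothetical first pipeline: generate voice data, then anonymize it."""
    # Stand-in for FFV processing: subtract the known reference signal.
    voice = [x - r for x, r in zip(input_audio, reference)]
    # Stand-in transformation (polarity flip). NOT a real anonymizer.
    return [-v for v in voice]


def second_module(input_audio: Audio, reference: Audio) -> Audio:
    """Hypothetical second pipeline: separate, anonymize one part, recombine."""
    first_data = [x - r for x, r in zip(input_audio, reference)]  # voice estimate
    second_data = list(reference)                                 # everything else
    anonymized_first = [-v for v in first_data]                   # stand-in transformation
    return [a + s for a, s in zip(anonymized_first, second_data)]


# Registry of available pipelines (assumed structure, for illustration only).
MODULES: Dict[str, Callable[[Audio, Audio], Audio]] = {
    "first": first_module,
    "second": second_module,
}


def run_selected(selected: Sequence[str], input_audio: Audio, reference: Audio) -> Dict[str, Audio]:
    """Run every selected pipeline on the same input audio and reference signal."""
    return {name: MODULES[name](input_audio, reference) for name in selected}
```

In a real device each result would be handed to the ASR SDK that matches the selected pipeline rather than returned in a dictionary.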

The figures depict various implementations for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative implementations of the structures and methods illustrated herein may be employed without departing from the principles described herein.

Not all depicted components may be used in all implementations, however, and one or more implementations may include additional or different components than those shown in the figures. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional components, different components, or fewer components may be provided.

The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and can be practiced using one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.

Voice control devices (e.g., smart speakers and voice assistants) may include an SDK for transporting voice data to remote computers. However, the use of such SDKs and offloading voice data to external devices—as performed in the current approaches utilized by voice control devices—may allow remote computers to perform other processing unknown to the user to gather further information about the user, such as data mining for personally identifiable information.

A use of far-field voice (FFV) control devices may be to effectively recognize commands spoken by users far from the FFV control device and in environments having a variety of noise levels, such as environments with music or people talking in the background. To achieve this objective, an array of microphones is typically used. An FFV processing module may first process the audio data captured by the microphone array to enhance the voice content of the audio data (e.g., by removing noise) and then provide the enhanced voice to an ASR module (e.g., a cloud ASR SDK) for recognizing a command in the voice content. If the audio data is sent to a remote computer (e.g., a cloud service), privacy concerns may be raised as the audio samples may include sufficient information for secondary purposes (e.g., detection of identity, gender, age, emotional state, accent, and the like).

Therefore, aspects of the subject technology provide a single system for anonymizing audio data for multiple audio processing pipelines that may use remote computers for audio processing.

illustrates an exemplary network configuration of a voice control device, in accordance with one or more aspects of the subject technology. A voice control device may be a computer device (e.g., a set-top box, a voice assistant, and the like) for receiving audio data that may contain a command from a user. The audio data may be near- or far-field audio data, where near-field audio data may be in proximity to the voice control device (e.g., within 10 feet) and far-field audio data may be distant from the voice control device (e.g., beyond 10 feet). The command may be a process performed by the voice control device, such as searching for a query, setting a timer, playing music, and the like. The environment in which the user provides a command to the voice control device may include noise, such as music, conversations, and any other ambient sounds.

The voice control device receives audio data, which may include the command from the user and noise from the environment. The voice control device may anonymize the audio data before providing the audio data to one or more ASR SDKs. The voice control device may include multiple pipelines for anonymizing the audio data, for example, to provide cross-compatibility among different ASR platforms. The voice control device may also include pipelines for anonymizing the audio data for local ASR, where local ASR includes ASR that can be performed on the voice control device without data leaving the device.

The ASR SDKs of the voice control device may provide the anonymized audio data to a remote computer (e.g., a cloud server) for processing via a network. The remote computer may be any device that is external to the voice control device. For example, the remote computer may be a cloud server off-premises from the voice control device. The processing may include voice separation, speech recognition, command identification, and the like. The processing results may be used by the remote computer for executing the command and providing the results of the command to the voice control device. The processing results may also or instead be sent to the voice control device for executing the command and/or providing the results of the command as feedback to the user.

illustrates a block diagram of an exemplary computing system in accordance with one or more aspects of the subject technology. The computing system may be, and/or may be a part of, the voice control device, as shown in the figures. The computing system may include various types of computer-readable media and interfaces for various other types of computer-readable media. The computing system includes a bus, a processing unit, a storage device, a system memory, an input device interface, an output device interface, an FFV module, an ASR module, a voice privacy module, and/or a network interface.

The bus collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the computing system. In one or more implementations, the bus communicatively connects the processing unit with the other components of the computing system. From various memory units, the processing unit retrieves instructions to execute and data to process in order to execute the operations of the subject disclosure. The processing unit may be a controller and/or a single- or multi-core processor or processors in various implementations.

The bus also connects to the input device interface and the output device interface. The input device interface enables the system to receive inputs. For example, the input device interface allows a user to communicate information and select commands on the system. The input device interface may be used with input devices such as keyboards, mice, and other user input devices, as well as microphones (e.g., microphone arrays), cameras, and other sensor devices. The output device interface may enable, for example, a display of images generated by the computing system. Output devices that may be used with the output device interface may include, for example, printers and display devices, such as a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a flexible display, a flat panel display, a solid-state display, a projector, speakers (e.g., speaker arrays), haptic devices, or any other device for outputting information. One or more implementations may include devices that function as both input and output devices, such as a touchscreen.

The bus also couples the system to one or more networks and/or to one or more network nodes through the network interface. The network interface may include one or more interfaces that allow the system to be a part of a network of computers (such as a local area network (LAN), a wide area network (WAN), or a network of networks (the “Internet”)). Any or all components of the system may be used in conjunction with the subject disclosure.

The FFV module may include hardware and/or software for processing far-field voice data. The FFV module may include one or more algorithms (e.g., computer-readable instructions) that access audio input captured from a microphone array (e.g., via the input device interface) and separate and/or enhance the audio data from target sources (e.g., the user) for applications, such as ASR, which can use remote (e.g., cloud) voice services and/or local (e.g., on-the-edge) voice services. The FFV module may include one or more algorithms (e.g., computer-readable instructions) for acoustic echo cancelation (AEC), which may be used to remove audio output by the system that is subsequently captured back by the system. For example, the system may be playing music when a user command is received; the audio captured by a microphone of the system may include both the music and the user utterance (e.g., a command). AEC may include accessing one or more reference signals destined for integrated and/or external speakers (e.g., via the output device interface), receiving audio from microphones (e.g., via the input device interface), and removing the parts of the received audio that the system knows it output based on the reference signal.
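As an illustration of the AEC step described above, the following is a minimal sketch of a normalized least-mean-squares (NLMS) adaptive filter, one common way to estimate and subtract an echo of a known reference signal. The patent does not name a specific AEC algorithm, so NLMS is an assumption here, as are the function name `nlms_echo_cancel` and the parameters `taps`, `mu`, and `eps`.

```python
import random
from typing import List


def nlms_echo_cancel(mic: List[float], reference: List[float],
                     taps: int = 4, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Return the microphone signal with the estimated echo of `reference` removed."""
    weights = [0.0] * taps   # adaptive estimate of the speaker-to-microphone path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for m, r in zip(mic, reference):
        history = [r] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = m - echo_estimate                      # what remains after removing the echo
        norm = sum(h * h for h in history) + eps       # normalization keeps the step size stable
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


# Usage: the microphone hears only a one-sample-delayed, attenuated copy of the reference.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0] + [0.6 * r for r in reference[:-1]]
residual = nlms_echo_cancel(mic, reference)
```

In this echo-only toy case the residual shrinks toward zero as the filter adapts; with a talker present, the residual would instead retain the user utterance.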

The ASR module may include hardware and/or software for performing ASR on voice data. Performing ASR on voice data may include receiving voice data and extracting speech recognition data. Speech recognition data may include speech, words, commands, intentions, and the like. ASR may be performed via a hidden Markov model, dynamic time warping, a machine learning model (e.g., neural networks), end-to-end ASR, and the like.

The voice privacy module may include hardware and/or software for preparing voice data for input into one or more ASR SDKs. The voice privacy module may include one or more algorithms for anonymizing voice data according to one or more voice processing pipelines, which may include one or more steps where the output of one step is the input to the next. Pipelines may share resources with other pipelines and may operate concurrently. Example pipelines are described below with reference to the figures. In one or more implementations, voice processing pipelines may be separated into their own modules such that each module includes hardware and/or software for executing a voice processing pipeline.

The storage device may be a read-and-write memory device. The storage device may be a non-volatile memory unit that stores instructions and data (e.g., static and dynamic instructions and data) even when the computing system is off. In one or more implementations, a mass-storage device (such as a solid-state, magnetic, or optical disk and its corresponding disk drive) may be used as the storage device. In one or more implementations, a removable storage device (such as a floppy disk or flash drive and its corresponding disk drive) may be used as the storage device.

Like the storage device, the system memory may be a read-and-write memory device. However, unlike the storage device, the system memory may be a volatile read-and-write memory, such as random-access memory. The system memory may store any of the instructions and data that the one or more processing units may need at runtime to perform operations. In one or more implementations, the processes of the subject disclosure are stored in the system memory and/or the storage device. From these various memory units, the one or more processing units retrieve instructions to execute and data to process in order to execute the processes of one or more implementations.

Implementations within the scope of the subject technology may be partially or entirely realized using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) encoding one or more instructions. The tangible computer-readable storage medium also may be non-transitory in nature.

The computer-readable storage medium may be any storage medium that may be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing instructions. For example, without limitation, the computer-readable medium may include any volatile semiconductor memory (e.g., the system memory), such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer-readable medium also may include any non-volatile semiconductor memory (e.g., the storage device), such as ROM, PROM, EPROM, EEPROM, NVRAM, flash, SSD, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.

Further, the computer-readable storage medium may include any non-semiconductor memory, such as optical disk storage, magnetic disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium may be directly coupled to a computing device, while in other implementations, the tangible computer-readable storage medium may be indirectly coupled to a computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.

Instructions may be directly executable or may be used to develop executable instructions. For example, instructions may be realized as executable or non-executable machine code or as instructions in a high-level language that may be compiled to produce executable or non-executable machine code. Further, instructions also may be realized as or may include data. Computer-executable instructions also may be organized in any format, including routines, subroutines, programs, data structures, objects, binaries, modules, applications, applets, functions, SDKs, frameworks, and the like. As recognized by those of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions may vary significantly without varying the underlying logic, function, processing, and output. It is intended that “modules” as used herein refers not only to computer-executable instructions but also, or instead, to hardware (e.g., computer circuitry) that may carry out the processes described herein.

While the above discussion primarily refers to microprocessors or multi-core processors that execute software, one or more implementations are performed by one or more integrated circuits, such as DSPs, ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions that are stored on the circuit itself.

illustrates a schematic diagram of an approach for FFV control. The approach includes separate FFV and ASR SDK modules. In neither module is voice anonymization performed. The approach may be performed on a chip within the voice control device. Microphones of the voice control device may receive audio data from the user and/or the environment. It should be understood that the approach is not limited to a two-microphone array and may control more or fewer microphones.

The microphones pass the audio data to the FFV module. The audio data is processed by the FFV module, the output of which is provided to an ASR SDK. Processing may include voice extraction/enhancement. The output of the FFV module may include the voice data of the user extracted from the audio data. The ASR SDK may send the voice data to the remote computer (e.g., cloud server), and the remote computer may perform ASR on the voice data (e.g., to determine the command in the voice data). The result of the ASR from the remote computer may be sent to the ASR SDK for further processing and/or for providing as output speech recognition data to the rest of the voice control device.

The ASR SDK may be provided by a third party (e.g., the ASR provider). From the perspective of the voice control device, the ASR SDK is a black box that receives a form of the audio data and outputs speech recognition data, which may include commands, text, audio, and the like, from the audio data as determined by the remote computer via the ASR SDK.

illustrates a schematic diagram of another approach for FFV control. The approach includes a combined FFV and ASR SDK module (the combined module). Voice anonymization is not performed in any of the modules (e.g., the combined module, FFV module, and/or ASR SDK) prior to sending the voice data to a remote computer (e.g., the remote computer). The approach may be performed on a chip within the voice control device. Microphones of the voice control device may receive audio data from the user and/or the environment. It should be understood that the approach is not limited to a two-microphone array and may control more or fewer microphones.

The microphones pass the audio data to the combined module. Like the ASR SDK, the combined module may be provided by a third party and is treated as a black box from the perspective of the voice control device. In the approach, it is almost entirely up to the combined module how to process the audio data (e.g., how to generate the voice data from the audio data) and how it is moved between the voice control device and the remote computer. The combined module receives the audio data and outputs speech recognition data, which may include commands, text, and the like, from the audio data, as determined by the remote computer via the ASR SDK. In one or more implementations, the functions of the FFV module are performed at the remote computer.

illustrates a schematic diagram of yet another approach for FFV control. The approach includes an FFV module and an ASR module. The approach may also be performed on a chip within the voice control device. Microphones of the voice control device may receive audio data from the user and/or the environment. It should be understood that the approach is not limited to a two-microphone array and may control more or fewer microphones.

The microphones pass the audio data to the FFV module. The audio data is processed by the FFV module, the output of which is provided to the ASR module. The output of the FFV module may include the voice data of the user. The ASR module may receive the voice data for performing ASR locally on the voice control device. In one or more implementations, the FFV module and the ASR module may be a combined module, similar to the combined module. The output speech recognition data of the ASR module may include commands, text, and the like, from the voice data.

This approach has the highest order of privacy compared to the two preceding approaches, as voice samples do not leave the voice control device (e.g., to the remote computer). The drawback of this approach is that the voice control device is limited to its own computational resources, which may negatively impact the performance of the ASR.

illustrates a schematic diagram of a voice anonymization system for FFV control, in accordance with one or more aspects of the subject technology. The voice anonymization system prepares audio data for ASR processing and allows for a voice control device to have compatibility across multiple ASR platforms while maintaining the privacy of the user. The system may be a part of a voice control device and may include an FFV module and a voice privacy module. In one or more implementations, the system may also include one or more of the microphones. For example, the microphones may be connected to or integrated with the system. The system may be a single integrated circuit on the voice control device. Microphones of the voice anonymization system may receive audio data from the user, the environment, and/or a speaker (e.g., integrated with or connected to the voice control device). It should be understood that the system is not limited to a two-microphone array as shown and may control more or fewer microphones.

The microphones pass the audio data to the FFV module. The audio data is processed by the FFV module, the output of which is provided to the voice privacy module. The microphones may also or instead pass the audio data to the voice privacy module for subsequent modules that may incorporate FFV.

At the FFV module, the audio data may be processed and provided to the voice privacy module. Processing may include voice extraction/enhancement. The processing may also include acoustic echo cancelation, which removes the reference signal (e.g., audio output by the voice control device) from the audio data. The reference signal may be a signal received by the output device interface for outputting via a speaker. The output device interface may be either external to or part of the system itself. The output of the FFV module may be the voice data of the user. In one or more implementations, the output of the FFV module may also be output to the ASR module (inside or outside the system), which in turn outputs speech recognition data (e.g., words, commands, and the like) without using any remote computer or cloud services.

At the voice privacy module, the audio data is anonymized. The voice privacy module receives the audio data and anonymizes the audio data for the ASR SDK. The ASR SDK may send the anonymized audio data to a remote computer for ASR, receive the speech recognition data from the remote computer, and subsequently output the speech recognition data for an application. The voice privacy module also or instead processes the audio data to separate the voice data of the user from the rest of the audio data, anonymize the voice data, and combine the anonymized voice data with the rest of the audio data such that the audio data is untouched except that the voice data of the user is anonymized. The voice privacy module may output the anonymized voice data for the combined module (including the ASR SDK), which may send the anonymized voice data to a remote computer for ASR, receive the speech recognition data from the remote computer, and subsequently output the speech recognition data for an application. In one or more implementations, the voice privacy module may also remove the reference signal via AEC. The processes performed by the voice privacy module are discussed in more detail below with reference to the figures.

In one or more implementations, one or more ASR SDKs and/or one or more ASR modules may be included as part of the system. For example, an extension may be an extension of the system such that the system and the extension are on the same chip. As another example, the extension may be a separate chip connected to the system.

illustrates a schematic diagram of the voice privacy module, in accordance with one or more aspects of the subject technology. The voice privacy module anonymizes the voice data in an audio data before sending the anonymized audio data to an ASR SDK (e.g., for remote ASR).

The audio data may be received from at least one microphone (e.g., a microphone array). An FFV processing module (e.g., FFV module) may receive the audio data and output voice data of the user based on the audio data. The voice privacy module may receive the voice data and anonymize the voice data in an instance of a voice anonymization algorithm of the anonymization module. The voice anonymization algorithm may be any voice anonymization method, such as a vocoder, x-vector-based voice conversion, and the like, and may include transforming any acoustic characteristic such as pitch, formant, inflection, timbre, and the like, and/or non-acoustic characteristic such as inflection and grammar. The voice privacy module outputs the anonymized voice data.
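To make the idea of transforming an acoustic characteristic such as pitch concrete, the following is a deliberately crude sketch that shifts pitch by resampling with linear interpolation. It is not a vocoder or an x-vector-based method, it also changes the duration of the signal, and it should not be taken as providing meaningful anonymization. The function name `resample_pitch_shift` and the `factor` parameter are assumptions for illustration.

```python
from typing import List


def resample_pitch_shift(voice: List[float], factor: float) -> List[float]:
    """Read the signal `factor` times faster (factor > 1 raises pitch, < 1 lowers it)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(voice) < 2:
        return list(voice)
    out = []
    last = len(voice) - 1
    pos = 0.0
    while pos <= last:
        i = int(pos)                      # sample at or before the read position
        frac = pos - i                    # fractional distance to the next sample
        nxt = voice[min(i + 1, last)]
        out.append((1.0 - frac) * voice[i] + frac * nxt)  # linear interpolation
        pos += factor
    return out
```

A practical anonymizer would typically preserve duration and intelligibility while altering speaker-specific features, which this sketch does not attempt.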

Additionally or alternatively, the voice privacy module receives the audio data from at least one microphone (e.g., a microphone array). The voice privacy module may separate the audio data at an audio source separation module into at least a voice data and a noise data, and anonymize the voice data at a voice anonymization module in an instance of a voice anonymization algorithm of the anonymization module. This instance may utilize the same voice anonymization algorithm as the other instance or a different one, and may be in the same or a separate voice anonymization module. The anonymization module may output an anonymized voice data. The anonymized voice data is combined with the other audio data and is then output as anonymized audio data.

In one or more implementations, the audio data may be pre-processed at the microphone array pre-processing module before the audio source separation. The pre-processing module is configured to enhance the performance of the audio source separation module by modifying the audio data to emphasize the voice data of the user in the audio data. For example, the pre-processing module may perform transformations on the audio such as boosting the gain of the audio data and applying a high-pass filter to cut the frequencies below the frequencies of the voice data of the user. The pre-processed audio data may be output to the audio source separation module to separate the voice data of the user from the rest of the audio data.

In one or more implementations, the pre-processing transformation(s) applied at the pre-processing module may be inverted at the inverse pre-processing module. The inverse pre-processing module may modify the anonymized audio data to return the audio data to its original state but with the voice data anonymized. For example, if the pre-processing module applies a gain boost and a high-pass filter to the audio data, the inverse pre-processing module applies the inverse of the high-pass filter and of the gain boost to the anonymized audio data.
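The pre-processing and its inverse can be sketched with a gain boost and a first-order pre-emphasis filter, a simple high-pass filter that has an exact inverse (de-emphasis). The choice of pre-emphasis, the function names `preprocess` and `inverse_preprocess`, and the default values of `gain` and `alpha` are assumptions for illustration; a general high-pass filter may not be exactly invertible.

```python
from typing import List


def preprocess(audio: List[float], gain: float = 2.0, alpha: float = 0.95) -> List[float]:
    """Boost the gain, then apply a first-order high-pass (pre-emphasis) filter."""
    out, prev = [], 0.0
    for x in audio:
        boosted = gain * x
        out.append(boosted - alpha * prev)   # y[n] = g*x[n] - alpha*g*x[n-1]
        prev = boosted
    return out


def inverse_preprocess(audio: List[float], gain: float = 2.0, alpha: float = 0.95) -> List[float]:
    """Undo `preprocess`: invert the filter (de-emphasis), then remove the gain."""
    if gain == 0:
        raise ValueError("gain must be non-zero to be invertible")
    out, prev = [], 0.0
    for y in audio:
        restored = y + alpha * prev          # recovers the boosted sample g*x[n]
        out.append(restored / gain)
        prev = restored
    return out
```

Applied back to back with the same parameters, the two functions return the original samples up to floating-point rounding; in the pipeline described above, the anonymization step would sit between them.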

illustrates a schematic diagram of an audio source separation process, in accordance with one or more aspects of the subject technology. The audio source separation process may be performed by the audio source separation module. The audio source separation module may include an echo cancelation module configured to cancel audio feedback from the audio data played on the voice control device (e.g., the reference signal). To cancel the audio feedback, the audio source separation module may perform acoustic echo cancelation to remove an echo from the audio data (e.g., the pre-processed audio data), which was generated based on the reference signal.

The audio source separation module may also be configured to separate the audio data into at least a voice data and a noise data at a demixing module. To separate the audio data, the demixing module may perform blind source separation, beamforming, or any other audio data separation algorithms. The output of the audio separation may include a voice data and a noise data (e.g., the background noise, output noise). The audio source separation module may also be configured to apply further processing to the voice data to enhance the anonymization process (e.g., increase anonymization efficiency, reduce noise that may result from anonymization, and the like) at a post-gain module. For example, the post-gain module may increase the gain of the audio data. The voice audio may be output to the voice anonymization module.
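As one illustration of the beamforming option mentioned above, the following sketches a delay-and-sum beamformer with integer sample delays, followed by a simple noise estimate taken as the first channel minus the beamformed voice. The function names `delay_and_sum` and `split_voice_noise`, the use of known non-negative integer delays, and the residual-based noise estimate are all assumptions; a real demixing module would estimate delays or use blind source separation.

```python
from typing import List, Sequence, Tuple


def delay_and_sum(channels: Sequence[List[float]], delays: Sequence[int]) -> List[float]:
    """Align each channel by its integer delay (in samples) and average the channels."""
    if len(channels) != len(delays) or not channels:
        raise ValueError("need one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(max(length, 0))
    ]


def split_voice_noise(channels: Sequence[List[float]],
                      delays: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Return (voice estimate, noise estimate) relative to the first channel."""
    voice = delay_and_sum(channels, delays)
    aligned_first = channels[0][delays[0]:delays[0] + len(voice)]
    noise = [m - v for m, v in zip(aligned_first, voice)]
    return voice, noise
```

Averaging aligned channels reinforces the source that matches the chosen delays and attenuates sounds arriving with other delays, which is why the residual can serve as a rough noise estimate.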

illustrates a schematic diagram of a data amalgamation process, in accordance with one or more aspects of the subject technology. The data amalgamation process may be performed by the audio amalgamation module. The audio amalgamation module may be configured to combine the voice data, the noise data, and/or the reference signal such that the output of the audio amalgamation module is substantially the same as the audio data input to the voice privacy module but with the voice data anonymized. To combine the audio data, the post-gain applied by the post-gain module may be inverted by the inverse post-gain module. For example, if the post-gain module increases the gain of the voice data, the inverse post-gain module may apply a corresponding decrease to the gain of the voice data.
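The amalgamation step can be sketched as undoing the post-gain on the anonymized voice and then summing it with the noise data and, optionally, the reference signal. The function name `amalgamate`, the scalar `post_gain`, and sample-wise addition as the mixing rule are assumptions made for this illustration.

```python
from typing import List, Optional


def amalgamate(anonymized_voice: List[float], noise: List[float],
               post_gain: float, reference: Optional[List[float]] = None) -> List[float]:
    """Recombine anonymized voice with the rest of the audio after inverting the post-gain."""
    if post_gain == 0:
        raise ValueError("post_gain must be non-zero to be invertible")
    # Inverse post-gain: return the voice to its level before the post-gain module.
    restored_voice = [v / post_gain for v in anonymized_voice]
    mixed = [v + n for v, n in zip(restored_voice, noise)]
    if reference is not None:
        # Optionally add back the echo-canceled reference signal.
        mixed = [m + r for m, r in zip(mixed, reference)]
    return mixed
```

For example, with a post-gain of 2.0, calling `amalgamate([2.0, 4.0], [0.5, 0.5], 2.0)` halves the voice samples before adding the noise samples.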

Patent Metadata

Filing Date

Unknown

Publication Date

March 24, 2026

Inventors

Unknown
