Systems and techniques for a reverberation cancellation framework include receiving a far-field audio signal from a far-field microphone array and a near-field audio signal from a near-field microphone array, where the far-field microphone array is a greater distance from an audio source than the near-field microphone array. The far-field audio signal and the near-field audio signal are synchronized. The far-field audio signal and the near-field audio signal are encoded to remove noise artifacts from the far-field audio signal and the near-field audio signal. The far-field audio signal and the near-field audio signal are decoded to output an output audio signal with the noise artifacts removed.
Legal claims defining the scope of protection, as filed with the USPTO.
. A method comprising:
. The method of, wherein encoding the far-field audio signal and the near-field audio signal comprises:
. The method of, wherein the machine learning module is a convolutional neural network.
. The method of, wherein transforming the far-field audio signal and the near-field audio signal comprises performing a short-time Fourier transform on the far-field audio signal and the near-field audio signal to output the image representations of the far-field audio signal and the near-field audio signal.
. The method of, wherein decoding the far-field audio signal and the near-field audio signal comprises:
. The method of, wherein the noise artifacts include reverberation.
. The method of, wherein the near-field microphone arrangement includes one or more microphones on at least one of a phone, a tablet, an earbud, or a home assistant device.
. The method of, wherein the far-field microphone arrangement is an array of a plurality of microphones.
. A computer program product, the computer program product being tangibly embodied on a non-transitory computer-readable medium and comprising instructions that, when executed by at least one computing device, are configured to cause the at least one computing device to:
. The computer program product of, wherein encoding the far-field audio signal and the near-field audio signal comprises instructions that, when executed by the at least one computing device, are configured to cause the at least one computing device to:
. The computer program product of, wherein the machine learning module is a convolutional neural network.
. The computer program product of, wherein transforming the far-field audio signal and the near-field audio signal comprises instructions that, when executed by the at least one computing device, are configured to cause the at least one computing device to perform a short-time Fourier transform on the far-field audio signal and the near-field audio signal to output the image representations of the far-field audio signal and the near-field audio signal.
. The computer program product of, wherein decoding the far-field audio signal and the near-field audio signal comprises instructions that, when executed by the at least one computing device, are configured to cause the at least one computing device to:
. The computer program product of, wherein the noise artifacts include reverberation.
. The computer program product of, wherein the near-field microphone arrangement includes one or more microphones on at least one of a phone, a tablet, an earbud, or a home assistant device.
. The computer program product of, wherein the far-field microphone arrangement is an array of a plurality of microphones.
. A system, comprising:
. The system of, wherein the encoder module is configured to:
. The system of, wherein the machine learning module is a convolutional neural network.
. The system of, wherein the encoder module is configured to perform a short-time Fourier transform on the far-field audio signal and the near-field audio signal to output the image representations of the far-field audio signal and the near-field audio signal.
. The system of, wherein the decoder module is configured to:
. The system of, wherein the noise artifacts include reverberation.
. The system of, wherein the near-field microphone arrangement includes one or more microphones on at least one of a phone, a tablet, an earbud, or a home assistant device.
. The system of, wherein the far-field microphone arrangement is an array of a plurality of microphones.
Complete technical specification and implementation details from the patent document.
This description relates to a reverberation cancellation framework.
In an example teleconferencing setting, with or without video, various factors may affect the quality of the sound captured from a speaker and transmitted to a listener. One factor is reverberation, also referred to interchangeably throughout as reverb. Reverberation or reverb may occur when the speaker is in a spacious room. A result of reverb includes the speaker's speech accompanied by echoing sounds that are heard by the listener. It is desirable for the listener to receive and hear clean speech (i.e., free from audio defects like reverb) from the speaker.
This document describes systems and techniques for reducing and/or eliminating the effects of reverb. A reverb cancellation framework includes a combination of a microphone arrangement, e.g., associated with a conference system (with or without video) and one or more microphones associated with other devices (e.g., a phone, a watch, an earbud, a voice-assistant device, a laptop, etc.). The microphone arrangement associated with the audio conference system may be a far-field microphone arrangement and the one or more microphones associated with the other devices may be a near-field microphone arrangement. The reverb cancellation framework synchronizes the audio from the far-field microphone arrangement and the near-field microphone arrangement. The synchronized audio from the far-field microphone arrangement and the near-field microphone arrangement may be processed by a multi-head, speech enhancement network to output reverb-free speech that is transmitted to the listener.
In some aspects, the techniques described herein relate to a method including: receiving a far-field audio signal from a far-field microphone arrangement and a near-field audio signal from a near-field microphone arrangement, the far-field microphone arrangement being at a greater distance from an audio source than the near-field microphone arrangement; synchronizing the far-field audio signal and the near-field audio signal; encoding the far-field audio signal and the near-field audio signal to remove noise artifacts from the far-field audio signal and the near-field audio signal; and decoding the far-field audio signal and the near-field audio signal to output an output audio signal with the noise artifacts removed.
In some aspects, the techniques described herein relate to a computer program product, the computer program product being tangibly embodied on a non-transitory computer-readable medium and including instructions that, when executed by at least one computing device, are configured to cause the at least one computing device to: receive a far-field audio signal from a far-field microphone arrangement and a near-field audio signal from a near-field microphone arrangement, the far-field microphone arrangement being at a greater distance from an audio source than the near-field microphone arrangement; synchronize the far-field audio signal and the near-field audio signal; encode the far-field audio signal and the near-field audio signal to remove noise artifacts from the far-field audio signal and the near-field audio signal; and decode the far-field audio signal and the near-field audio signal to output an output audio signal with the noise artifacts removed.
In some aspects, the techniques described herein relate to a system, including: at least one processor; and a non-transitory computer-readable medium including instructions that, when executed by the at least one processor, cause the system to implement a synchronization module, an encoder module, and a decoder module, wherein: the synchronization module is configured to: receive a far-field audio signal from a far-field microphone arrangement and a near-field audio signal from a near-field microphone arrangement, the far-field microphone arrangement being at a greater distance from an audio source than the near-field microphone arrangement, and synchronize the far-field audio signal and the near-field audio signal; the encoder module is configured to encode the far-field audio signal and the near-field audio signal to remove noise artifacts from the far-field audio signal and the near-field audio signal; and the decoder module is configured to decode the far-field audio signal and the near-field audio signal to output an output audio signal with the noise artifacts removed.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
This document describes technical solutions to technical problems associated with reverberation in an audio conference system. As discussed above, reverberation may occur when a speaker is in a spacious room and the array of microphones associated with the audio conference system are at a large distance from the speaker. An effect of the reverb is to output a speech signal that includes echoing audio sounds. The speech signals with the echoing audio sounds are transmitted to and heard by the listener. The technical solutions include systems and techniques to eliminate and/or reduce the reverb. That is, the technical solutions include a reverb cancellation framework that cancels the reverb and removes the effects of the echoing audio sounds from the speech. The technical effect is to output and produce reverb-free (or near reverb-free) speech from a speaker to a listener.
The reverb cancellation framework includes a combination of a microphone arrangement associated with an audio conference system (with or without video) and one or more microphones associated with other devices (e.g., a phone, a watch, an earbud, a voice-assistant device, a laptop, etc.). The microphone arrangement associated with the audio conference system may be a far-field microphone arrangement and the one or more microphones associated with the other devices may be a near-field microphone arrangement. The reverb cancellation framework synchronizes the audio from the far-field microphone arrangement and the near-field microphone arrangement. The synchronized audio from the far-field microphone arrangement and the near-field microphone arrangement is processed by a multi-head, speech enhancement network to output reverb-free speech that is transmitted to the listener.
is a block diagram of a system for a reverberation cancellation framework. The system includes a far-field microphone arrangement and a near-field microphone arrangement. The system includes a synchronization module, an encoder module, a decoder module, and a network. In general, the system is configured to receive far-field audio signals from the far-field microphone arrangement and near-field audio signals from the near-field microphone arrangement and to output a reverb-free audio signal. For example, the far-field microphone arrangement may capture far-field audio signals from an audio source, such as a speaker. The near-field microphone arrangement may capture near-field audio signals from the same audio source, such as the same speaker. The far-field audio signals and the near-field audio signals are processed by the synchronization module, the encoder module, and the decoder module to output the reverb-free audio signal, as described in more detail below.
The system also includes at least one memory and at least one processor. The at least one processor may represent two or more processors in the system executing in parallel and utilizing corresponding instructions stored using the at least one memory. The at least one processor may include at least one central processing unit (CPU). The at least one memory represents a non-transitory computer-readable storage medium. Of course, similarly, the at least one memory may represent one or more different types of memory utilized by the system. In addition to storing instructions, which allow the at least one processor to implement the system and its various components, the at least one memory may be used to store data and other information used by and/or generated by the system and its components.
depicts an audio source. In this example, the audio source may be a person and may be referred to as the speaker. The audio source, in this case the speaker, may be communicating with a listener over a teleconference system. The teleconference system may be capable of receiving, transmitting, playing, and displaying both audio and video signals. In some implementations, the teleconference system may be capable of receiving, transmitting, and playing audio signals without video signals.
The teleconference system includes the far-field microphone arrangement, which is the same far-field microphone arrangement introduced above. The far-field microphone arrangement captures the audio signals from the audio source, including speech. The audio signals captured by the far-field microphone arrangement may be referred to as the far-field audio signals, as also referred to above. The far-field microphone arrangement may include one or more microphones (e.g., one, two, three, four, five, six, etc.). The far-field microphone arrangement may include high quality microphones. That is, the far-field microphone arrangement may include high-fidelity microphones.
In this example, a device may include the near-field microphone arrangement, which is the same near-field microphone arrangement introduced above. The near-field microphone arrangement also captures the audio signals from the audio source, including speech. The audio signals captured by the near-field microphone arrangement may be referred to as the near-field audio signals, as also referred to above. The near-field microphone arrangement may include one or more microphones (e.g., one, two, three, four, five, six, etc.). The near-field microphone arrangement may include low quality microphones. That is, the near-field microphone arrangement may include low-fidelity microphones.
The device includes a computing device that includes the near-field microphone arrangement having one or more microphones. For example, the device may include a phone, a watch, an earbud, a voice-assistant device, a laptop, or the like. In some implementations, the near-field microphone arrangement in the device is always-on or at least nearly always-on. This means that the device may be capable of capturing the near-field audio signals without being specifically activated by the user. That is, as long as the device is turned on and not placed in a mode not to capture the near-field audio signals, then the device will be capable of capturing the near-field audio signals.
In some implementations, the quality and fidelity of the microphones in the far-field microphone arrangement and the microphones in the near-field microphone arrangement are relative to each other. For example, the far-field microphone arrangement may include higher quality and/or higher fidelity microphones relative to the near-field microphone arrangement, which may include lower quality and/or lower fidelity microphones.
In some implementations, the number of microphones in the far-field microphone arrangement is greater than the number of microphones in the near-field microphone arrangement. For example, the far-field microphone arrangement may include six or more microphones and the near-field microphone arrangement may include two or fewer microphones.
As illustrated, the distance, d, is the distance between the audio source and the far-field microphone arrangement. The distance, d′, is the distance between the audio source and the near-field microphone arrangement. The distance, d, is greater than the distance, d′. That is, the far-field microphone arrangement is farther away from the audio source than the near-field microphone arrangement. Said another way, the near-field microphone arrangement is closer to the audio source than the far-field microphone arrangement. In this manner, the audio signals from the audio source will be received by and captured by the near-field microphone arrangement sooner or earlier in time than the far-field microphone arrangement. Said another way, there is a time delay or a difference in time between the time that the audio signals from the audio source reach the far-field microphone arrangement compared to the time that the audio signals from the audio source reach the near-field microphone arrangement.
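As a rough illustration of the size of this time delay, the following minimal sketch converts the difference between d and d′ into a delay in samples. The speed of sound (343 m/s), the example distances, the 16 kHz sample rate, and the function name are all assumptions chosen for illustration; none of them come from the description.

```python
# Assumed value: speed of sound in dry air at roughly 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def arrival_delay_samples(d_far_m: float, d_near_m: float, sample_rate_hz: int) -> int:
    """Samples by which the far-field capture lags the near-field capture.

    Only the direct acoustic path is modeled; clock offsets between devices
    and reflections are ignored.
    """
    if d_far_m < d_near_m:
        raise ValueError("far-field distance must not be smaller than near-field distance")
    delay_s = (d_far_m - d_near_m) / SPEED_OF_SOUND_M_PER_S
    return round(delay_s * sample_rate_hz)


# Hypothetical room: conference microphones 5 m away, phone 0.5 m away.
delay = arrival_delay_samples(d_far_m=5.0, d_near_m=0.5, sample_rate_hz=16_000)
print(delay)
```

Under these assumed values the direct-path delay is on the order of a couple of hundred samples, which suggests why the two captures need to be aligned before they are combined.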
In one use example, the audio source may be the speaker in a teleconference using the teleconference system. The far-field microphone arrangement is on the teleconference system at a distance, d, from the speaker. The device may be the speaker's phone on a table near the speaker. The phone includes the near-field microphone arrangement that is integrated as part of the phone. The near-field microphone arrangement is at a distance, d′, from the speaker. Other use examples may be similar, but the device may be a device other than the speaker's phone, such as a tablet computer, a laptop computer, a home assistant device, or another type of computing device.
Without the system described above, meaning with only the far-field microphone arrangement and without the near-field microphone arrangement and the other components of that system, the non-direct audio peaks of the audio signals from the audio source become very noticeable and audible to the listener because of the high room reverb effects due to the distance, d. In contrast, with the system, including the near-field microphone arrangement and the other components described above, the effects of the high room reverb can be cancelled and eliminated, at least to the extent that there is a reduced or even no noticeable reverb effect and a reduced or even no noticeable audible effect to the listener.
Referring to both the system and the teleconference example described above, the far-field microphone arrangement receives and captures the audio signals from the audio source and the near-field microphone arrangement receives and captures the audio signals from the audio source. The audio signals captured by the far-field microphone arrangement are referred to as far-field audio signals and the audio signals captured by the near-field microphone arrangement are referred to as near-field audio signals. Both the far-field audio signals and the near-field audio signals are input to the synchronization module.
The synchronization module receives the far-field audio signals and the near-field audio signals. Because there is a delay due to the distances between the audio source and the far-field microphone arrangement and the audio source and the near-field microphone arrangement, the synchronization module is configured to synchronize the far-field audio signals and the near-field audio signals to remove the time delay so that the far-field audio signals and the near-field audio signals can be processed to remove the noise, including any audio effects due to reverb. Additionally, because the teleconference system and the device have different system clocks, the synchronization module is configured to synchronize the far-field audio signals and the near-field audio signals.
The synchronization module may include one or more buffers to buffer the far-field audio signals and the near-field audio signals for synchronization. For example, the synchronization module may include a first buffer and a second buffer. The first buffer may store the far-field audio signals and the second buffer may store the near-field audio signals. In some implementations, the first buffer and the second buffer may be portions of a single buffer. In some implementations, the first buffer and the second buffer may be separate buffers. After buffering the far-field audio signals and the near-field audio signals, the synchronization module aligns the far-field audio signals and the near-field audio signals such that the far-field audio signals and the near-field audio signals are synchronized.
In some implementations, the synchronization module may use an unsupervised machine learning module or other unsupervised method to align the audio features, including the speech features, from the far-field audio signals and the near-field audio signals. For example, when the audio source is a speaker, the speech from the speaker that is captured by the far-field microphone arrangement and the near-field microphone arrangement, and that is recorded and buffered by the synchronization module, is aligned so that the timing of the far-field audio signals and the near-field audio signals match.
In some implementations, the synchronization module may use a type of cross-device communications to align the far-field audio signals and the near-field audio signals. For example, the synchronization module may use protocols such as Bluetooth low energy (BLE) or Wi-Fi Direct to coordinate the alignment of the far-field audio signals and the near-field audio signals.
In some implementations, a pattern in the far-field audio signals may be matched with a pattern in the near-field audio signals. To synchronize the audio features, including the speech features, from the far-field audio signals and the near-field audio signals, one of the far-field audio signals or the near-field audio signals may be delayed by a time determined by the synchronization module.
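One common way to match a pattern between two buffered signals is brute-force cross-correlation, sketched below. This is an illustrative assumption, not the method the description specifies: the function names `estimate_lag` and `delay_signal`, the search range, and the toy sample values are all hypothetical.

```python
def estimate_lag(far, near, max_lag):
    """Return the lag in samples (0..max_lag) at which `far` best matches `near`.

    The score for each candidate lag is the raw cross-correlation between the
    near-field samples and the far-field samples shifted by that lag.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(near), len(far) - lag)
        score = sum(near[i] * far[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_signal(signal, lag, length):
    """Delay `signal` by `lag` samples (zero padded) and cut it to `length`."""
    return ([0.0] * lag + list(signal))[:length]


# Toy data: the far-field capture is a quieter copy arriving 3 samples later.
near = [0.0, 1.0, -0.5, 0.25, 2.0, -1.0, 0.5, 0.0]
far = [0.0] * 3 + [0.6 * x for x in near]

lag = estimate_lag(far, near, max_lag=5)
near_aligned = delay_signal(near, lag, len(far))
print(lag)
```

Here the near-field signal is the one that gets delayed, so that both signals line up sample for sample before encoding. A real system would likely normalize the correlation and also handle clock drift, which this sketch does not attempt.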
The synchronization module sends the synchronized far-field audio signals and the near-field audio signals to the encoder module. The encoder module receives the synchronized far-field audio signals and near-field audio signals. The far-field audio signals and the near-field audio signals are in the time domain. The encoder module is configured to transform the far-field audio signals and the near-field audio signals to the frequency domain or spectral domain.
In some implementations, the encoder module includes a short-time Fourier transform (STFT) module. The STFT module includes one or more STFT blocks that are configured to transform the synchronized far-field audio signals and the near-field audio signals from the time domain to the frequency domain. The output from the STFT module is representations of the far-field audio signals and the near-field audio signals as spectrograms. The spectrograms are visual representations (e.g., images) of the spectrum of frequencies represented by the original synchronized, time domain far-field audio signals and the near-field audio signals. For example, the spectrograms show an intensity of the respective signal versus frequency and time. The spectrograms are processed by a machine learning module that is part of the encoder module.
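To make the time-to-frequency step concrete, here is a minimal, dependency-free sketch of an STFT that yields a magnitude spectrogram (intensity versus frequency and time). The Hann window, the frame length of 64, the hop of 32, and the function names are assumptions for illustration; the description does not state these parameters, and a production system would use an optimized FFT rather than this direct DFT.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=32):
    """Return one list of complex bins (0..frame_len//2) per analysis frame."""
    # Periodic Hann window (an assumed choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT of the windowed segment, keeping the non-negative frequencies.
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


def magnitude_spectrogram(frames):
    """Intensity per (time frame, frequency bin): the image-like representation."""
    return [[abs(value) for value in frame] for frame in frames]


# A pure tone centered on frequency bin 8 should light up that bin in every frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(stft(tone))
print(len(spec), len(spec[0]))
```

Each row of `spec` is one time slice and each column one frequency bin, which is the two-dimensional layout that lets an image-oriented network process audio.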
The machine learning module is configured to process the spectrograms to remove the noise artifacts, including the reverb, and to combine the far-field audio signals and the near-field audio signals into a single, reverb-free audio signal. In some implementations, the machine learning module is trained with training data comprising spectrograms of far-field audio signals and near-field audio signals including reverb (or other noise artifacts), and with training data comprising spectrograms of far-field audio signals and near-field audio signals not including reverb (or other noise artifacts). In some implementations, the machine learning module is a convolutional neural network (CNN). In some implementations, the CNN is a U-Net, where the U-Net is a custom U-Net that is custom-trained to output a visual representation of the audio signal from the audio source that is reverb-free. In some implementations, the CNN is a Mask R-CNN, where the Mask R-CNN is a custom Mask R-CNN that is custom-trained to output a visual representation of the audio signal from the audio source that is reverb-free.
In these implementations, the CNN includes multiple convolutional layers (e.g., multi-head attention layers) in which the desired spatial properties are extracted from the spectrograms. For example, the far-field audio signals from the far-field microphone arrangement may include accurate beam steering properties and high-frequency, intelligibility properties of user speech. These desired properties are extracted from the spectrograms representing the far-field audio signals. The undesirable properties, including the reverb and other noise properties, from the spectrograms representing the far-field audio signals are not extracted.
The near-field audio signals from the near-field microphone arrangement may include speech properties with good audio masks with accurate subtraction of room models. These desired properties are extracted from the spectrograms representing the near-field audio signals. Any undesirable properties from the spectrograms representing the near-field audio signals are not extracted.
The encoder module extracts the desired properties (or features) and encodes the desired properties. The encoded output of the encoder module is then communicated to the decoder module. The decoder module includes a deconvolution module and an inverse STFT (iSTFT) module. The deconvolution module converts the output of the encoder module back to an image frequency representation that is accurate in reverb cancellation but still maintains the desired properties and qualities of the audio signal, including the speech. The iSTFT module performs an inverse transformation to transform the image frequency representation (e.g., spectral domain features) back to a time domain output of the audio signal with reverb cancellation, that is, the reverb-free audio signal.
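The inverse transformation can be sketched as an inverse DFT per frame followed by overlap-add. The sketch below assumes the forward transform used a periodic Hann analysis window with no separate synthesis window, an even frame length, and one-sided complex spectra (bins 0 through frame_len/2); these choices and all function names are illustrative assumptions rather than details from the description.

```python
import cmath
import math


def hann(frame_len):
    """Periodic Hann window, assumed to match the analysis window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]


def inverse_frame(half_spectrum, frame_len):
    """Inverse DFT of a one-sided spectrum belonging to a real-valued frame."""
    # Rebuild the negative-frequency bins from conjugate symmetry.
    full = list(half_spectrum) + [
        half_spectrum[frame_len - k].conjugate()
        for k in range(frame_len // 2 + 1, frame_len)
    ]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * n / frame_len)
            for k in range(frame_len)).real / frame_len
        for n in range(frame_len)
    ]


def istft(frames, frame_len=64, hop=32):
    """Overlap-add the inverse frames and undo the summed analysis window."""
    if not frames:
        return []
    window = hann(frame_len)
    length = hop * (len(frames) - 1) + frame_len
    out = [0.0] * length
    norm = [0.0] * length
    for t, half_spectrum in enumerate(frames):
        segment = inverse_frame(half_spectrum, frame_len)
        for n in range(frame_len):
            out[t * hop + n] += segment[n]
            norm[t * hop + n] += window[n]
    # Samples where the summed window is zero cannot be recovered; leave them at 0.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

Dividing by the summed window is what makes the round trip exact wherever at least one frame covers a sample with non-zero window weight. With a Hann window, the very first sample has zero weight and is returned as 0.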
Referring back to the system described above, the reverb-free audio signal is communicated over the network, where it may be received on the other end of the teleconference system and heard by the listener.
In some implementations, the synchronization module, the encoder module, and the decoder module may be implemented as part of the teleconference system. That is, these components perform processing of their functions using memory and at least one processor on the teleconference system.
In some implementations, the synchronization module, the encoder module, and the decoder module may be implemented and split between the teleconference system and the device. For example, each of the teleconference system and the device may include an encoder module to process the respective far-field audio signals on the teleconference system and the near-field audio signals on the device, either before or after being synchronized by a synchronization module. In this example, the encoder module on each of the teleconference system and the device may perform its functions and then a synchronization module on either the teleconference system or the device may synchronize the output of the encoder module.
In some implementations, the decoder module may be implemented after the network. That is, the output of the encoder module may be communicated over the network and then a decoder module on the listener side of the network may perform the decoding functions to output the reverb-free audio signal for the listener.
In some implementations, the encoder module, which includes the machine learning module, may be trained using a supervised machine learning process. For example, the machine learning module may be trained using a high-fidelity microphone near the audio source as a ground truth. In this manner, the output of the audio source using a high-fidelity microphone can be used as a labelled ground truth to train the encoder module to eliminate the reverb from the far-field audio signals.
In some implementations, the near-field microphone arrangement may include microphones from multiple, different devices. That is, the device may represent multiple devices that are near the audio source. The microphones from the multiple devices may contribute to and be considered a part of the near-field microphone arrangement.
Additionally, if the device represents multiple devices, a threshold of audio energy may be used to determine whether or not to include the microphones from a particular device as part of the near-field microphone arrangement. That is, a device may be too far from the audio source, or may be moved away from the audio source, such that it does not contribute enough audio energy to be included.
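A minimal sketch of such an energy gate follows. It assumes mean squared amplitude as the energy measure and a hypothetical threshold value; the description does not define how the audio energy is computed or what threshold is used, and the device names are made up for the example.

```python
def mean_energy(samples):
    """Mean squared amplitude of a block of samples (assumed energy measure)."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def select_near_field_devices(samples_by_device, energy_threshold):
    """Keep only the devices whose captured audio meets the energy threshold."""
    return [
        device
        for device, samples in samples_by_device.items()
        if mean_energy(samples) >= energy_threshold
    ]


# Hypothetical capture blocks: the phone is close to the speaker, the laptop is not.
captures = {
    "phone": [0.5, -0.5, 0.5, -0.5],
    "laptop": [0.01, -0.01, 0.01, -0.01],
}
selected = select_near_field_devices(captures, energy_threshold=0.01)
print(selected)
```

Re-running the check on each new block of audio would let a device drop out of the near-field arrangement once it is moved away from the audio source.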
illustrates an example process for using the system described above. The process is a computer-implemented method that may be implemented by the system, including the synchronization module, the encoder module, and the decoder module. Instructions and/or executable code for the process may be stored in the at least one memory, and the stored instructions may be executed by the at least one processor. The process is also illustrative of a computer program product that may be implemented by the system.
The process includes receiving a far-field audio signal from a far-field microphone arrangement and a near-field audio signal from a near-field microphone arrangement, the far-field microphone arrangement being at a greater distance from an audio source than the near-field microphone arrangement. For example, the synchronization module is configured to receive a far-field audio signal from a far-field microphone arrangement and a near-field audio signal from a near-field microphone arrangement, the far-field microphone arrangement being at a greater distance from an audio source than the near-field microphone arrangement.
The process includes synchronizing the far-field audio signal and the near-field audio signal. For example, the synchronization module is configured to synchronize the far-field audio signal and the near-field audio signal.
The process includes encoding the far-field audio signal and the near-field audio signal to remove noise artifacts from the far-field audio signal and the near-field audio signal. For example, the encoder module is configured to encode the far-field audio signal and the near-field audio signal to remove noise artifacts from the far-field audio signal and the near-field audio signal.
The process includes decoding the far-field audio signal and the near-field audio signal to output an output audio signal with the noise artifacts removed. For example, the decoder module is configured to decode the far-field audio signal and the near-field audio signal to output an output audio signal with the noise artifacts removed.
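The four steps of the process can be wired together as in the skeleton below. Only the ordering of the stages comes from the description. The stage implementations shown are trivial stand-ins, and the function names are hypothetical. In particular, the averaging "encoder" is a placeholder for the machine learning module and does not remove reverb.

```python
from typing import Callable, List, Sequence, Tuple

Signal = Sequence[float]


def run_pipeline(
    far: Signal,
    near: Signal,
    synchronize: Callable[[Signal, Signal], Tuple[Signal, Signal]],
    encode: Callable[[Signal, Signal], List[float]],
    decode: Callable[[List[float]], List[float]],
) -> List[float]:
    """Receive both signals, then synchronize, encode, and decode, in that order."""
    far_sync, near_sync = synchronize(far, near)
    encoded = encode(far_sync, near_sync)
    return decode(encoded)


# Stand-in stages for illustration only (hypothetical, NOT real reverb removal).
def trim_to_common_length(far: Signal, near: Signal) -> Tuple[Signal, Signal]:
    n = min(len(far), len(near))
    return list(far[:n]), list(near[:n])


def average_signals(far: Signal, near: Signal) -> List[float]:
    return [(f + s) / 2 for f, s in zip(far, near)]


def identity(encoded: List[float]) -> List[float]:
    return list(encoded)


output = run_pipeline([1.0, 3.0, 5.0], [1.0, 1.0],
                      trim_to_common_length, average_signals, identity)
print(output)
```

Passing the stages in as callables mirrors the way the description allows the modules to live on different devices or on either side of the network.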
Example 1: A method comprising: receiving a far-field audio signal from a far-field microphone arrangement and a near-field audio signal from a near-field microphone arrangement, the far-field microphone arrangement being at a greater distance from an audio source than the near-field microphone arrangement; synchronizing the far-field audio signal and the near-field audio signal; encoding the far-field audio signal and the near-field audio signal to remove noise artifacts from the far-field audio signal and the near-field audio signal; and decoding the far-field audio signal and the near-field audio signal to output an output audio signal with the noise artifacts removed.
Example 2: The method of Example 1, wherein encoding the far-field audio signal and the near-field audio signal comprises: transforming the far-field audio signal and the near-field audio signal into image representations of the far-field audio signal and the near-field audio signal; and processing the image representations through a machine learning module to output encoded audio signals with the noise artifacts removed.
Example 3: The method of Example 2, wherein the machine learning module is a convolutional neural network.
Example 4: The method of Example 2 or 3, wherein transforming the far-field audio signal and the near-field audio signal comprises performing a short-time Fourier transform on the far-field audio signal and the near-field audio signal to output the image representations of the far-field audio signal and the near-field audio signal.
Example 5: The method of any one of Examples 2 to 4, wherein decoding the far-field audio signal and the near-field audio signal comprises: converting the encoded audio signals to image representations with the noise artifacts removed; and performing an inverse short-time Fourier transform on the image representations with the noise artifacts removed into the output audio signal with the noise artifacts removed.
Example 6: The method of any one of Examples 1 to 5, wherein the noise artifacts include reverberation.
December 4, 2025