Processing Multi-Channel Audio Waveforms

Published: July 4, 2017

Assignee: Not available in the USPTO data we have

Inventors: Tara N. Sainath, Ron J. Weiss, Kevin William Wilson, Andrew W. Senior, Arun Narayanan, Yedid Hoshen, Michiel A.U. Bacchiani

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A system comprising: one or more computers and one or more data storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.

2. The system of claim 1, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.

3. The system of claim 1, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.

4. The system of claim 1, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.

5. The system of claim 3, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.

6. The system of claim 1, wherein combining the convolution outputs comprises: summing, for each of the multiple filters, the convolution outputs obtained for different channels using the filter to generate summed outputs corresponding to different time periods; and pooling, for each of the multiple filters, the summed outputs across the different time periods to generate a set of pooled values for the filter.

7. The system of claim 6, wherein pooling the summed outputs across the different time periods comprises max pooling the summed outputs across the different time periods to identify maximum values among the summed outputs for the different time periods.

8. The system of claim 6, wherein combining the convolution outputs comprises applying a rectified non-linearity to the sets of pooled values for each of the multiple filters to obtain rectified values; wherein inputting the combined convolution outputs to the deep neural network comprises inputting the rectified values to the deep neural network.

9. The system of claim 8, wherein the rectified non-linearity comprises a logarithm compression.

10. The system of claim 1, wherein the filters are configured to perform both spatial and spectral filtering.

11. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model comprises training the multiple filters and the deep neural network using a single module of an automated speech recognizer.

12. The system of claim 1, wherein the training process that jointly trains the multiple filters and trains the deep neural network as an acoustic model is performed using training data that includes audio data from a plurality of different microphone spacing configurations.

13. A computer-implemented method comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.

14. The method of claim 13, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.

15. The method of claim 13, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.

16. The method of claim 13, wherein the convolutional layer of the deep neural network is configured to perform a frequency domain convolution.

17. The method of claim 15, wherein the deep neural network is configured such that output of convolutional layer is input to at least one of the one or more LSTM layers, and output of the one or more LSTM layers is input to at least one of the multiple hidden layers.

18. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined based at least on output that the deep neural network provides in response to receiving the combined convolution outputs.

19. The non-transitory computer-readable medium of claim 18, wherein the multiple channels of audio data are multiple channels of audio waveform data for the utterance, wherein the multiple channels of audio waveform are recordings of the utterance by different microphones that are spaced apart from each other.

20. The non-transitory computer-readable medium of claim 18, wherein the deep neural network is a deep neural network comprising a convolutional layer, one or more long-short term memory (LSTM) layers, and multiple hidden layers.
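
The processing steps recited in claims 1 and 3 through 9 can be illustrated with a minimal NumPy sketch. It is not the patented implementation. The array layouts, the function names (`frontend_features`, `frequency_conv`, `lstm_layer`, `hidden_layer`, `cldnn_forward`), the `eps` offset, the use of one set of taps per filter per channel, pooling over a whole frame, and reading "frequency domain convolution" as a convolution along the filter axis are all assumptions made for illustration. Weights here would be random rather than jointly trained, and the output layer and decoding to a transcription are omitted.

```python
import numpy as np

def frontend_features(frame, filters, eps=1e-2):
    """Claims 1 and 6-9 (sketch): one feature per filter for a multi-channel frame.

    frame:   (C, M) array, C channels of M waveform samples.
    filters: (P, C, N) array, P filters with N taps per channel (assumed layout).
    """
    num_filters, num_channels, _ = filters.shape
    assert frame.shape[0] == num_channels
    feats = np.empty(num_filters)
    for p in range(num_filters):
        # Time-domain convolution of filter p with every channel.
        outputs = [np.convolve(frame[c], filters[p, c], mode="valid")
                   for c in range(num_channels)]
        summed = np.sum(outputs, axis=0)    # sum across channels
        pooled = summed.max()               # max pool across time
        rectified = max(pooled, 0.0)        # rectification
        feats[p] = np.log(rectified + eps)  # log compression (eps is assumed)
    return feats

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frequency_conv(feats, kernel):
    """Convolve each frame's features along the filter (frequency) axis."""
    return np.stack([np.convolve(row, kernel, mode="valid") for row in feats])

def lstm_layer(x, W, U, b):
    """Plain LSTM over time. x: (T, D), W: (D, 4H), U: (H, 4H), b: (4H,)."""
    hidden_size = U.shape[0]
    h, c = np.zeros(hidden_size), np.zeros(hidden_size)
    outputs = []
    for x_t in x:
        i, f, o, g = np.split(x_t @ W + h @ U + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def hidden_layer(x, W, b):
    """Fully connected layer with a ReLU activation (activation is assumed)."""
    return np.maximum(x @ W + b, 0.0)

def cldnn_forward(feats, params):
    """Claim 5 ordering: convolutional layer -> LSTM layer(s) -> hidden layers."""
    x = frequency_conv(feats, params["kernel"])
    x = lstm_layer(x, *params["lstm"])
    for W, b in params["hidden"]:
        x = hidden_layer(x, W, b)
    return x
```

Calling `frontend_features` once per frame and stacking the results gives a `(T, P)` feature matrix that `cldnn_forward` consumes. In the claimed system the filter taps and the network weights are learned together, so both `filters` and `params` would come from one training run, not from separate stages.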

Patent Metadata

Filing Date: Unknown

Publication Date: July 4, 2017

Inventors: Tara N. Sainath, Ron J. Weiss, Kevin William Wilson, Andrew W. Senior, Arun Narayanan, Yedid Hoshen, Michiel A.U. Bacchiani
