US-10553218

Dimensionality reduction of Baum-Welch statistics for speaker recognition

Published: February 4, 2020

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

In a speaker recognition apparatus, audio features are extracted from a received recognition speech signal, and first order Gaussian mixture model (GMM) statistics are generated therefrom based on a universal background model that includes a plurality of speaker models. The first order GMM statistics are normalized with regard to a duration of the received speech signal. A deep neural network reduces a dimensionality of the normalized first order GMM statistics, and outputs a voiceprint corresponding to the recognition speech signal.
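The statistics step in the abstract can be illustrated with a minimal sketch. It assumes a diagonal-covariance UBM and reads "normalized with regard to a duration" as division by the number of frames; both are assumptions made for illustration, and the function names (`posteriors`, `first_order_stats`, `normalize_by_duration`) and toy values are hypothetical, not taken from the patent.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at one frame."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, weights, means, variances):
    """Responsibility of each UBM component for one frame."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def first_order_stats(frames, weights, means, variances):
    """F_c = sum over frames of gamma_t(c) * x_t, flattened over components."""
    dim = len(frames[0])
    stats = [[0.0] * dim for _ in weights]
    for x in frames:
        gamma = posteriors(x, weights, means, variances)
        for c, g in enumerate(gamma):
            for d in range(dim):
                stats[c][d] += g * x[d]
    return [value for row in stats for value in row]

def normalize_by_duration(stats, num_frames):
    """Assumed normalization: divide by the frame count of the utterance."""
    return [value / num_frames for value in stats]

# Toy two-component UBM over two-dimensional features (illustrative values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [3.9, 4.2], [0.0, 0.3]]

stats = first_order_stats(frames, weights, means, variances)
normalized = normalize_by_duration(stats, len(frames))
```

The flattened vector has one entry per component and feature dimension, which is why a real UBM with many components yields a high-dimensional input that the deep neural network is then asked to reduce.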

Patent Claims

14 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A speaker recognition apparatus comprising: a computer configured to: extract audio features from a received recognition speech signal; generate first order Gaussian mixture model (GMM) statistics from the extracted audio features based on a universal background model that includes a plurality of speaker models; normalize the first order GMM statistics with regard to a duration of the received speech signal; train a deep neural network having a plurality of fully connected layers using a set of recognition speech signals; and execute the deep neural network having the plurality of fully connected layers to reduce a dimensionality of the normalized first order GMM statistics and output a voiceprint corresponding to the recognition speech signal, the fully connected layers of the deep neural network including: an input layer configured to receive the normalized first order GMM statistics; one or more sequentially arranged first hidden layers configured to receive coefficients from the input layer; and a last hidden layer arranged to receive coefficients from one hidden layer of the one or more first hidden layers, the last hidden layer having a dimension smaller than each of the one or more first hidden layers and configured to output the voiceprint corresponding to the recognition speech signal.

2. The speaker recognition apparatus according to claim 1, wherein the fully connected layers of the deep neural network further include an output layer for use in a training mode of the deep neural network, the computer configured to execute the output layer to receive coefficients from the last hidden layer and to calculate a plurality of output coefficients at a respective plurality of output units that correspond to distinct speakers represented in the set of recognition speech signals used for training the deep neural network; and the computer further configured to: receive the plurality of output coefficients and to calculate a loss result from the plurality of output coefficients, and lower the calculated loss result at each of a plurality of iterations by modifying one or more connection weights of the fully connected layers.

3. The speaker recognition apparatus according to claim 2, wherein the computer is configured to utilize backpropagation to modify the connection weights of the fully connected layers based on the loss result during the training mode.

4. The speaker recognition apparatus according to claim 2, wherein the computer is configured to utilize a categorical cross entropy function to calculate the loss result.

5. The speaker recognition apparatus according to claim 1, wherein the number of the one or more first hidden layers is four.

6. The speaker recognition apparatus according to claim 1, wherein for each received recognition speech signal the computer is configured to measure a duration of the received recognition speech signal and modify the first order statistics to correspond to a predetermined uniform duration.

7. The speaker recognition apparatus according to claim 6, wherein the computer is configured to randomly exclude up to 90% of the first order statistics from being received by the deep neural network.

8. A method of generating a speaker model, the method comprising: generating, by a computer, first order Gaussian mixture model (GMM) statistics from audio features extracted from a recognition speech signal, said GMM statistics being generated based on a universal background model that includes a plurality of speakers; normalizing, by the computer, the first order GMM statistics with regard to a duration of the received speech signal; training, by the computer, a deep neural network using a set of recognition speech signals; and reducing, by the computer, a dimensionality of the normalized first order GMM statistics using a plurality of fully connected feed-forward convolutional layers of the deep neural network and deriving a voiceprint corresponding to the recognition speech signal, wherein the reducing of the dimensionality of the normalized first order GMM statistics includes: receiving, by the computer, the normalized first order GMM statistics at an input layer of the plurality of fully connected feed-forward convolutional layers; receiving, by the computer, coefficients from the input layer at a first hidden layer of one or more sequentially arranged first hidden layers of the fully connected feed-forward convolutional layers, each first hidden layer receiving coefficients from a preceding layer of the plurality of fully connected feed-forward convolutional layers; receiving, by the computer at a last hidden layer, coefficients from one hidden layer of the one or more first hidden layers, the last hidden layer having a dimension smaller than each of the one or more first hidden layers; and outputting, by the computer from the last hidden layer, the voiceprint corresponding to the recognition speech signal.

9. The method according to claim 8, further comprising: in a training mode, receiving, by the computer at an output layer of the fully connected feed-forward convolutional layers of the deep neural network, coefficients from the last hidden layer; calculating, by the computer, a plurality of output coefficients for output at a respective plurality of output units of the output layer, the number of output units corresponding to distinct speakers represented in the set of recognition speech signals used for training the deep neural network; receiving, by the computer, the plurality of output coefficients; and calculating, by the computer, a loss result from the plurality of output coefficients, wherein the computer lowers the calculated loss result at each of a plurality of training iterations by modifying one or more connection weights of the fully connected layers.

10. The method according to claim 9, further comprising: performing, by the computer, backpropagation to modify connection weights of the fully connected feed-forward convolutional layers based on the loss result.

11. The method according to claim 9, further comprising: calculating, by the computer, the loss result utilizing a categorical cross entropy function.

12. The method according to claim 8, wherein the number of the one or more first hidden layers is four.

13. The method according to claim 8, wherein for each received recognition speech signal said normalizing the first order GMM statistics includes: measuring, by the computer, a duration of the received recognition speech signal, and modifying, by the computer, the first order statistics to correspond to a predetermined uniform duration.

14. The method according to claim 8, further comprising: excluding, by the computer, a majority of the first statistics from being received by the deep neural network by using a dropout technique.
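For readers who want a concrete picture of the network the claims recite, the following is a rough sketch, not the patented implementation. It assumes ReLU activations on the first hidden layers, a linear last hidden layer, treats the input layer as the raw statistics vector, and implements the dropout of claims 7 and 14 as zeroing a random subset of inputs; all of these are illustrative assumptions, and every name (`build_network`, `drop_inputs`, `voiceprint`, `speaker_loss`) and dimension is hypothetical. Weight updates by backpropagation are omitted.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Fully connected layer: n_out rows of n_in weights, plus biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, relu=True):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def build_network(input_dim, hidden_dims, bottleneck_dim, num_speakers, seed=0):
    """First hidden layers, a narrower last hidden layer, a training-only output layer."""
    if any(bottleneck_dim >= h for h in hidden_dims):
        raise ValueError("last hidden layer must be narrower than the others")
    rng = random.Random(seed)
    sizes = [input_dim] + list(hidden_dims) + [bottleneck_dim]
    hidden = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    output = make_layer(bottleneck_dim, num_speakers, rng)
    return hidden, output

def drop_inputs(stats, rate, rng):
    """Zero a random fraction (at most 90%) of the input statistics."""
    if not 0.0 <= rate <= 0.9:
        raise ValueError("rate must lie in [0, 0.9]")
    dropped = set(rng.sample(range(len(stats)), int(rate * len(stats))))
    return [0.0 if i in dropped else v for i, v in enumerate(stats)]

def voiceprint(stats, hidden):
    """Pass the statistics through all hidden layers; the last one is linear here."""
    x = stats
    for layer in hidden[:-1]:
        x = dense(x, layer)
    return dense(x, hidden[-1], relu=False)

def softmax(logits):
    """Convert output coefficients into speaker probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def speaker_loss(stats, speaker, hidden, output):
    """Categorical cross entropy of the true speaker at the output layer."""
    probs = softmax(dense(voiceprint(stats, hidden), output, relu=False))
    return -math.log(probs[speaker])

# Four first hidden layers and a smaller last hidden layer (toy sizes).
hidden, output = build_network(input_dim=8, hidden_dims=[16, 16, 16, 16],
                               bottleneck_dim=4, num_speakers=3)
stats = [0.1 * i for i in range(8)]
embedding = voiceprint(stats, hidden)  # used at recognition time
loss = speaker_loss(drop_inputs(stats, 0.5, random.Random(1)), 1, hidden, output)
```

In this sketch the output layer exists only to compute the training loss; at recognition time the activations of the last hidden layer serve as the voiceprint, which mirrors the split between claims 1 and 2.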

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L

Patent Metadata

Filing Date

September 19, 2017

Publication Date

February 4, 2020
