In classification tasks applicable to data that exhibit sequential output statistics, a classifier may be trained in an unsupervised manner based on a sequence of input samples and an unaligned sequence of output labels, using a cost function that measures the negative cross-entropy of an N-gram joint probability distribution derived from the sequence of output labels with respect to an expected N-gram frequency in a second sequence of output labels predicted by the classifier. In some embodiments, a primal-dual reformulation of the cost function is employed to facilitate optimization.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for training a classifier described by a posterior probability in an unsupervised manner, the method comprising: providing the posterior probability in a parametric form comprising one or more adjustable parameters; obtaining training data comprising a sequence of input samples and an N-gram joint probability distribution for a first sequence of output labels that are not correlated with the input samples; and using a computer processor to iteratively adjust the one or more adjustable parameters of the posterior probability to minimize a cost function, wherein the cost function comprises a negative cross-entropy of the N-gram joint probability distribution with respect to an expected N-gram frequency in a second sequence of output labels, the second sequence of output labels being predicted by the posterior probability from the sequence of input samples.
2. The method of claim 1 , wherein providing the N-gram joint probability distribution comprises providing the first sequence of output labels and computing the N-gram joint probability distribution from the first sequence of output labels.
3. The method of claim 1 , wherein adjusting the one or more adjustable parameters of the posterior probability comprises minimizing the cost function comprises computing a saddle point of an equivalent primal-dual formulation of the cost function.
4. The method of claim 3 , wherein the saddle point is computed using a stochastic primal-dual gradient method.
5. The method of claim 4 , wherein the stochastic primal-dual gradient method comprises initiating primal and dual variables of the primal-dual formulation of the cost function, and iteratively computing stochastic gradients of a component function with respect to the primal and dual variables and updating the primal variables in a direction of gradient descent and the dual variables in a direction of gradient ascent.
6. The method of claim 5 , wherein the stochastic primal-dual gradient method comprises, in computing the stochastic gradients, replacing sample averages by randomly sampled components or randomly sampled mini-batch averages.
7. The method of claim 1 , further comprising using the trained posterior probability to assign output labels to a test sequence of input samples.
8. The method of claim 1 , wherein the sequences of output labels are sequences of characters or words in a human language.
9. The method of claim 8 , wherein the sequence of input samples comprises a sequence of images.
10. The method of claim 8 , wherein the sequence of input samples comprises a voice recording.
11. One or more machine-readable media storing instructions for execution by one or more hardware processors, the instructions, when executed by the one or more hardware processors, causing the one or more hardware processors to perform operations for training a classifier described by a posterior probability provided in parametric form, the operations comprising: based on training data comprising a sequence of input samples and an N-gram joint probability distribution for a first sequence of output labels that are not correlated with the input samples, iteratively adjusting one or more parameters of the posterior probability to minimize a cost function, the cost function comprising a negative cross-entropy of the N-gram joint probability distribution with respect to an expected N-gram frequency in a second sequence of output labels, the second sequence of output labels being predicted by the posterior probability from the sequence of input samples.
12. The one or more machine-readable media of claim 11 , wherein the cost function is minimized by computing a saddle point of an equivalent primal-dual formulation of the cost function.
13. The one or more machine-readable media of claim 12 , wherein the saddle point is computed using a stochastic primal-dual gradient method.
14. The one or more machine-readable media of claim 13 , wherein the stochastic primal-dual gradient method comprises computing stochastic gradients by replacing sample averages in the primal-dual formulation of the cost function by randomly sampled components or randomly sampled mini-batch averages.
15. The one or more machine-readable media of claim 11 , the operations further comprising: using the trained posterior probability to assign output labels to a test sequence of input samples.
16. A system comprising: one or more hardware processors; and one or more machine-readable media storing instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform operations for training a classifier described by a posterior probability provided in parametric form, the operations comprising: based on training data comprising a sequence of input samples and an N-gram joint probability distribution for a first sequence of output labels that are not correlated with the input samples, iteratively adjusting parameters of the posterior probability to minimize a cost function, the cost function comprising a negative cross-entropy of the N-gram joint probability distribution with respect to an expected N-gram frequency in a second sequence of output labels, wherein the second sequence of output labels is predicted by the posterior probability from the sequence of input samples.
17. The system of claim 16 , wherein the cost function is minimized by computing a saddle point of an equivalent primal-dual formulation of the cost function.
18. The system of claim 17 , wherein the saddle point is computed using a stochastic primal-dual gradient method.
19. The system of claim 18 , wherein the stochastic primal-dual gradient method comprises computing stochastic gradients by replacing sample averages in the primal-dual formulation of the cost function by randomly sampled components or randomly sampled mini-batch averages.
20. The system of claim 16 , the operations further comprising: using the trained posterior probability to assign output labels to a test sequence of input samples.
Cooperative Patent Classification codes for this invention. Click any code to explore related patents in that topic.
June 13, 2017
September 15, 2020
Browse 5M+ US patents with plain-English claim translations and AI-generated analysis.