US-9043213

Speech recognition and synthesis utilizing context dependent acoustic models containing decision trees

Published: May 26, 2015

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

A speech recognition method including the steps of receiving a speech input from a known speaker of a sequence of observations and determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model. The acoustic model has a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation and has been trained using first training data and adapted using second training data to said speaker. The speech recognition method also determines the likelihood of a sequence of observations occurring in a given language using a language model and combines the likelihoods determined by the acoustic model and the language model and outputs a sequence of words identified from said speech input signal. The acoustic model is context based for the speaker, the context based information being contained in the model using a plurality of decision trees and the structure of the decision trees is based on second training data.
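The combination step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the candidate list, the score values, and the `lm_weight` scaling factor are all assumptions introduced here, and the scores are taken to be log-likelihoods so that combining them is a sum.

```python
def combine_scores(acoustic_logp, language_logp, lm_weight=1.0):
    """Combine acoustic and language model log-likelihoods in the log domain.

    lm_weight is a hypothetical scaling factor; it does not appear in the source.
    """
    return acoustic_logp + lm_weight * language_logp


def best_hypothesis(candidates, lm_weight=1.0):
    """Return the word sequence with the highest combined score."""
    best = max(
        candidates,
        key=lambda c: combine_scores(c["am"], c["lm"], lm_weight),
    )
    return best["words"]


# Made-up scores for two competing word sequences (illustrative only).
candidates = [
    {"words": "recognize speech", "am": -120.0, "lm": -4.0},
    {"words": "wreck a nice beach", "am": -118.5, "lm": -9.0},
]

print(best_hypothesis(candidates))
```

With these invented numbers the acoustic model alone slightly prefers the second candidate, while the combined score prefers the first, which is the reason for combining the two likelihoods.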

Patent Claims

11 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A speech recognition method executed by processing circuitry programmed to implement speech recognition, said method comprising: receiving a speech input from a speaker which comprises a sequence of observations; and determining, using the processing circuitry, a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker, determining, using the processing circuitry, a likelihood of a sequence of observations occurring in a given language using a language model; and combining, using the processing circuitry, the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as: $(\hat{m}_{\mathrm{MAP}}, \hat{\lambda}_{\mathrm{MAP}}) = \arg\max_{m,\lambda} \{ \log p(O \mid m, \lambda) + \alpha \cdot \log p(O' \mid m, \lambda) \}$ where O′ is the first training data, O is the second training data, m denotes a parameter tying structure, λ is a set of HMM parameters, $\hat{m}_{\mathrm{MAP}}$ denotes the parameter tying structure under maximum a posteriori criteria, $\hat{\lambda}_{\mathrm{MAP}}$ are the HMM parameters under maximum a posteriori criteria and α is a parameter to be set.

2. The speech recognition method according to claim 1, wherein the structure of the decision trees is based on both the first and second training data.

3. The method according to claim 1, wherein the context dependency is implemented as tri-phones.

4. The method according to claim 1, wherein said acoustic model comprises probability distributions which are represented by means and variances and wherein said decision trees are provided for both means and variances.

5. The method according to claim 1, wherein said context based information is selected from phonetic, linguistic and prosodic contexts.

6. The method according to claim 1, wherein said decision trees are used to model at least one selected from expressive contexts, gender, age or voice characteristics.

7. A non-transitory computer readable carrier medium carrying computer readable instructions for controlling the computer to carry out the method of claim 1.

8. A text to speech processing method executed by processing circuitry programmed to implement text to speech processing, comprising: receiving a text input which comprises a sequence of words; and determining, using the processing circuitry, a likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to a speaker, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as: $(\hat{m}_{\mathrm{MAP}}, \hat{\lambda}_{\mathrm{MAP}}) = \arg\max_{m,\lambda} \{ \log p(O \mid m, \lambda) + \alpha \cdot \log p(O' \mid m, \lambda) \}$ where O′ is the first training data, O is the second training data, m denotes a parameter tying structure, λ is a set of HMM parameters, $\hat{m}_{\mathrm{MAP}}$ denotes the parameter tying structure under maximum a posteriori criteria, $\hat{\lambda}_{\mathrm{MAP}}$ are the HMM parameters under maximum a posteriori criteria and α is a parameter to be set.

9. A speech recognition apparatus comprising: a receiver for receiving a speech input from a speaker which comprises a sequence of observations; and processing circuitry programmed to implement speech recognition and configured to: determine a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker; determine a likelihood of a sequence of observations occurring in a given language using a language model; and combine the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as: $(\hat{m}_{\mathrm{MAP}}, \hat{\lambda}_{\mathrm{MAP}}) = \arg\max_{m,\lambda} \{ \log p(O \mid m, \lambda) + \alpha \cdot \log p(O' \mid m, \lambda) \}$ where O′ is the first training data, O is the second training data, m denotes a parameter tying structure, λ is a set of HMM parameters, $\hat{m}_{\mathrm{MAP}}$ denotes the parameter tying structure under maximum a posteriori criteria, $\hat{\lambda}_{\mathrm{MAP}}$ are the HMM parameters under maximum a posteriori criteria and α is a parameter to be set.

10. A text to speech system comprising: a receiver for receiving a text input which comprises a sequence of words; and processing circuitry programmed to implement text to speech processing and configured to: determine a likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as: $(\hat{m}_{\mathrm{MAP}}, \hat{\lambda}_{\mathrm{MAP}}) = \arg\max_{m,\lambda} \{ \log p(O \mid m, \lambda) + \alpha \cdot \log p(O' \mid m, \lambda) \}$ where O′ is the first training data, O is the second training data, m denotes a parameter tying structure, λ is a set of HMM parameters, $\hat{m}_{\mathrm{MAP}}$ denotes the parameter tying structure under maximum a posteriori criteria, $\hat{\lambda}_{\mathrm{MAP}}$ are the HMM parameters under maximum a posteriori criteria and α is a parameter to be set.

11. The speech to speech translation system, said system comprising: a speech recognition system configured to recognize speech in a first language, a translation module configured to translate text received in a first language into text of a second language, and a text to speech system configured to output speech in said second language, wherein the speech recognition apparatus comprises: a receiver for receiving a speech input from a speaker which comprises a sequence of observations; and processing circuitry programmed to implement speech recognition and configured to: determine a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker; determine a likelihood of a sequence of observations occurring in a given language using a language model; and combine the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as: $(\hat{m}_{\mathrm{MAP}}, \hat{\lambda}_{\mathrm{MAP}}) = \arg\max_{m,\lambda} \{ \log p(O \mid m, \lambda) + \alpha \cdot \log p(O' \mid m, \lambda) \}$ where O′ is the first training data, O is the second training data, m denotes a parameter tying structure, λ is a set of HMM parameters, $\hat{m}_{\mathrm{MAP}}$ denotes the parameter tying structure under maximum a posteriori criteria, $\hat{\lambda}_{\mathrm{MAP}}$ are the HMM parameters under maximum a posteriori criteria and α is a parameter to be set, and wherein the text to speech system comprises: a receiver for receiving a text input which comprises a sequence of words; and processing circuitry programmed to implement text to speech processing and configured to: determine a likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as: $(\hat{m}_{\mathrm{MAP}}, \hat{\lambda}_{\mathrm{MAP}}) = \arg\max_{m,\lambda} \{ \log p(O \mid m, \lambda) + \alpha \cdot \log p(O' \mid m, \lambda) \}$ where O′ is the first training data, O is the second training data, m denotes a parameter tying structure, λ is a set of HMM parameters, $\hat{m}_{\mathrm{MAP}}$ denotes the parameter tying structure under maximum a posteriori criteria, $\hat{\lambda}_{\mathrm{MAP}}$ are the HMM parameters under maximum a posteriori criteria and α is a parameter to be set.
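
The maximum a posteriori criterion recited in the independent claims can be illustrated with a small sketch of how one candidate node split might be scored. Everything here is a simplifying assumption rather than the patented method: each node is modelled by a single one-dimensional Gaussian instead of full HMM state distributions, samples are `(context, value)` pairs, questions are yes/no predicates on the context, and the names `node_objective`, `split_gain` and `var_floor` are hypothetical. Under those assumptions, maximising log p(O | m, λ) + α · log p(O′ | m, λ) over λ reduces to a weighted maximum-likelihood fit in which each first-training-data sample counts with weight α.

```python
import math


def node_objective(adapt, train, alpha, var_floor=1e-6):
    """Value of log p(O|lambda) + alpha * log p(O'|lambda) for one shared Gaussian.

    adapt holds second training data values (O), train holds first training
    data values (O'). The Gaussian is the weighted ML fit (an assumption).
    """
    weight = len(adapt) + alpha * len(train)
    if weight == 0:
        return 0.0  # empty node contributes nothing
    mean = (sum(adapt) + alpha * sum(train)) / weight
    sq = sum((x - mean) ** 2 for x in adapt)
    sq += alpha * sum((x - mean) ** 2 for x in train)
    var = max(sq / weight, var_floor)  # floor avoids log(0) on tiny nodes
    return -0.5 * (weight * math.log(2.0 * math.pi * var) + sq / var)


def split_gain(adapt, train, question, alpha):
    """Increase in the objective when a node is split by a yes/no question."""
    def part(samples, answer):
        return [x for ctx, x in samples if question(ctx) == answer]

    def values(samples):
        return [x for _, x in samples]

    parent = node_objective(values(adapt), values(train), alpha)
    yes = node_objective(part(adapt, True), part(train, True), alpha)
    no = node_objective(part(adapt, False), part(train, False), alpha)
    return yes + no - parent


# Illustrative data: the context is a left-neighbour phone label.
adapt = [("n", 1.0), ("n", 1.2), ("t", 3.0), ("t", 3.1)]   # O (speaker data)
train = [("n", 0.9), ("n", 1.1), ("t", 2.0), ("t", 2.2)]   # O' (original data)


def is_left_n(ctx):
    return ctx == "n"


gain = split_gain(adapt, train, is_left_n, alpha=0.5)
print(f"gain for the 'left phone is n' question: {gain:.3f}")
```

A tree builder would evaluate `split_gain` for every candidate question at a node and keep the question with the largest gain. Setting `alpha` to 0 makes the tree structure depend on the second training data alone, while larger values let the first training data influence the structure as well.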

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L G06F

Patent Metadata

Filing Date

January 26, 2011

Publication Date

May 26, 2015
