Systems and methods for automatically segmenting speech inventories. A set of Hidden Markov Models (HMMs) is initialized using bootstrap data. The HMMs are then re-estimated and aligned to produce phone labels. The boundaries of the phone labels are then corrected using spectral boundary correction. Optionally, the process is repeated iteratively, using the spectral-boundary-corrected phone labels as input in place of the bootstrap data, in order to further reduce mismatches between manual labels and the phone labels assigned by the HMM approach.
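The loop described in the abstract can be sketched as follows. This is only an illustrative outline, not the patented implementation: the function name `iterative_segmentation`, the `Label` tuple layout, and the three callables `train`, `align`, and `correct` are hypothetical placeholders for HMM training, Viterbi alignment, and spectral boundary correction respectively.

```python
from typing import Callable, List, Tuple

# (phone, start_seconds, end_seconds); this label layout is an assumption.
Label = Tuple[str, float, float]


def iterative_segmentation(
    bootstrap_labels: List[Label],
    train: Callable[[List[Label]], object],
    align: Callable[[object], List[Label]],
    correct: Callable[[List[Label]], List[Label]],
    iterations: int = 1,
) -> List[Label]:
    """Run train -> align -> correct, feeding corrected labels back in.

    The first pass trains on the bootstrap labels; every later pass trains
    on the labels produced by the previous pass's boundary correction.
    """
    labels = list(bootstrap_labels)
    for _ in range(iterations):
        hmms = train(labels)      # hypothetical: initialize and re-estimate HMMs
        labels = align(hmms)      # hypothetical: alignment yields phone labels
        labels = correct(labels)  # hypothetical: spectral boundary correction
    return labels
```

With `iterations=1` this corresponds to a single pass; larger values correspond to the optional iterative mode, in which corrected labels replace the bootstrap data as training input.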
Legal claims defining the scope of protection, as filed with the USPTO.
1. In a system that concatenates speech units to produce synthetic speech, a method for automatically segmenting unit labels, the method comprising: training a set of Hidden Markov Models (HMMs) using seed data in a first iteration; aligning the set of HMMs using a Viterbi alignment to produce segmented unit labels; and adjusting boundaries of the unit labels using spectral boundary correction, wherein the unit labels having adjusted boundaries are used to concatenate speech units to synthesize speech.
2. A method as defined in claim 1, wherein training a set of Hidden Markov Models further comprises: initializing the set of HMMs using at least one of hand-labeled bootstrapped data, speaker-independent HMM bootstrapped data, and flat start data; re-estimating the set of HMMs; and performing an embedded re-estimation on the set of HMMs.
3. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises adjusting boundaries of the unit labels within specified time windows.
4. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises: combining HMM-based segmentation with spectral features to reduce misalignments between target unit boundaries and boundaries assigned by the HMM-based segmentation.
5. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises: identifying context-dependent time windows around the unit boundaries, wherein the unit boundaries include one or more of: a vowel-to-vowel boundary; a vowel-to-nasal boundary; a vowel-to-voiced stop boundary; a vowel-to-liquid boundary; a vowel-to-unvoiced stop boundary; a vowel-to-voiced fricative boundary; an unvoiced stop-to-vowel boundary; a nasal-to-vowel boundary; a voiced stop-to-vowel boundary; a liquid-to-vowel boundary; an unvoiced fricative-to-vowel boundary; and a voiced fricative-to-vowel boundary.
6. A method as defined in claim 5, wherein context-dependent time windows are empirically determined by adjacent phones.
7. A method as defined in claim 1, further comprising using the unit labels whose boundaries have been adjusted by spectral boundary correction as input for a next iteration of: training a set of HMMs; aligning the set of HMMs using a Viterbi alignment to produce phone labels; and adjusting boundaries of the unit labels using spectral boundary correction.
8. A computer-readable media having computer-executable instructions for implementing the method of claim 1.
9. In a system having a speech inventory that includes phone labels that are concatenated to form synthetic speech, a method for segmenting the phone labels, the method comprising: performing a first alignment on a trained set of HMMs to produce phone labels that are segmented, wherein each phone label has a spectral boundary; and performing spectral boundary correction on the phone labels, wherein spectral boundary correction re-aligns each boundary using bending points of spectral transitions, wherein the phone labels having spectral boundary correction are used for speech synthesis.
10. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises bootstrapping the set of HMMs with at least one of speaker-dependent HMMs and speaker-independent HMMs.
11. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises: initializing the set of HMMs; re-estimating the set of HMMs; and performing embedded re-estimation on the set of HMMs.
12. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises performing a Viterbi alignment on the trained set of HMMs to produce phone labels that are segmented.
13. A method as defined in claim 11, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented and performing spectral boundary correction on the phone labels are performed iteratively.
14. A method as defined in claim 13, further comprising training the set of HMMs using phone labels having boundaries that have been re-aligned using spectral boundary correction.
15. A method as defined in claim 9, wherein performing spectral boundary correction on the phone labels further comprises performing spectral boundary correction on the phone labels within a context-dependent time window.
16. A method as defined in claim 15, further comprising empirically determining the context-dependent time window using adjacent phones.
17. A method as defined in claim 15, wherein each spectral boundary is between a first phone class and second phone class.
18. A computer-readable media having computer-executable instructions for implementing the method of claim 9.
19. A method for segmenting phone labels to reduce misalignments in order to improve synthetic speech when the phone labels are concatenated, the method comprising: training a set of HMMs using one of a specific speaker's hand-labeled speech data and speaker-independent speech data; segmenting the trained set of HMMs using a first alignment to produce phone labels, wherein each phone label has a spectral boundary; using a weighted slope metric to identify bending points of spectral transitions, where each bending point corresponds to a spectral boundary; and correcting a particular spectral boundary of a particular phone label if the particular spectral boundary does not coincide with a particular bending point, wherein the phone labels with corrected spectral boundaries are used for speech synthesis.
20. A method as defined in claim 19, wherein using a weighted slope metric to identify bending points of spectral transitions further comprises applying the weighted slope metric within context-dependent time windows such that spectral boundaries are not applied to the phone labels.
21. A method as defined in claim 20, further comprising retraining the set of HMMs using the phone labels that have been corrected using the weighted slope metric.
22. A method as defined in claim 20, wherein each spectral boundary is defined by a first phone class and a second phone class, wherein the first phone class and the second phone class include at least one of a vowel, an unvoiced stop, a voiced fricative, a voiced fricative, a liquid class and a nasal class.
23. A method as defined in claim 20, further comprising determining context-dependent time windows empirically.
24. A computer-readable media having computer-executable instructions for performing the method of claim 19.
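The boundary correction recited in the claims (moving an HMM-assigned boundary to a point of spectral transition inside a context-dependent time window) might be sketched as below. This is a hedged illustration under several assumptions: a plain Euclidean distance between adjacent spectral frames stands in for the weighted slope metric, which is not reproduced here; the window widths in `WINDOWS` are invented for the example, whereas the claims state that they are determined empirically; and the names `correct_boundary`, `spectral_change`, `WINDOWS`, and `DEFAULT_WINDOW` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical half-widths in seconds, keyed by (left class, right class).
# The claims say such windows are determined empirically; these are made up.
WINDOWS: Dict[Tuple[str, str], float] = {
    ("vowel", "nasal"): 0.03,
    ("vowel", "vowel"): 0.04,
}
DEFAULT_WINDOW = 0.02  # assumed fallback, not taken from the source


def spectral_change(frames: Sequence[Sequence[float]], i: int) -> float:
    """Euclidean distance between frames i-1 and i.

    A simple stand-in for the weighted slope metric named in the claims.
    """
    return math.dist(frames[i - 1], frames[i])


def correct_boundary(
    frames: Sequence[Sequence[float]],
    frame_shift: float,
    boundary: float,
    left_class: str,
    right_class: str,
) -> float:
    """Move a boundary to the largest spectral change inside its window."""
    half = WINDOWS.get((left_class, right_class), DEFAULT_WINDOW)
    center = round(boundary / frame_shift)
    half_frames = round(half / frame_shift)
    lo = max(1, center - half_frames)
    hi = min(len(frames) - 1, center + half_frames)
    if lo > hi:
        return boundary  # window falls outside the signal; leave unchanged
    # Prefer the largest change; on ties, prefer the frame nearest the
    # original boundary so that flat regions do not move it.
    best = max(
        range(lo, hi + 1),
        key=lambda i: (spectral_change(frames, i), -abs(i - center)),
    )
    return best * frame_shift
```

Working in whole frame indices rather than seconds avoids floating-point rounding at the window edges, and the tie-break keeps the boundary where the alignment placed it when no spectral transition is visible in the window.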
January 14, 2003
September 4, 2007