Automatic Segmentation in Speech Synthesis

Published: September 8, 2009

Assignee: not available in USPTO data we have

Inventors: Alistair D. Conkie, Yeon-Jun Kim

Patent Claims

21 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A system that concatenates speech units to produce synthetic speech, the system comprising: a processor; a module configured to control the processor to train a set of Hidden Markov Models (HMMs) using seed data in a first iteration; a module configured to control the processor to align the set of HMMs to produce segmented unit labels; and a module configured to control the processor to adjust boundaries of the unit labels using spectral boundary correction, wherein the unit labels having adjusted boundaries are used to concatenate speech units to synthesize speech.

2. The system in claim 1, wherein the module configured to control the processor to train the set of Hidden Markov Models further: initializes the set of HMMs using at least one of hand-labeled bootstrapped data, speaker-independent HMM bootstrapped data, and flat start data; re-estimates the set of HMMs; and performs an embedded re-estimation on the set of HMMs.

3. The system of claim 1, wherein the module configured to control the processor to adjust boundaries of the unit labels using spectral boundary correction further adjusts boundaries of the unit labels within specified time windows.

4. The system of claim 1, wherein the module configured to control the processor to adjust boundaries of the unit labels using spectral boundary correction further: combines HMM-based segmentation with spectral features to reduce misalignments between target unit boundaries and boundaries assigned by the HMM-based segmentation.

5. The system of claim 1, wherein the module configured to control the processor to adjust boundaries of the unit labels using spectral boundary correction further: identifies context-dependent time windows around the unit boundaries, wherein the unit boundaries include one or more of: a vowel-to-vowel boundary; a vowel-to-nasal boundary; a vowel-to-voiced stop boundary; a vowel-to-liquid boundary; a vowel-to-unvoiced stop boundary; a vowel-to-voiced fricative boundary; an unvoiced stop-to-vowel boundary; a nasal-to-vowel boundary; a voiced stop-to-vowel boundary; a liquid-to-vowel boundary; an unvoiced fricative-to-vowel boundary; and a voiced fricative-to-vowel boundary.

6. The system of claim 5, wherein context-dependent time windows are empirically determined by adjacent phones.

7. The system of claim 1, further comprising a module configured to control the processor to use the unit labels whose boundaries have been adjusted by spectral boundary correction as input for a next iteration of: training a set of HMMs; aligning the set of HMMs using a Viterbi alignment to produce phone labels; and adjusting boundaries of the unit labels using spectral boundary correction.

8. A system having a speech inventory that includes phone labels that are concatenated to form synthetic speech, the system comprising: a processor; a module configured to control the processor to perform a first alignment on a trained set of HMMs to produce phone labels that are segmented, wherein each phone label has a spectral boundary; a module configured to control the processor to perform spectral boundary correction on the phone labels, wherein spectral boundary correction re-aligns each spectral boundary using bending points of spectral transitions; and a module configured to control the processor to synthesize speech using the phone labels having spectral boundary correction.

9. The system of claim 8, wherein the module configured to control the processor to perform a first alignment on a trained set of HMMs to produce phone labels that are segmented further bootstraps the set of HMMs with at least one of speaker-dependent HMMs and speaker-independent HMMs.

10. The system of claim 8, wherein the module configured to control the processor to perform a first alignment on a trained set of HMMs to produce phone labels that are segmented further: initializes the set of HMMs; re-estimates the set of HMMs; and performs embedded re-estimation on the set of HMMs.

11. The system of claim 8, wherein the module configured to control the processor to perform a first alignment on a trained set of HMMs to produce phone labels that are segmented further performs a Viterbi alignment on the trained set of HMMs to produce phone labels that are segmented.

12. The system of claim 10, wherein the module configured to control the processor to perform a first alignment on a trained set of HMMs to produce phone labels that are segmented and the module configured to perform spectral boundary correction on the phone labels perform iteratively.

13. The system of claim 12, further comprising a module configured to control the processor to train the set of HMMs using phone labels having boundaries that have been re-aligned using spectral boundary correction.

14. The system of claim 8, wherein the module configured to control the processor to perform spectral boundary correction on the phone labels further performs spectral boundary correction on the phone labels within a context-dependent time window.

15. The system of claim 14, further comprising a module configured to control the processor to determine empirically the context-dependent time window using adjacent phones.

16. The system of claim 14, wherein each spectral boundary is between a first phone class and a second phone class.

17. A computing device that segments phone labels to reduce misalignments in order to improve synthetic speech when the phone labels are concatenated, the computing device comprising: a processor; a module configured to control the processor to train a set of HMMs using one of a specific speaker's hand-labeled speech data and speaker-independent speech data; a module configured to control the processor to segment the trained set of HMMs using a first alignment to produce phone labels, wherein each phone label has a spectral boundary; a module configured to control the processor to use a weighted slope metric to identify bending points of spectral transitions, wherein each bending point corresponds to a spectral boundary; a module configured to control the processor to correct a particular spectral boundary of a particular phone label if the particular spectral boundary does not coincide with a particular bending point; and a module configured to control the processor to synthesize speech using the phone labels with corrected spectral boundaries.

18. The computing device of claim 17, wherein the module configured to control the processor to use a weighted slope metric to identify bending points of spectral transitions further applies the weighted slope metric within context-dependent time windows such that spurious spectral boundaries are not applied to the phone labels.

19. The computing device of claim 18, further comprising a module configured to control the processor to retrain the set of HMMs using the phone labels that have been corrected using the weighted slope metric.

20. The computing device of claim 18, wherein each spectral boundary is defined by a first phone class and a second phone class, wherein the first phone class and the second phone class include at least one of a vowel, an unvoiced stop, a voiced stop, an unvoiced fricative, a voiced fricative, a liquid class and a nasal class.

21. The computing device of claim 18, further comprising a module configured to control the processor to determine context-dependent time windows empirically.
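
Illustrative sketch (not part of the claims). The boundary adjustment recited in claims 3, 5, 14, and 18 can be pictured as a search for the point of greatest spectral change inside a window whose width depends on the two adjacent phone classes. The Python below is a minimal sketch under stated assumptions, not the patented method: it substitutes a plain Euclidean distance between adjacent feature frames for the weighted slope metric of claim 17, and the window widths, class names, and the names `WINDOW_FRAMES`, `spectral_change`, and `correct_boundary` are hypothetical placeholders that do not appear in the patent.

```python
from typing import Dict, Sequence, Tuple

Frame = Sequence[float]

# Hypothetical half-widths (in frames) of the search window per boundary type.
# The claims say only that the windows are context dependent and empirically
# determined; these numbers are placeholders, not values from the patent.
WINDOW_FRAMES: Dict[Tuple[str, str], int] = {
    ("vowel", "nasal"): 3,
    ("vowel", "unvoiced_stop"): 2,
    ("unvoiced_stop", "vowel"): 2,
}
DEFAULT_WINDOW = 1


def spectral_change(frames: Sequence[Frame], t: int) -> float:
    """Euclidean distance between frame t-1 and frame t.

    Assumption: a simple stand-in for the weighted slope metric of claim 17.
    """
    prev, cur = frames[t - 1], frames[t]
    return sum((a - b) ** 2 for a, b in zip(prev, cur)) ** 0.5


def correct_boundary(
    frames: Sequence[Frame], boundary: int, left_class: str, right_class: str
) -> int:
    """Move an HMM-assigned boundary to the largest spectral change in its window.

    `boundary` is the index of the first frame of the right-hand phone and is
    assumed to satisfy 1 <= boundary <= len(frames) - 1.
    """
    half = WINDOW_FRAMES.get((left_class, right_class), DEFAULT_WINDOW)
    lo = max(1, boundary - half)
    hi = min(len(frames) - 1, boundary + half)
    # Prefer the largest change; on ties, stay closest to the HMM boundary.
    return max(
        range(lo, hi + 1),
        key=lambda t: (spectral_change(frames, t), -abs(t - boundary)),
    )


# Toy features: the spectrum jumps between frame 4 and frame 5.
frames = [[0.0, 0.0]] * 5 + [[1.0, 1.0]] * 5

# The HMM put the boundary at frame 7; the vowel-to-nasal window reaches the jump.
print(correct_boundary(frames, 7, "vowel", "nasal"))  # 5

# A boundary type with the narrow default window cannot reach it and is unchanged.
print(correct_boundary(frames, 7, "nasal", "vowel"))  # 7
```

Restricting the search to a window keeps the corrected boundary near the HMM alignment, which is the role claim 18 gives the windows: spectral changes elsewhere in the phone are not mistaken for the boundary. In an iterative setup such as claim 7, the corrected labels would then be used to retrain the HMMs.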

Patent Metadata

Filing Date: Unknown

Publication Date: September 8, 2009

Inventors: Alistair D. Conkie, Yeon-Jun Kim
