Legal claims defining the scope of protection, as filed with the USPTO.
2. A method of speech segment selection for use in constructing a concatenative synthesizer's database based on prosody-aligned distance measure, comprising the steps of:
(A) segmenting speech stored in a speech corpus, which is recorded in advance into a plurality of speech segments according to a unit type, wherein each of the speech segments has its prosody;
(B) locating pitch marks for each of the speech segments;
(C) selecting one of the speech segments according to the unit type as a source segment and the remaining speech segments as target segments, and performing a prosody alignment between the source segment and each of the target segments by modifying the prosody of the source segment with a respective prosody of each of the target segments, so as to obtain a prosody-aligned source segment with respect to each of the target segments, wherein the pitch marks of the prosody-aligned source segment are time-aligned and pitch-aligned with the pitch marks of each of the target segments;
(D) respectively measuring distortion between the prosody-aligned source segment and each of the target segments to obtain a distance between the prosody-aligned source segment and each of the target segments, and to obtain an average distance for the prosody-aligned source segment with respect to each of the target segments; and
(E) selecting at least one speech segment previously selected as the source segment with a relatively small average distance to be used as a synthetic speech unit of the unit type for constructing the synthesizer's database.
2. The method as claimed in claim 1, wherein in step (A), the unit type is a syllable.
3. The method as claimed in claim 1, wherein in step (A), the speech corpus is automatically segmented into a plurality of speech segments according to a unit type by a computer.
4. The method as claimed in claim 3, wherein the speech is segmented by using a Markov model.
5. The method as claimed in claim 1, wherein in step (C), the prosody alignment is performed between the source segment and each target segment by using a pitch synchronous overlap-and-add (PSOLA) algorithm.
6. The method as claimed in claim 1, wherein in step (D), the distance is $D_{ij} = \mathrm{dist}(\hat{S}_i\langle S_j\rangle, S_j)$, where $S_i$ is the source segment, $S_j$ is the target segment, and $\hat{S}_i\langle S_j\rangle$ is the waveform of the prosody-aligned source segment.
7. The method as claimed in claim 6, wherein step (D) measures the distortion between the prosody-aligned source segment and each of the target segments by using a Mel-frequency cepstrum coefficients (MFCC) algorithm.
8. The method as claimed in claim 6, wherein step (D) measures the distortion between the prosody-aligned source segment and each of the target segments by using a perceptual speech quality measure (PSQM) method.
9. The method as claimed in claim 6, wherein the average distance of one speech segment $S_i$ among other speech segments is $D_i = \frac{1}{N-1} \sum_{j=1,\, j \neq i}^{N} D_{ij}$, wherein $N$ is the number of speech segments.
10. The method as claimed in claim 9, wherein the value $i$ of the speech segment $S_i$ can be calculated according to an inverse function of the average distance, where the inverse function is $i = \arg\{D_i\}$.
11. The method as claimed in claim 10, wherein the value of $i$ of the speech segment $S_i$ with the smallest average distance can be calculated according to the inverse function $i_{\mathrm{opt}} = \arg\min_i \{D_i\}$.
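
The selection arithmetic recited in claims 6 and 9 to 11 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `align_stub`, `mean_squared_error`, `average_distances`, and `select_best` are hypothetical. The alignment step here is only linear resampling to the target length, standing in for the PSOLA-based prosody alignment of claim 5, and the distance is a plain mean squared error standing in for the MFCC or PSQM measures of claims 7 and 8.

```python
from typing import Callable, List, Sequence

Segment = Sequence[float]


def align_stub(source: Segment, target: Segment) -> List[float]:
    """Placeholder for prosody alignment (assumption, NOT PSOLA).

    Linearly resamples `source` so it has as many samples as `target`.
    """
    if not source or not target:
        raise ValueError("segments must be non-empty")
    length = len(target)
    if len(source) == 1 or length == 1:
        return [float(source[0])] * length
    scale = (len(source) - 1) / (length - 1)
    aligned = []
    for k in range(length):
        pos = k * scale
        lo = int(pos)
        hi = min(lo + 1, len(source) - 1)
        frac = pos - lo
        aligned.append(source[lo] * (1.0 - frac) + source[hi] * frac)
    return aligned


def mean_squared_error(a: Segment, b: Segment) -> float:
    """Placeholder distortion measure (assumption, not MFCC or PSQM)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(b)


def average_distances(
    segments: Sequence[Segment],
    align: Callable[[Segment, Segment], Segment] = align_stub,
    dist: Callable[[Segment, Segment], float] = mean_squared_error,
) -> List[float]:
    """Return D_i = (1 / (N - 1)) * sum over j != i of D_ij for every i."""
    n = len(segments)
    if n < 2:
        raise ValueError("need at least two segments")
    averages = []
    for i, source in enumerate(segments):
        total = 0.0
        for j, target in enumerate(segments):
            if j == i:
                continue
            # D_ij: distance between the aligned source and the target
            total += dist(align(source, target), target)
        averages.append(total / (n - 1))
    return averages


def select_best(averages: Sequence[float]) -> int:
    """Return i_opt = argmin_i D_i (index of the smallest average distance)."""
    return min(range(len(averages)), key=lambda i: averages[i])


# Toy data: three constant "segments" of one unit type (made-up values).
toy_segments = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]]
toy_averages = average_distances(toy_segments)
toy_best = select_best(toy_averages)
```

With the toy data, the middle segment has the smallest average distance to the others, so it would be the one kept as the synthetic speech unit for that unit type. Passing different `align` and `dist` callables is the intended way to swap in a real prosody alignment and a perceptual distortion measure.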
January 1, 2008