US-11468879

Duration informed attention network for text-to-speech analysis

Published: October 11, 2022

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A first set of spectra is generated based on the sequence of text components. A second set of spectra is generated based on the first set of spectra and the respective temporal durations of the sequence of text components. A spectrogram frame is generated based on the second set of spectra. An audio waveform is generated based on the spectrogram frame. The audio waveform is provided as an output.
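The step from the first set of spectra to the second set can be illustrated with a small sketch. It assumes the second set is formed by repeating each per-component vector for the number of frames given by its duration. The abstract does not state this; it is one plausible reading that fits the claims, which say the two sets can differ in size and that the frame alignment replicates the text alignment. The function name `expand_by_duration` and the toy vectors are hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(
    states: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each per-component vector durations[i] times, preserving order.

    Assumption: durations are integer frame counts, one per text component.
    """
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per text component")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative frame counts")

    expanded: List[List[float]] = []
    for state, n_frames in zip(states, durations):
        # Copy the vector once per output frame assigned to this component.
        expanded.extend(list(state) for _ in range(n_frames))
    return expanded


# Toy "first set": one 2-dimensional vector per text component (hypothetical values).
first_set = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Toy output of a duration model, in frames (hypothetical values).
durations = [2, 1, 3]

second_set = expand_by_duration(first_set, durations)
print(len(first_set), len(second_set))  # 3 6
```

In this sketch the output length equals the sum of the durations, so the frame-level sequence follows the order and timing of the text components by construction.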

Patent Claims

15 claims

Legal claims defining the scope of protection, as filed with the USPTO.

2. The method of claim 1, wherein the phonetic text characters are phonemes.

3. The method of claim 1, wherein the phonetic text characters are characters.

4. The method of claim 1, wherein the second set of spectra comprise mel-frequency cepstrum spectra.

6. The method of claim 1, wherein the determining of the respective temporal duration of each of the phonetic text characters is based on a ground truth duration of the phonetic text characters, wherein the ground truth duration of the phonetic text characters is determined using a hidden Markov Model forced alignment technique.

7. The method of claim 1, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.

9. The device of claim 8, wherein the phonetic text characters are phonemes.

10. The device of claim 8, wherein the phonetic text characters are characters.

11. The device of claim 8, wherein the second set of spectra comprise mel-frequency cepstrum spectra.

13. The device of claim 8, wherein the determining of the respective temporal duration of each of the phonetic text characters is based on a ground truth duration of the phonetic text characters, wherein the ground truth duration of the phonetic text characters is determined using a hidden Markov Model forced alignment technique.

14. The device of claim 8, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.

16. The non-transitory computer-readable medium of claim 15, wherein the phonetic text characters are phonemes.

17. The non-transitory computer-readable medium of claim 15, wherein the phonetic text characters are characters.

18. The non-transitory computer-readable medium of claim 15, wherein the second set of spectra comprise mel-frequency cepstrum spectra.

19. The non-transitory computer-readable medium of claim 15, wherein the second set of spectra includes a different number of spectra than as compared to the first set of spectra.

20. The non-transitory computer-readable medium of claim 15, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.
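Claims 6 and 13 refer to ground truth durations obtained by hidden Markov model forced alignment. A minimal sketch of how aligner output might be turned into per-phoneme frame counts follows. It assumes the aligner yields `(label, start_seconds, end_seconds)` intervals and that spectrogram frames are spaced 12.5 ms apart. Neither the interval format nor the hop size comes from the patent text, and `durations_in_frames` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, float, float]  # (label, start_seconds, end_seconds)


def durations_in_frames(
    intervals: Sequence[Interval], hop_seconds: float = 0.0125
) -> List[Tuple[str, int]]:
    """Convert aligned time intervals to per-label frame counts.

    Each boundary is rounded to a frame index before subtracting, so for
    contiguous intervals the counts sum to the final boundary's frame index.
    """
    counts: List[Tuple[str, int]] = []
    for label, start, end in intervals:
        if end < start:
            raise ValueError(f"interval for {label!r} ends before it starts")
        start_frame = round(start / hop_seconds)
        end_frame = round(end / hop_seconds)
        counts.append((label, end_frame - start_frame))
    return counts


# Hypothetical alignment for a short utterance (times in seconds).
alignment = [
    ("HH", 0.00, 0.05),
    ("AH", 0.05, 0.15),
    ("L", 0.15, 0.20),
    ("OW", 0.20, 0.40),
]

frame_counts = durations_in_frames(alignment)
print(frame_counts)  # [('HH', 4), ('AH', 8), ('L', 4), ('OW', 16)]
```

Counts of this kind could serve as training targets for a duration model, though the claims shown here do not spell out that procedure.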

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L

Patent Metadata

Filing Date

April 29, 2019

Publication Date

October 11, 2022
