US-11468879

Duration informed attention network for text-to-speech analysis

Published: October 11, 2022
Assignee: not available in the USPTO data we have
Inventors: not available in the USPTO data we have
Technical Abstract

A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A first set of spectra is generated based on the sequence of text components. A second set of spectra is generated based on the first set of spectra and the respective temporal durations of the sequence of text components. A spectrogram frame is generated based on the second set of spectra. An audio waveform is generated based on the spectrogram frame. The audio waveform is provided as an output.
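
The step that turns the first set of spectra into the second can be pictured as a length expansion: each text component's vector is repeated for as many frames as its duration. The sketch below is a minimal illustration under that assumption. The function name `expand_by_duration`, the toy vectors, and the frame counts are hypothetical; the abstract does not spell out the mechanism, so this is one plausible reading rather than the patented implementation.

```python
from typing import List, Sequence


def expand_by_duration(
    first_spectra: Sequence[Sequence[float]],
    durations_in_frames: Sequence[int],
) -> List[List[float]]:
    """Repeat each per-component vector once per frame of its duration."""
    if len(first_spectra) != len(durations_in_frames):
        raise ValueError("need exactly one duration per text component")
    second_spectra: List[List[float]] = []
    for vector, n_frames in zip(first_spectra, durations_in_frames):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # One copy of the component's vector for every frame it spans.
        second_spectra.extend(list(vector) for _ in range(n_frames))
    return second_spectra


# Toy input: three text components with 2-dimensional vectors (made-up values).
first = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # hypothetical frames per component from a duration model

second = expand_by_duration(first, durations)
print(len(first), len(second))  # 3 6
```

In this toy run the second set holds six vectors while the first holds three, which matches the situation claim 19 describes, where the two sets contain different numbers of spectra.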

Patent Claims
15 claims

Legal claims defining the scope of protection. Each claim is shown in its original legal language, with a plain English translation where one is available.

Claim 2

Original Legal Text

2. The method of claim 1, wherein the phonetic text characters are phonemes.

Claim 3

Original Legal Text

3. The method of claim 1, wherein the phonetic text characters are characters.

Claim 4

Original Legal Text

4. The method of claim 1, wherein the second set of spectra comprise mel-frequency cepstrum spectra.

Plain English Translation

This claim narrows the text-to-speech method of claim 1 by specifying the form of the second set of spectra. As the abstract summarizes, a first set of spectra is generated from the sequence of text components, and a second set of spectra is generated from the first set together with the temporal durations determined for those components. Claim 4 requires that this second set comprise mel-frequency cepstrum spectra. The mel scale maps frequency nonlinearly so that it tracks human pitch perception more closely than a linear Hz scale, which is why mel-based representations are a common, compact way to describe speech. The spectrogram frame, and in turn the output audio waveform, are then generated from these spectra.
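
For readers unfamiliar with the mel scale named in this claim, the sketch below shows one widely used Hz-to-mel mapping, the 2595 * log10(1 + f / 700) convention. The claim text does not state which mel formula is used, so this choice and the function names `hz_to_mel` and `mel_to_hz` are assumptions for illustration only.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Map a frequency in Hz to mels using the 2595*log10(1 + f/700) form."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Higher frequencies are compressed: quadrupling Hz far less than quadruples mels.
for hz in (250.0, 1000.0, 4000.0):
    print(f"{hz:7.1f} Hz -> {hz_to_mel(hz):7.1f} mel")
```

Under this convention 1000 Hz lands at roughly 1000 mel, and the spacing between mel values grows more slowly than the spacing in Hz as frequency rises.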

Claim 6

Original Legal Text

6. The method of claim 1, wherein the determining of the respective temporal duration of each of the phonetic text characters is based on a ground truth duration of the phonetic text characters, wherein the ground truth duration of the phonetic text characters is determined using a hidden Markov Model forced alignment technique.

Claim 7

Original Legal Text

7. The method of claim 1, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.

Plain English Translation

This claim concerns how the generated spectrogram lines up with the input text. In the text-to-speech method of claim 1, as summarized in the abstract, the spectrogram frame is generated from the second set of spectra, which is itself derived from the first set of spectra and the temporal duration determined for each text component. Claim 7 adds that the alignment of frames in that spectrogram replicates the alignment of the text input. In plain terms, the frames follow the order and timing of the text components, so each stretch of frames corresponds to the text component it was generated for. Because the timing comes from the duration model, the acoustic output stays synchronized with the text it was synthesized from.
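
One way to read "replicates an alignment of the text input" is that every output frame maps back to exactly one text component, in the same left-to-right order as the text. The sketch below builds such a frame-to-component index map from per-component frame counts. The helper name `frame_to_component` and the example durations are hypothetical, and the patent may realize the alignment differently.

```python
from typing import List, Sequence


def frame_to_component(durations_in_frames: Sequence[int]) -> List[int]:
    """Return, for each output frame, the index of its text component."""
    mapping: List[int] = []
    for component_index, n_frames in enumerate(durations_in_frames):
        mapping.extend([component_index] * n_frames)
    return mapping


durations = [3, 1, 2]  # made-up frame counts for three text components
alignment = frame_to_component(durations)
print(alignment)  # [0, 0, 0, 1, 2, 2]

# The map never moves backwards, so frame order follows text order.
is_monotonic = all(a <= b for a, b in zip(alignment, alignment[1:]))
print(is_monotonic)  # True
```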

Claim 9

Original Legal Text

9. The device of claim 8, wherein the phonetic text characters are phonemes.

Claim 10

Original Legal Text

10. The device of claim 8, wherein the phonetic text characters are characters.

Plain English Translation

This claim narrows the device of claim 8 by specifying the unit of text it operates on: the phonetic text characters are characters, as opposed to the phonemes recited in claim 9. The device is a text-to-speech apparatus. Following the abstract, it receives a text input made up of a sequence of text components, determines a temporal duration for each component with a duration model, generates a first set of spectra from the sequence, generates a second set of spectra from the first set and the durations, generates a spectrogram frame from the second set, and outputs an audio waveform generated from that spectrogram frame. Under claim 10, the per-unit durations and spectra are tied to individual characters of the input text.

Claim 11

Original Legal Text

11. The device of claim 8, wherein the second set of spectra comprise mel-frequency cepstrum spectra.

Claim 13

Original Legal Text

13. The device of claim 8, wherein the determining of the respective temporal duration of each of the phonetic text characters is based on a ground truth duration of the phonetic text characters, wherein the ground truth duration of the phonetic text characters is determined using a hidden Markov Model forced alignment technique.

Plain English Translation

This invention relates to speech synthesis systems, specifically improving the naturalness of synthesized speech by accurately determining the temporal duration of phonetic text characters. The problem addressed is the unnatural rhythm and timing in synthesized speech, which occurs when phonetic elements are not properly aligned with their natural speaking durations. The solution involves using a hidden Markov Model (HMM) forced alignment technique to derive ground truth durations for phonetic text characters. These ground truth durations are then applied to adjust the timing of synthesized speech, ensuring that each phonetic character is rendered for its correct duration. The HMM forced alignment technique involves comparing a reference audio recording of natural speech with its corresponding phonetic transcription to statistically model the duration of each phonetic unit. This model is then used to predict the optimal duration for phonetic characters in synthesized speech, resulting in more natural and intelligible output. The invention enhances existing text-to-speech systems by incorporating precise phonetic timing derived from real speech data, reducing robotic or unnatural speech artifacts.
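
Forced alignment typically yields a start and end time for each phonetic unit, which then has to be converted into a per-unit frame count before it can be used as a duration label. The sketch below does only that conversion; it does not perform HMM alignment itself. The interval values, the 12.5 ms hop size, and the function name `intervals_to_frame_counts` are assumptions for illustration, not details taken from the patent.

```python
from typing import List, Sequence, Tuple


def intervals_to_frame_counts(
    intervals: Sequence[Tuple[str, float, float]],
    hop_seconds: float,
) -> List[Tuple[str, int]]:
    """Convert (unit, start_s, end_s) alignment intervals into frame counts.

    Boundaries are rounded to frame indices first, so for contiguous
    intervals the per-unit counts sum to the total frame span.
    """
    counts: List[Tuple[str, int]] = []
    for unit, start_s, end_s in intervals:
        if end_s < start_s:
            raise ValueError(f"interval for {unit!r} ends before it starts")
        start_frame = round(start_s / hop_seconds)
        end_frame = round(end_s / hop_seconds)
        counts.append((unit, end_frame - start_frame))
    return counts


# Hypothetical aligner output for the word "cat": (phoneme, start, end) in seconds.
aligned = [("K", 0.000, 0.075), ("AE", 0.075, 0.200), ("T", 0.200, 0.250)]
frame_counts = intervals_to_frame_counts(aligned, hop_seconds=0.0125)
print(frame_counts)  # [('K', 6), ('AE', 10), ('T', 4)]
```

Rounding the boundaries rather than the individual durations keeps the total at 20 frames for the 0.25 s example, so no frame is lost or double-counted between neighbouring units.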

Claim 14

Original Legal Text

14. The device of claim 8, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.

Claim 16

Original Legal Text

16. The non-transitory computer-readable medium of claim 15, wherein the phonetic text characters are phonemes.

Claim 17

Original Legal Text

17. The non-transitory computer-readable medium of claim 15, wherein the phonetic text characters are characters.

Plain English Translation

This claim narrows the non-transitory computer-readable medium of claim 15 by specifying that the phonetic text characters are characters, rather than the phonemes recited in claim 16. The medium stores instructions for the text-to-speech processing summarized in the abstract: determining a temporal duration for each text component with a duration model, generating a first set of spectra from the text, generating a second set of spectra from the first set and the durations, and producing a spectrogram frame and an output audio waveform. Under claim 17, the units that receive durations and drive spectrum generation are individual characters of the input text.

Claim 18

Original Legal Text

18. The non-transitory computer-readable medium of claim 15, wherein the second set of spectra comprise mel-frequency cepstrum spectra.

Claim 19

Original Legal Text

19. The non-transitory computer-readable medium of claim 15, wherein the second set of spectra includes a different number of spectra than as compared to the first set of spectra.

Claim 20

Original Legal Text

20. The non-transitory computer-readable medium of claim 15, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.

Patent Metadata

Filing Date: April 29, 2019

Publication Date: October 11, 2022

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Duration informed attention network for text-to-speech analysis” (US-11468879). https://patentable.app/patents/US-11468879
