A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A first set of spectra is generated based on the sequence of text components. A second set of spectra is generated based on the first set of spectra and the respective temporal durations of the sequence of text components. A spectrogram frame is generated based on the second set of spectra. An audio waveform is generated based on the spectrogram frame. The audio waveform is provided as an output.
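The step that turns the first set of spectra into the second set can be pictured as a duration-driven expansion. The sketch below is only an illustration under stated assumptions: the abstract does not say how the durations are applied, so repeating each component's spectrum for its number of frames is an assumed (though common) choice, and the function name `expand_by_duration`, the two-bin spectra, and the duration values are all hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(spectra: Sequence[Sequence[float]],
                       durations: Sequence[int]) -> List[List[float]]:
    """Repeat each per-component spectrum for its number of output frames.

    Assumption: one spectrum per text component and one integer duration
    (in frames) per component, as produced by some duration model.
    """
    if len(spectra) != len(durations):
        raise ValueError("need exactly one duration per text component")
    expanded: List[List[float]] = []
    for spectrum, n_frames in zip(spectra, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the spectrum once for every frame the component occupies.
        expanded.extend(list(spectrum) for _ in range(n_frames))
    return expanded


# Hypothetical input: three text components with made-up two-bin spectra.
first_set = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # frames per component (illustrative values only)
second_set = expand_by_duration(first_set, durations)
print(len(first_set), len(second_set))
```

Under this assumption the second set generally holds a different number of spectra than the first, because its length is the sum of the durations rather than the number of text components.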
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
2. The method of claim 1, wherein the phonetic text characters are phonemes.
3. The method of claim 1, wherein the phonetic text characters are characters.
4. The method of claim 1, wherein the second set of spectra comprise mel-frequency cepstrum spectra.
This claim narrows the method of claim 1 by specifying the form of the second set of spectra: they are mel-frequency cepstrum spectra. In the method summarized in the abstract, a first set of spectra is generated from the sequence of text components, and the second set is generated from the first set together with the temporal duration determined for each component. A spectrogram frame is then generated from the second set, and the audio waveform is generated from that spectrogram frame. The mel scale spaces frequencies nonlinearly to approximate human pitch perception, so a mel-frequency representation gives finer resolution at low frequencies than at high ones. The spectra here are an intermediate representation for synthesizing speech from text, not features extracted from an existing signal for identification or classification.
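To make the "mel-frequency" part of the claim concrete, here is a minimal sketch of one widely used mel-scale formula. The patent text shown here does not say which mel formula is used, and this snippet covers only the frequency mapping, not the cepstrum computation; the function names `hz_to_mel` and `mel_to_hz` are illustrative.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Map a frequency in Hz to mels (a common 2595 * log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz shrink on the mel scale as frequency rises.
low_step = hz_to_mel(2000.0) - hz_to_mel(1000.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
print(low_step > high_step)
```

The comparison at the end shows the compressive behavior: the same 1000 Hz step covers fewer mels at high frequencies than at low ones.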
6. The method of claim 1, wherein the determining of the respective temporal duration of each of the phonetic text characters is based on a ground truth duration of the phonetic text characters, wherein the ground truth duration of the phonetic text characters is determined using a hidden Markov Model forced alignment technique.
7. The method of claim 1, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.
This claim narrows the method of claim 1 to the case where the alignment of frames in the spectrogram frame replicates the alignment of the text input. The method is text-to-speech: the spectra are generated from the text components, not extracted from an audio signal. Because the second set of spectra is generated from the first set and the temporal duration determined for each text component, the frames based on the second set follow the same order as the text input, and each component accounts for a span of frames set by its duration. In plain terms, the output frames line up with the text that produced them, so the synthesized audio keeps the text's sequence of sounds.
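One way to picture frames that replicate the alignment of the text input is to record, for every output frame, which text component it belongs to. The sketch below assumes integer per-component durations in frames; the function name `frame_to_component` and the example durations are hypothetical and not taken from the patent.

```python
from typing import List, Sequence


def frame_to_component(durations: Sequence[int]) -> List[int]:
    """Return, for each output frame, the index of its text component."""
    owners: List[int] = []
    for index, n_frames in enumerate(durations):
        # Every frame in this span belongs to the same text component.
        owners.extend([index] * n_frames)
    return owners


owners = frame_to_component([2, 1, 3])  # illustrative durations
print(owners)
```

The resulting list never decreases, which is what it means for the frame order to follow the text order in this simplified picture.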
9. The device of claim 8, wherein the phonetic text characters are phonemes.
10. The device of claim 8, wherein the phonetic text characters are characters.
This claim narrows the device of claim 8 by specifying that the phonetic text characters are characters, in contrast to claim 9, where they are phonemes. It mirrors method claim 3. The device is directed at text-to-speech: as the abstract summarizes, a text input is received, a temporal duration is determined for each text component using a duration model, spectra and a spectrogram frame are generated from those components and durations, and an audio waveform is provided as output. The claim does not describe capturing spoken language or converting speech into written text.
11. The device of claim 8, wherein the second set of spectra comprise mel-frequency cepstrum spectra.
13. The device of claim 8, wherein the determining of the respective temporal duration of each of the phonetic text characters is based on a ground truth duration of the phonetic text characters, wherein the ground truth duration of the phonetic text characters is determined using a hidden Markov Model forced alignment technique.
This claim narrows the device of claim 8 by specifying where the duration information comes from. Determining the temporal duration of each phonetic text character is based on a ground truth duration, and that ground truth duration is obtained with a hidden Markov model (HMM) forced alignment technique. Forced alignment takes a recording of speech together with its known phonetic transcription and finds the time span that each phonetic unit occupies in the audio. Those measured spans serve as the reference durations on which the determination of each character's duration is based. The aim is for each phonetic character in the synthesized speech to last about as long as it does in real speech, which addresses the unnatural rhythm that results when durations are poorly estimated. The claim mirrors method claim 6.
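Forced aligners typically report a start and end time for each phonetic unit. The sketch below shows one way such output could be turned into per-unit frame counts usable as ground truth durations. It does not perform HMM forced alignment itself; the interval format, the 10 ms hop, the phoneme labels, and the timings are all assumptions made for illustration.

```python
from typing import List, Tuple


def intervals_to_frame_durations(
    intervals: List[Tuple[str, float, float]],
    hop_seconds: float,
) -> List[Tuple[str, int]]:
    """Convert (label, start_s, end_s) intervals into (label, n_frames)."""
    result: List[Tuple[str, int]] = []
    for label, start, end in intervals:
        # Round each boundary to a frame index first, so adjacent units
        # share boundaries and the counts sum to the total frame count.
        start_frame = round(start / hop_seconds)
        end_frame = round(end / hop_seconds)
        result.append((label, end_frame - start_frame))
    return result


# Hypothetical aligner output for one short utterance (made-up timings).
alignment = [
    ("HH", 0.00, 0.05),
    ("AH", 0.05, 0.14),
    ("L", 0.14, 0.20),
    ("OW", 0.20, 0.35),
]
frame_durations = intervals_to_frame_durations(alignment, hop_seconds=0.01)
print(frame_durations)
```

Rounding the boundaries rather than the individual lengths is a deliberate choice here: it keeps the total number of frames consistent with the length of the recording.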
14. The device of claim 8, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.
16. The non-transitory computer-readable medium of claim 15, wherein the phonetic text characters are phonemes.
17. The non-transitory computer-readable medium of claim 15, wherein the phonetic text characters are characters.
This claim narrows the non-transitory computer-readable medium of claim 15 by specifying that the phonetic text characters are characters, in contrast to claim 16, where they are phonemes. It mirrors method claim 3 and device claim 10. The medium stores instructions for the text-to-speech process summarized in the abstract: receiving a text input, determining a temporal duration for each text component using a duration model, generating spectra and a spectrogram frame, and providing an audio waveform as output. The claim does not describe a user interface for entering or editing phonetic text, nor conversion between phonetic notation and standard text.
18. The non-transitory computer-readable medium of claim 15, wherein the second set of spectra comprise mel-frequency cepstrum spectra.
19. The non-transitory computer-readable medium of claim 15, wherein the second set of spectra includes a different number of spectra than as compared to the first set of spectra.
20. The non-transitory computer-readable medium of claim 15, wherein an alignment of frames in the spectrogram frame based on the second set of spectra replicates an alignment of the text input.
April 29, 2019
October 11, 2022