Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A method for processing speech splicing and synthesis, wherein the method comprises: expanding a speech library using a text for expansion and a corresponding synthesized speech which is obtained with a speech synthesis model and the text for expansion, wherein the speech library before the expansion comprises manually-collected original speeches along with corresponding original texts; and wherein the speech synthesis model is trained with the original speeches and the corresponding original texts in the speech library before the expansion; and using the expanded speech library to perform speech splicing and synthesis processing.
This invention relates to speech processing, specifically methods for expanding a speech library to improve speech splicing and synthesis. The problem addressed is the limited size and diversity of manually-collected speech libraries, which restricts the quality and flexibility of synthesized speech. The method involves expanding an existing speech library by adding synthesized speech samples. The original speech library contains manually-collected original speeches paired with their corresponding original texts. A speech synthesis model is trained using these original speeches and texts. To expand the library, new texts (referred to as "texts for expansion") are processed using the trained speech synthesis model to generate corresponding synthesized speech samples. These new text-synthesized speech pairs are then added to the original library, creating an expanded speech library. The expanded library, now containing both original and synthesized speech samples, is used for speech splicing and synthesis tasks. This approach enhances the library's size and diversity, improving the quality and naturalness of synthesized speech by leveraging both real and synthesized data. The method ensures that the synthesized speech is generated using a model trained on high-quality original data, maintaining consistency and coherence in the expanded library.
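The train, synthesize, add, and splice flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `train_synthesis_model`, `expand_library`, and `splice` are hypothetical, a waveform is represented as a plain list of floats, and the "model" is a toy stand-in that a real system would replace with a trained neural synthesizer.

```python
from typing import Callable, Dict, Iterable, List

Waveform = List[float]
SpeechLibrary = Dict[str, Waveform]  # text -> speech for that text


def train_synthesis_model(library: SpeechLibrary) -> Callable[[str], Waveform]:
    """Hypothetical stand-in for training on the original text/speech pairs."""
    # Toy "training": remember the average amplitude of the recorded speeches.
    samples = [s for wave in library.values() for s in wave]
    level = sum(samples) / len(samples) if samples else 0.0

    def synthesize(text: str) -> Waveform:
        # Toy "synthesis": one sample per character at the learned level.
        return [level] * len(text)

    return synthesize


def expand_library(
    library: SpeechLibrary,
    texts_for_expansion: Iterable[str],
    synthesize: Callable[[str], Waveform],
) -> SpeechLibrary:
    """Return a new library with synthesized speech added for each new text."""
    expanded = dict(library)  # leave the original library untouched
    for text in texts_for_expansion:
        # Assumption: never overwrite a manually collected recording.
        if text not in expanded:
            expanded[text] = synthesize(text)
    return expanded


def splice(library: SpeechLibrary, units: Iterable[str]) -> Waveform:
    """Concatenate the stored speech for each unit, in order."""
    result: Waveform = []
    for unit in units:
        result.extend(library[unit])  # raises KeyError if a unit is missing
    return result
```

For example, starting from `original = {"hello": [0.2, 0.4], "world": [0.6]}`, calling `expand_library(original, ["good morning"], train_synthesis_model(original))` yields a three-entry library in which the two recorded entries are unchanged and the third is model-generated.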
2. The method according to claim 1, wherein the expanding a speech library using a text for expansion and a corresponding synthesized speech comprises: updating the speech library by adding the text for expansion and corresponding synthesized speech into the speech library.
This invention relates to speech synthesis and speech library expansion. The problem addressed is the need to dynamically update a speech library with new text and corresponding synthesized speech to improve the accuracy and coverage of a speech synthesis system. The method involves expanding a speech library by adding new text and its corresponding synthesized speech. The speech library is updated by incorporating the new text and synthesized speech, allowing the system to generate more accurate and natural-sounding speech for previously unencountered text. This expansion process enhances the system's ability to handle diverse linguistic inputs, improving overall speech synthesis performance. The method ensures that the speech library remains current and comprehensive, adapting to new vocabulary and phrases as needed. By continuously updating the speech library, the system can provide better speech output for a wider range of applications, such as virtual assistants, text-to-speech systems, and automated customer service. The invention focuses on efficiently integrating new text and synthesized speech into the existing library to maintain high-quality speech synthesis.
3. The method according to claim 1, wherein the text for expansion is obtained by crawling the synthesized text from a network.
This claim narrows claim 1 by specifying where the text for expansion comes from: it is crawled from a network, such as the internet. The claim's phrase "the synthesized text" most plausibly refers to the text that is to be synthesized. Rather than relying on newly written scripts, the system gathers text that already exists online and feeds it to the trained speech synthesis model, which produces the corresponding synthesized speech. Each crawled text and its synthesized speech are then added to the speech library. Crawling gives access to a large, varied, and current pool of text, so the expanded library can cover words and phrasings that the manually recorded material does not. The claim does not specify how the crawled text is cleaned; in practice a crawler would typically filter and deduplicate it before synthesis.
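Turning crawled pages into candidate texts for expansion can be sketched as follows. This is a minimal illustration, not the patent's implementation: the claim does not say how pages are fetched or cleaned, so the function names, the sentence-splitting rule, and the `min_words` threshold are all assumptions. Fetching is left out; the input is HTML that has already been downloaded.

```python
import re
from html.parser import HTMLParser
from typing import Iterable, List


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def sentences_from_html(page: str) -> List[str]:
    """Strip tags, collapse whitespace, and split on sentence punctuation."""
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def collect_expansion_texts(pages: Iterable[str], min_words: int = 2) -> List[str]:
    """Filter out very short sentences and case-insensitive duplicates."""
    seen = set()
    result: List[str] = []
    for page in pages:
        for sentence in sentences_from_html(page):
            key = sentence.lower()
            if len(sentence.split()) >= min_words and key not in seen:
                seen.add(key)
                result.append(sentence)
    return result
```

This simple extractor also keeps the contents of `<script>` and `<style>` elements, so a production crawler would need stricter filtering than this sketch shows.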
4. The method according to claim 1, wherein the speech synthesis model employs a WaveNet model.
A speech synthesis system generates high-quality audio output from text input. Traditional text-to-speech (TTS) systems often produce unnatural or robotic-sounding speech due to limitations in modeling complex acoustic features. This invention addresses the problem by using a neural network-based speech synthesis model, specifically a WaveNet model, to improve the naturalness and intelligibility of synthesized speech. The WaveNet model is a deep generative model that synthesizes raw audio waveforms by predicting the probability distribution of each audio sample conditioned on previous samples. This approach captures fine-grained acoustic details, such as prosody and voice characteristics, resulting in more human-like speech. The system may also include preprocessing steps to convert input text into a sequence of linguistic features, which are then used to condition the WaveNet model during synthesis. The use of a WaveNet model allows for high-fidelity speech generation, making it suitable for applications like virtual assistants, audiobooks, and accessibility tools. The invention improves over prior systems by leveraging advanced neural architectures to enhance speech quality.
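The sample-by-sample prediction described above corresponds to the standard autoregressive factorization used by WaveNet-style models. In the notation below, which is introduced here for illustration and does not appear in the claims, x_1, ..., x_T are the audio samples of the waveform and h stands for the conditioning information derived from the input text:

```latex
p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h}\right)
```

Each factor is the model's predicted distribution for one sample given all earlier samples and the text-derived conditioning, so synthesis proceeds one sample at a time.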
5. The method according to claim 1, wherein the original speeches are obtained from a same specific person having voice features, and the synthesized speech which is obtained with the speech synthesis model and the text for expansion has the same voice features as the specific person.
This invention relates to speech synthesis technology, specifically addressing the challenge of generating synthesized speech that accurately preserves the unique voice characteristics of a specific individual. The method involves obtaining original speech samples from a particular person, where these samples exhibit distinct voice features such as tone, pitch, and timbre. A speech synthesis model is then trained or adapted using these samples to ensure that the synthesized speech retains the same voice features as the original speaker. The model is further used to generate expanded or modified speech from input text, ensuring the output speech maintains the original speaker's voice characteristics. This approach is particularly useful in applications requiring personalized voice synthesis, such as virtual assistants, audiobooks, or voice cloning, where maintaining the speaker's identity is critical. The method ensures consistency in voice quality, enhancing the naturalness and authenticity of the synthesized speech. By leveraging the original speech samples, the system avoids the limitations of generic voice synthesis models, which often produce speech lacking the unique nuances of a specific individual. The invention thus provides a solution for generating high-fidelity, personalized speech synthesis.
6. The method according to claim 5, wherein a plurality of texts for expansion and correspondingly a plurality of synthesized speeches which are obtained with the speech synthesis model and the plurality of texts are used to expand the speech library.
This claim builds on claim 5, in which all original speeches come from one specific person and the synthesized speech shares that person's voice features. It adds that the library is expanded with many texts rather than one: a plurality of texts for expansion is passed through the speech synthesis model, and each resulting synthesized speech is added to the speech library together with its text. Because every synthesized sample carries the same speaker's voice features, the expanded library stays consistent in voice while covering far more text. A larger library gives the splicing stage more candidate segments to choose from, which can reduce unnatural or repetitive output.
7. The method according to claim 4, wherein the speech synthesis model is trained by inputting an original text of the original texts to the WaveNet model, and adjusting parameters of the WaveNet model according to an output of the WaveNet model and an original speech of the original speeches corresponding to the original text inputted to the WaveNet model, so as to determine parameters of the WaveNet model.
This invention relates to training a WaveNet model for speech synthesis. The problem addressed is improving the accuracy and naturalness of synthesized speech by optimizing the parameters of a WaveNet model using paired text and speech data. The method involves inputting an original text into the WaveNet model and comparing the model's output speech with the corresponding original speech. The model's parameters are then adjusted based on this comparison to minimize the difference between the synthesized and original speech. This iterative training process refines the WaveNet model's ability to generate high-quality speech from text. The WaveNet model is a deep generative model designed to produce raw audio waveforms, and its parameters are optimized through this supervised learning approach. The invention focuses on enhancing the model's performance by leveraging paired text-speech datasets to fine-tune its parameters, resulting in more natural and accurate speech synthesis.
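The compare-and-adjust loop described above can be illustrated with a deliberately tiny stand-in model. This sketch is not WaveNet and not the patent's training procedure: it assumes one output sample per input character, one learned amplitude per character, and a squared-error comparison, all chosen only to make the parameter update visible. The names `train` and `synthesize` are hypothetical.

```python
from typing import Dict, List, Tuple


def train(
    pairs: List[Tuple[str, List[float]]],
    epochs: int = 200,
    learning_rate: float = 0.1,
) -> Dict[str, float]:
    """Fit one amplitude per character by gradient descent on squared error."""
    params: Dict[str, float] = {}
    for _ in range(epochs):
        for text, speech in pairs:
            if len(text) != len(speech):
                raise ValueError("toy model needs one sample per character")
            for ch, target in zip(text, speech):
                output = params.get(ch, 0.0)  # model output for this input
                error = output - target       # compare with original speech
                # Gradient of (output - target)^2 with respect to the parameter.
                params[ch] = output - learning_rate * 2.0 * error
    return params


def synthesize(params: Dict[str, float], text: str) -> List[float]:
    """Generate a toy waveform for new text from the learned parameters."""
    return [params.get(ch, 0.0) for ch in text]
```

Each update shrinks the gap between the model's output and the recorded sample, which is the essence of adjusting parameters "according to an output of the WaveNet model and an original speech." A real WaveNet has far more parameters and a different loss, but the loop has the same shape.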
8. The method according to claim 1, wherein the text for expansion is a text outside the speech library before the expansion.
This claim narrows claim 1 by requiring that the text for expansion be text that is not already in the speech library before the expansion. In other words, the synthesized speech added to the library corresponds to new text rather than to text that already has a manually collected recording. This ensures that the expansion actually broadens the library's coverage: each added pair of text and synthesized speech supplies material the original library lacked, so the splicing stage can handle text that would previously have had no matching segments.
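Claim 8's condition, that the text for expansion lies outside the existing library, amounts to a membership check before synthesis. The sketch below is one way to express it; the function names and the normalization rule (lowercasing and collapsing whitespace) are assumptions, since the claim does not define when two texts count as the same.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Assumed equivalence rule: ignore case and extra whitespace."""
    return " ".join(text.lower().split())


def texts_outside_library(
    candidates: Iterable[str], library_texts: Iterable[str]
) -> List[str]:
    """Keep only candidate texts that the library does not already contain."""
    known = {normalize(t) for t in library_texts}
    selected: List[str] = []
    for candidate in candidates:
        key = normalize(candidate)
        if key and key not in known:
            known.add(key)  # also drops repeats among the candidates
            selected.append(candidate)
    return selected
```

Only the texts this filter returns would be sent to the speech synthesis model, so every synthesized entry adds coverage the library did not have.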
9. A computer device, wherein the device comprises: one or more processors, a memory for storing one or more programs, the one or more programs, when executed by said one or more processors, enable said one or more processors to implement a method for processing speech splicing and synthesis, wherein the method comprises: expanding a speech library using a text for expansion and a corresponding synthesized speech which is obtained with speech synthesis model and the text for expansion; wherein the speech library before the expansion comprises manually-collected original speeches along with corresponding original texts; and wherein the speech synthesis model is trained with the original speeches and the corresponding original texts in the speech library before the expansion; and using the expanded speech library to perform speech splicing and synthesis processing.
This invention relates to speech processing, specifically improving speech synthesis and splicing by dynamically expanding a speech library. The problem addressed is the limited diversity and scalability of traditional speech libraries, which rely solely on manually collected speech samples, restricting the range of possible synthesized outputs. The system includes a computer device with processors and memory storing programs that implement speech splicing and synthesis. The method involves expanding a speech library by adding synthesized speech generated from a text expansion input. The speech synthesis model used for this expansion is pre-trained on the original speech samples and corresponding texts already present in the library. After expansion, the updated library is used for speech splicing and synthesis tasks, enhancing output quality and versatility. By integrating synthesized speech derived from text inputs, the system dynamically augments the library beyond manually collected samples, improving adaptability and reducing reliance on labor-intensive data collection. This approach enables more natural and varied speech synthesis while maintaining coherence in spliced outputs. The invention is particularly useful in applications requiring high-quality, scalable speech generation, such as virtual assistants, audiobooks, and real-time speech synthesis systems.
10. The computer device according to claim 9, wherein the original speeches are obtained from a same specific person having voice features, and the synthesized speech which is obtained with the speech synthesis model and the text for expansion has the same voice features as the specific person.
This claim applies a single-speaker limitation to the computer device of claim 9. All original speeches in the library are recorded from the same specific person, whose voice has its own features, such as tone, pitch, and timbre. Because the speech synthesis model is trained on that person's recordings and the corresponding texts, the speech it synthesizes from the text for expansion has the same voice features as that person. The synthesized entries added to the library therefore match the manually collected ones in voice, so speech spliced from the expanded library sounds like one consistent speaker rather than a mix of voices. This matters for applications such as personalized virtual assistants and audiobooks, where the voice must stay consistent.
11. The computer device according to claim 10, wherein a plurality of texts for expansion and correspondingly a plurality of synthesized speeches which are obtained with the speech synthesis model and the plurality of texts are used to expand the speech library.
This claim builds on the computer device of claim 10, in which the original speeches come from one specific person and the synthesized speech shares that person's voice features. It adds that the speech library is expanded with a plurality of texts and a corresponding plurality of synthesized speeches. The speech synthesis model processes each text for expansion to produce its synthesized speech, and the resulting pairs are added to the speech library. Expanding with many texts gives the library a broader range of speech samples in the same voice, which can reduce repetition and improve the naturalness of output in applications such as virtual assistants and audiobooks.
12. The computer device according to claim 9, wherein the speech synthesis model employs a WaveNet model.
This claim narrows the computer device of claim 9 by specifying that the speech synthesis model employs a WaveNet model. WaveNet is a deep generative model that produces raw audio waveforms by modeling the probability distribution of each audio sample given the samples before it, which lets it capture fine-grained detail such as pitch, tone, and prosody. In the claimed device, the model is trained on the manually collected original speeches and their texts, and it then generates the synthesized speech for each text for expansion. Using a model of this kind helps the synthesized entries sound natural, so that they blend with the recorded entries when the expanded library is used for speech splicing and synthesis.
13. The computer device according to claim 12, wherein the speech synthesis model is trained by inputting an original text of the original texts to the WaveNet model, and adjusting parameters of the WaveNet model according to an output of the WaveNet model and an original speech of the original speeches corresponding to the original text inputted to the WaveNet model, so as to determine parameters of the WaveNet model.
This invention relates to a computer device for training a speech synthesis model using a WaveNet model. The problem addressed is improving the accuracy and naturalness of synthesized speech by optimizing the training process of the WaveNet model, a deep generative model for raw audio waveforms. The computer device includes a WaveNet model configured to generate synthesized speech from input text. The training process involves inputting an original text into the WaveNet model, which generates an output waveform. The output is compared to a corresponding original speech sample, and the model's parameters are adjusted based on this comparison to minimize the difference between the synthesized and original speech. This iterative adjustment refines the model's parameters, enhancing its ability to produce high-quality speech. The device may also include a text processing module to preprocess input text before feeding it into the WaveNet model, ensuring consistency in the training data. Additionally, a speech processing module may preprocess the original speech samples to align them with the model's requirements. The training process may involve multiple iterations, with each iteration refining the model's parameters further. The invention improves speech synthesis by leveraging the WaveNet model's ability to generate high-fidelity audio waveforms, making it suitable for applications requiring natural-sounding synthetic speech, such as virtual assistants, audiobooks, and accessibility tools. The training method ensures the model learns from high-quality speech samples, resulting in more accurate and natural speech output.
14. The computer device according to claim 9, wherein the text for expansion is a text outside the speech library before the expansion.
This claim narrows the computer device of claim 9 by requiring that the text for expansion be text that is not already in the speech library before the expansion. The problem addressed is the limited text coverage of a manually collected library, which restricts what can be produced by splicing. Under this claim, the device synthesizes speech only for text the library lacks, using the speech synthesis model trained on the library's original speeches and texts, and adds each new pair of text and synthesized speech to the library. The expansion therefore fills gaps rather than duplicating existing entries, which lets the device handle text that the original library could not cover without recording new speech by hand.
15. A non-transitory computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements a method for processing speech splicing and synthesis, wherein the method comprises: expanding a speech library using a text for expansion and a corresponding synthesized speech which is obtained with a speech synthesis model and the text for expansion; wherein the speech library before the expansion comprises manually-collected original speeches along with corresponding original texts; and wherein the speech synthesis model is trained with the original speeches and the corresponding original texts in the speech library before the expansion; and using the expanded speech library to perform speech splicing and synthesis processing.
This invention relates to speech processing, specifically methods for improving speech splicing and synthesis by expanding a speech library with synthesized speech. The problem addressed is the limited size and diversity of manually-collected speech libraries, which restricts the quality and flexibility of speech synthesis and splicing applications. The method involves a speech library initially containing manually-collected original speeches paired with corresponding original texts. A speech synthesis model is trained using this original data. To expand the library, additional text for expansion is processed using the trained speech synthesis model to generate corresponding synthesized speech. This synthesized speech and its associated text are then added to the original speech library, creating an expanded version. The expanded library, now containing both original and synthesized speech, is used for subsequent speech splicing and synthesis tasks, improving output quality and versatility by leveraging both real and synthesized speech data. This approach enhances the robustness of speech processing systems by augmenting limited original datasets with model-generated content.
16. The non-transitory computer readable medium according to claim 15, wherein the original speeches are obtained from a same specific person having voice features, and the synthesized speech which is obtained with the speech synthesis model and the text for expansion has the same voice features as the specific person.
This claim applies a single-speaker limitation to the computer readable medium of claim 15. The original speeches in the library are obtained from the same specific person, and they exhibit that person's voice features, such as tone, pitch, and timbre. Because the speech synthesis model is trained on these recordings and their texts, the speech it synthesizes from the text for expansion retains the same voice features as the original speaker. The synthesized speech and its text are added to the library, so the expanded library remains consistent with the original speaker's voice. This consistency supports applications such as personalized voice assistants and audiobook narration, where output spliced from the library should sound like one person.
17. The non-transitory computer readable medium according to claim 16, wherein a plurality of texts for expansion and correspondingly a plurality of synthesized speeches which are obtained with the speech synthesis model and the plurality of texts are used to expand the speech library.
This claim builds on the computer readable medium of claim 16, in which the original speeches come from one specific person and the synthesized speech shares that person's voice features. It adds that a plurality of texts for expansion is processed by the speech synthesis model to generate a corresponding plurality of synthesized speeches. These synthesized speeches and their associated texts are then added to the speech library. Because all of the added samples carry the same speaker's voice features, the library grows in text coverage without losing consistency of voice. This approach reduces the need for extensive manual recording of new speech samples, making library expansion more efficient and scalable.
18. The non-transitory computer readable medium according to claim 15, wherein the speech synthesis model employs a WaveNet model.
This claim narrows the computer readable medium of claim 15 by specifying that the speech synthesis model employs a WaveNet model. WaveNet is a deep neural network that generates raw audio waveforms autoregressively, producing each audio sample from the samples that precede it. Under claim 15, the model is trained on the manually collected original speeches and their corresponding texts, and it then synthesizes speech for each text for expansion; those pairs are added to the speech library, which is used for speech splicing and synthesis. Choosing WaveNet for the synthesis step is aimed at making the synthesized entries sound natural enough to sit alongside the recorded ones.
19. The non-transitory computer readable medium according to claim 18, wherein the speech synthesis model is trained by inputting an original text of the original texts to the WaveNet model, and adjusting parameters of the WaveNet model according to an output of the WaveNet model and an original speech of the original speeches corresponding to the original text inputted to the WaveNet model, so as to determine parameters of the WaveNet model.
This claim narrows the computer readable medium of claim 18 by specifying how the WaveNet model is trained. An original text from the speech library is input to the WaveNet model, which generates an output. That output is compared with the original speech recorded for the same text, and the model's parameters are adjusted according to the comparison. Repeating this over the paired original texts and speeches determines the model's parameters. The trained model can then synthesize speech for new input texts, and because it was fitted to the library's own recordings, that speech resembles them. WaveNet suits this role because it models the fine-grained details of speech waveforms, which helps the synthesized speech retain natural prosody and intonation.
20. The non-transitory computer readable medium according to claim 15, wherein the text for expansion is a text outside the speech library before the expansion.
This claim narrows the computer readable medium of claim 15 by requiring that the text for expansion be text that is not already in the speech library before the expansion. The problem addressed is the limited coverage of a manually collected library, which can lead to unnatural speech when the text to be spoken includes words or phrases the library does not contain. Under this claim, the stored program synthesizes speech for text outside the library, using the speech synthesis model trained on the library's original speeches and texts, and adds the resulting text and speech to the library. Once expanded, the library allows speech to be spliced for text it previously could not support. The claim does not specify how texts are selected or prioritized for expansion.
October 13, 2020