This speech processing device is provided with: a contribution degree estimation means which calculates a contribution degree representing a quality of a segment of the speech signal; and a speaker feature calculation means which calculates a feature from the speech signal, for recognizing attribute information of the speech signal, using the contribution degree as a weight of the segment of the speech signal.
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A speech processing device, comprising: a processor; and memory storing executable instructions that, when executed by the processor, causes the processor to perform as: a contribution degree estimation unit configured to calculate a contribution degree representing a quality of a segment of a speech signal indicative of speech, the segment of the speech signal being divided into silence segments and speech segments, the quality in the speech segments being classified into a speech sound leading to a correct solution in speaker recognition and a speech sound causing an error in the speaker recognition; and a speaker feature calculation unit configured to calculate a speaker feature from the speech signal, for recognizing attribute information of the speech signal, using the contribution degree as a weight of the segment of the speech signal, the speaker feature being indicative of individuality for identifying a speaker which utters the speech.
This invention relates to speech processing for speaker recognition, addressing the challenge of accurately identifying speakers despite variations in speech quality. The device includes a processor and memory storing instructions for two key functions. First, a contribution degree estimation unit calculates a contribution degree for each segment of a speech signal, quantifying the segment's quality. Segments are either silence or speech, and speech segments are further classified into sounds that lead to correct speaker recognition and sounds that cause recognition errors. Second, a speaker feature calculation unit calculates a speaker feature from the speech signal, using the contribution degree as the weight of each segment. Because the contribution degree is a weight, reliable segments influence the speaker feature more strongly and unreliable segments less. The resulting speaker feature represents the individuality of the speaker and is used to recognize attribute information of the speech signal, such as who is speaking.
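The weighting idea in claim 1 can be illustrated with a minimal sketch. It assumes that each segment is already represented by a fixed-length feature vector and that the speaker feature is a contribution-weighted mean; the claim does not prescribe this pooling, and the function name `weighted_speaker_feature` is hypothetical.

```python
from typing import List, Sequence


def weighted_speaker_feature(
    frames: Sequence[Sequence[float]],
    contributions: Sequence[float],
) -> List[float]:
    """Contribution-weighted mean of per-segment feature vectors (illustrative only)."""
    if len(frames) != len(contributions):
        raise ValueError("one contribution degree is needed per segment")
    total = sum(contributions)
    if total <= 0.0:
        raise ValueError("at least one segment must have a positive weight")
    dim = len(frames[0])
    pooled = [0.0] * dim
    for vec, weight in zip(frames, contributions):
        for d in range(dim):
            pooled[d] += weight * vec[d]
    # Normalize by the total weight so the result is a weighted average.
    return [value / total for value in pooled]


# Three hypothetical segments: clean speech, silence, error-prone speech.
frames = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0]]
contributions = [1.0, 0.0, 0.5]
feature = weighted_speaker_feature(frames, contributions)
```

In this sketch the silence segment has weight 0.0 and so has no effect on the result, while the error-prone segment counts half as much as the clean one.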
2. The speech processing device as claimed in claim 1 , wherein the processor further performs as a speech statistic calculation unit configured to calculate a speech statistic representing a degree of appearance of each of types of sounds included in the speech signal, and wherein the speaker feature calculation unit is configured to calculate the speaker feature on the basis of the speech statistic of the speech signal and the contribution degree of the speech signal.
This claim adds a speech statistic to the device of claim 1. The processor also acts as a speech statistic calculation unit, which calculates a statistic representing the degree of appearance of each type of sound included in the speech signal, that is, how strongly each sound type is present. The speaker feature calculation unit then calculates the speaker feature from both the speech statistic and the contribution degree, which, as in claim 1, represents the quality of each segment of the speech signal. Combining the two means that segments are not treated equally: the statistics of each sound type are weighted by segment quality, so low-quality segments have less effect on the speaker feature. This is intended to make the feature more robust when speech content and recording conditions vary.
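One way to combine a speech statistic with the contribution degree is sketched below. It assumes that the speech statistic is a per-frame posterior probability over sound types (for example, from an acoustic model) and that the combination is a weighted accumulation of zeroth-order and first-order statistics. Both are assumptions made for illustration, and the function name `weighted_statistics` is hypothetical.

```python
from typing import List, Sequence, Tuple


def weighted_statistics(
    frames: Sequence[Sequence[float]],
    posteriors: Sequence[Sequence[float]],
    contributions: Sequence[float],
) -> Tuple[List[float], List[List[float]]]:
    """Accumulate per-sound-type statistics, scaling each frame by its contribution degree."""
    if not (len(frames) == len(posteriors) == len(contributions)):
        raise ValueError("frames, posteriors and contributions must align")
    num_types = len(posteriors[0])
    dim = len(frames[0])
    zeroth = [0.0] * num_types                       # weighted occupancy per sound type
    first = [[0.0] * dim for _ in range(num_types)]  # weighted feature sums per sound type
    for x, gamma, weight in zip(frames, posteriors, contributions):
        for c in range(num_types):
            occupancy = weight * gamma[c]
            zeroth[c] += occupancy
            for d in range(dim):
                first[c][d] += occupancy * x[d]
    return zeroth, first


# Two hypothetical frames, two sound types; the second frame is half as reliable.
frames = [[1.0, 0.0], [0.0, 2.0]]
posteriors = [[0.8, 0.2], [0.5, 0.5]]
contributions = [1.0, 0.5]
zeroth, first = weighted_statistics(frames, posteriors, contributions)
```

The resulting statistics could then feed a speaker feature extractor; that step is outside this sketch.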
3. The speech processing device as claimed in claim 1 , wherein the contribution degree estimation unit is configured to calculate, as the contribution degree of the speech signal, at least any one selected from the group consisting of: a probability representing the degree that the segment of the speech signal is the speech, calculated by a classifier which distinguishes speech signal from non-speech signal; a probability representing the degree that the segment of the speech signal leads to a correct determination in the speaker recognition, calculated by a classifier which distinguishes correctly recognized speech signal from the other speech signal; and a probability representing the degree that the segment of the speech signal causes an error in the speaker recognition, calculated by a classifier which distinguishes misrecognized speech signal from the other speech signal.
This invention relates to speech processing devices, specifically improving speaker recognition accuracy by evaluating the contribution of speech signal segments. The problem addressed is the variability in speech signals, where certain segments may degrade recognition performance due to noise, background interference, or other factors. The device includes a contribution degree estimation unit that calculates the reliability of speech segments for speaker recognition. This unit computes at least one of three probabilities: the likelihood that a segment is speech (distinguishing speech from non-speech), the likelihood that a segment leads to correct speaker recognition (distinguishing correctly recognized speech from other speech), or the likelihood that a segment causes recognition errors (distinguishing misrecognized speech from other speech). These probabilities are derived using classifiers trained to differentiate between these categories. By quantifying the contribution of each segment, the device can enhance speaker recognition accuracy by selectively weighting or filtering segments based on their reliability. This approach improves robustness in noisy environments and reduces false positives or negatives in speaker verification or identification tasks. The invention is particularly useful in applications requiring high-accuracy speaker recognition, such as security systems, voice assistants, or biometric authentication.
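As a rough illustration of the first option in claim 3, the sketch below uses a logistic classifier to turn a segment's features into a probability that the segment is speech. The weights, bias, and two-dimensional features are made-up values, not taken from the patent, and the function names are hypothetical. The same form could be trained on "correctly recognized versus other" or "misrecognized versus other" labels to produce the other two probabilities.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def speech_probability(
    features: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Probability that a segment is speech, under a hypothetical linear classifier."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical classifier parameters and two hypothetical segments.
weights = [2.0, -1.0]
bias = -0.5
p_loud = speech_probability([1.5, 0.5], weights, bias)   # linear score is 2.0
p_quiet = speech_probability([0.0, 1.0], weights, bias)  # linear score is -1.5
```

Either probability can be used directly as the contribution degree of its segment, since both lie between 0 and 1.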
4. The speech processing device as claimed in claim 3 , wherein the contribution degree estimation unit is configured to calculate the contribution degree of the speech signal by using a neural network.
This claim narrows claim 3 by specifying how the contribution degree is computed: the contribution degree estimation unit calculates it using a neural network. Under claim 3, the contribution degree is at least one of three probabilities: that a segment is speech, that it leads to a correct determination in speaker recognition, or that it causes an error in speaker recognition. In claim 4, a neural network serves as the classifier that produces these probabilities from the speech signal. The claim does not specify the network architecture, its input features, or its training procedure. The resulting contribution degree is used as in claim 1, as the weight of each segment when the speaker feature is calculated.
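A neural-network estimator of the contribution degree, as recited in claim 4, might look like the following minimal sketch. It assumes a single hidden layer and a three-way softmax over the classes non-speech, speech that helps recognition, and speech that causes errors, with the "helps" probability taken as the contribution degree. The architecture, the parameter values, and the function names are hypothetical; the claim fixes none of them.

```python
import math
from typing import Dict, List, Sequence


def dense(x: Sequence[float], weights: Sequence[Sequence[float]], biases: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(x: Sequence[float]) -> List[float]:
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(x: Sequence[float], params: Dict[str, list]) -> List[float]:
    """Return [P(non-speech), P(helps recognition), P(causes error)] for one segment."""
    hidden = relu(dense(x, params["w1"], params["b1"]))
    return softmax(dense(hidden, params["w2"], params["b2"]))


# Hypothetical, untrained parameters chosen only to make the example concrete.
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]],
    "b2": [0.0, 0.0, 0.0],
}
probs = class_probabilities([2.0, 0.0], params)
contribution = probs[1]  # probability that the segment helps speaker recognition
```

In practice the parameters would be learned from labeled segments; the values above only show the data flow from segment features to a probability used as a weight.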
5. The speech processing device as claimed in claim 1 , wherein the speaker feature calculation unit is configured to calculate an i-vector as the speaker feature.
This claim specifies the form of the speaker feature in the device of claim 1: the speaker feature calculation unit calculates an i-vector (identity vector). An i-vector is a widely used representation in speaker recognition that summarizes an utterance as a single low-dimensional vector. It is typically obtained by factor analysis over statistics collected with a Gaussian mixture model (GMM), which maps the utterance into a low-dimensional subspace capturing speaker and channel variability; separate compensation steps are then commonly applied to reduce channel and session effects. Compared with working directly on high-dimensional GMM-based representations, the i-vector is compact and convenient to compare. Typical i-vector pipelines also include a front end that converts raw audio into spectral features such as Mel-Frequency Cepstral Coefficients (MFCCs), but the claim does not recite these components. Because claim 5 depends on claim 1, the i-vector is calculated using the contribution degree as the weight of each segment of the speech signal.
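For orientation, the standard i-vector point estimate can be written with contribution-weighted statistics as follows. The placement of the weight inside the statistics is an assumption made for illustration, not a formula quoted from the patent. The symbols are also assumptions: x_t is the feature vector of frame t, alpha_t its contribution degree, gamma_t(c) the posterior of sound type (mixture component) c, m_c and Sigma the means and block-diagonal covariance of a universal background model, and T the total variability matrix.

```latex
% Contribution-weighted zeroth- and centered first-order statistics (assumed form)
N_c = \sum_{t} \alpha_t \, \gamma_t(c), \qquad
\tilde{F}_c = \sum_{t} \alpha_t \, \gamma_t(c) \, (x_t - m_c)

% Standard i-vector point estimate, with N = \mathrm{blockdiag}(N_1 I, \dots, N_C I)
% and \tilde{F} = [\tilde{F}_1^\top, \dots, \tilde{F}_C^\top]^\top
w = \left( I + T^\top \Sigma^{-1} N \, T \right)^{-1} T^\top \Sigma^{-1} \tilde{F}
```

Setting every alpha_t to 1 recovers the usual unweighted statistics, which makes clear that the contribution degree only rescales how much each frame counts.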
6. The speech processing device as claimed in claim 1 , wherein the processor further performs as an attribute recognition unit configured to recognize the attribute information on the basis of the speaker feature.
This claim adds a recognition stage to the device of claim 1. The processor also acts as an attribute recognition unit, which recognizes attribute information of the speech signal on the basis of the speaker feature. The speaker feature is the one calculated in claim 1, that is, a feature calculated with each segment weighted by its contribution degree. The claim does not specify the recognition technique; claim 7 lists the kinds of attribute information covered, namely the speaker, the language spoken, the emotion included in the speech, and the type of personality of the speaker. Because the speaker feature already gives less weight to low-quality segments, attribute recognition based on it is intended to be more reliable in applications such as voice authentication or speech analytics.
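Attribute recognition from a speaker feature can be sketched for the speaker identification case. The example assumes that recognition is done by cosine similarity against enrolled reference features, which is a common technique but is not stated in the claim. The enrolled names, vectors, and function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize_speaker(feature: Sequence[float], enrolled: Dict[str, Sequence[float]]) -> str:
    """Return the enrolled speaker whose reference feature is most similar to the input."""
    return max(enrolled, key=lambda name: cosine_similarity(feature, enrolled[name]))


# Hypothetical enrolled speakers and a hypothetical test feature.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
best = recognize_speaker([0.9, 0.2], enrolled)
```

Other attributes, such as language or emotion, would use the same pattern with class reference features or a trained classifier in place of the enrolled speakers.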
7. The speech processing device as claimed in claim 1 , wherein the attribute information of the speech signal comprises information indicative of at least any one selected from the group consisting of the speaker of the speech signal; a language spoken in the speech signal; an emotion included in the speech signal; and a type of personality of the speaker of the speech signal.
This claim specifies what the attribute information of the speech signal can be in the device of claim 1. The attribute information indicates at least one of the following: the speaker of the speech signal, the language spoken in the speech signal, an emotion included in the speech signal, and the type of personality of the speaker. In other words, the contribution-weighted speaker feature can serve speaker identification, language identification, emotion recognition, or personality type recognition. The claim does not specify how each attribute is recognized from the feature. Possible applications include voice authentication, customer service analytics, and personalized user interactions.
8. A speech processing method comprising: calculating a contribution degree representing a quality of a segment of a speech signal indicative of speech, the segment of the speech signal being divided into silence segments and speech segments, the quality in the speech segments being classified into a speech sound leading to a correct solution in speaker recognition and a speech sound causing an error in the speaker recognition; and calculating a speaker feature from the speech signal, for recognizing attribute information of the speech signal, using the contribution degree as a weight of the segment of the speech signal, the speaker feature being indicative of individuality for identifying a speaker which utters the speech.
This invention relates to speech processing for improving speaker recognition accuracy. The method addresses the problem of unreliable speaker identification due to variations in speech quality, where certain speech segments may lead to recognition errors while others contribute positively to accurate identification. The method involves analyzing a speech signal by dividing it into segments, classifying them as either silence or speech segments, and further categorizing the speech segments based on their quality. The quality classification distinguishes between speech sounds that lead to correct speaker recognition and those that cause errors. A contribution degree is calculated for each speech segment, representing its quality and reliability for speaker recognition. Using this contribution degree as a weighting factor, the method then extracts speaker features from the speech signal. These features represent the individual characteristics of the speaker's voice, which are used to identify the speaker. By weighting the segments based on their contribution degree, the method enhances the accuracy of speaker recognition by emphasizing high-quality speech segments and deemphasizing or excluding unreliable segments. This approach improves the robustness of speaker recognition systems in noisy or variable speech conditions.
9. The speech processing method as claimed in claim 8 , further comprising: calculating a speech statistic representing a degree of appearance of each of types of sounds included in the speech signal; and calculating the feature on the basis of the speech statistic of the speech signal and the contribution degree of the speech signal.
This claim adds a speech statistic to the method of claim 8. The method further calculates a speech statistic representing the degree of appearance of each type of sound included in the speech signal. The feature is then calculated on the basis of both this speech statistic and the contribution degree of the speech signal. The contribution degree is not predefined: as in claim 8, it is calculated from the speech signal and represents the quality of each segment. Combining the two lets the statistics of each sound type be weighted by segment quality, so low-quality segments have less effect on the resulting feature. This is the method counterpart of device claim 2.
10. A non-transitory computer readable recording medium for storing a speech processing program for causing a computer to execute: a process for calculating a contribution degree representing a quality of a segment of a speech signal indicative of speech, the segment of the speech signal being divided into silence segments and speech segments, the quality in the speech segments being classified into a speech sound leading to a correct solution in speaker recognition and a speech sound causing an error in the speaker recognition; and a process for calculating a speaker feature from the speech signal, for recognizing attribute information of the speech signal, using the contribution degree as a weight of the segment of the speech signal, the speaker feature being indicative of individuality for identifying a speaker which utters the speech.
This invention relates to speech processing for speaker recognition, addressing the challenge of accurately identifying speakers by distinguishing between speech segments that contribute positively to recognition and those that may introduce errors. The system processes a speech signal by dividing it into silence and speech segments, then evaluates the quality of each speech segment. The quality assessment classifies speech sounds into two categories: those that lead to correct speaker recognition and those that cause errors. A contribution degree is calculated for each segment, representing its reliability in speaker identification. This contribution degree is then used as a weighting factor when extracting speaker features from the speech signal. The speaker features, which represent the individuality of the speaker, are derived using the weighted segments to enhance recognition accuracy. By prioritizing high-quality speech segments and de-emphasizing unreliable ones, the system improves the robustness of speaker recognition systems, particularly in noisy or variable speech conditions. The invention is implemented as a computer program stored on a non-transitory medium, executing these processes to enhance speaker verification or identification performance.
March 7, 2017
February 15, 2022