10839810

Speaker Enrollment

Published: November 17, 2020
Assignee: not available in the USPTO data we have
Inventors: Rahim SAEIDI
Technical Abstract

Patent Claims
13 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A method of speaker modelling for a speaker recognition system, comprising: receiving a signal comprising a speaker's speech; and, for a plurality of frames of the signal: obtaining a spectrum of the speaker's speech; generating at least one modified spectrum, by applying effects related to a respective vocal effort, wherein the step of generating at least one modified spectrum comprises: determining a frequency and a bandwidth of at least one formant component of the speaker's speech; generating at least one modified formant component by modifying at least one of the frequency and the bandwidth of the or each formant component; and generating the modified spectrum from the or each modified formant component; and extracting features from the spectrum of the speaker's speech and the at least one modified spectrum; and forming at least one speech model based on the extracted features.

Plain English Translation

This invention relates to speaker recognition systems, specifically to making speaker models robust to variations in vocal effort. A speaker's vocal effort (for example, speaking softly or loudly) alters speech characteristics such as formant frequencies and bandwidths, which can lead to recognition errors. The method receives a speech signal and processes a plurality of its frames. For each such frame, the spectrum of the speaker's speech is obtained, and at least one modified spectrum is generated by applying effects related to a respective vocal effort. This involves determining the frequency and bandwidth of at least one formant component (a resonance of the vocal tract that appears as a spectral peak), modifying at least one of those two values to simulate a different vocal effort, and generating the modified spectrum from the modified formant components. Features are extracted from both the original and the modified spectra, and at least one speech model is formed from them. Because the model is built from spectra that simulate other vocal efforts as well as the one actually recorded, it is intended to be more robust to changes in how forcefully the speaker talks.
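The data flow of claim 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the formant-based modification is represented by caller-supplied placeholder functions (`modify_fns`), the feature extractor is also supplied by the caller, and the "speech model" is reduced to a per-dimension mean and variance as a stand-in for whatever model a real system would use.

```python
import cmath
import math
from statistics import fmean, pvariance


def magnitude_spectrum(frame):
    """Naive DFT magnitude for bins 0..N/2 (a real system would use an FFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def build_speaker_model(frames, modify_fns, feature_fn):
    """Extract features from each frame's spectrum and from every modified
    spectrum, then summarise all feature vectors as mean and variance."""
    feature_vectors = []
    for frame in frames:
        spectrum = magnitude_spectrum(frame)
        # One modified spectrum per placeholder "vocal effort" function.
        spectra = [spectrum] + [modify(spectrum) for modify in modify_fns]
        feature_vectors.extend(feature_fn(s) for s in spectra)
    columns = list(zip(*feature_vectors))
    return {
        "mean": [fmean(c) for c in columns],
        "var": [pvariance(c) for c in columns],
        "n_vectors": len(feature_vectors),
    }
```

Each frame contributes one feature vector for its original spectrum plus one per modified spectrum, so the model sees simulated vocal efforts alongside the effort actually recorded.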

Claim 2

Original Legal Text

2. A method according to claim 1 , comprising: obtaining the spectrum of the speaker's speech for a plurality of frames of the signal containing voiced speech.

Plain English Translation

Claim 2 narrows the method of claim 1 to voiced speech. The spectrum of the speaker's speech is obtained for a plurality of frames of the signal that contain voiced speech, that is, speech produced while the vocal cords vibrate, which gives a quasi-periodic waveform. Voiced frames show clear formant structure, which is what the formant analysis of claim 1 operates on. The claim does not specify how voiced frames are detected.
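Claim 2 does not say how voiced frames are selected. One common heuristic, shown here purely as an assumption, is to look for a strong peak in the normalized autocorrelation at a lag that corresponds to a plausible pitch. The function name, pitch range, and threshold below are hypothetical choices for illustration.

```python
import math
import random


def is_voiced(frame, sample_rate, f0_min=60.0, f0_max=400.0, threshold=0.5):
    """Return True if the frame is strongly periodic at a plausible pitch lag."""
    mean = sum(frame) / len(frame)
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return False  # silent frame
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(x) - 1)
    if lag_max < lag_min:
        return False  # frame too short to cover the pitch range
    # Largest autocorrelation value in the pitch range, relative to lag 0.
    best = max(
        sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
        for lag in range(lag_min, lag_max + 1)
    )
    return best >= threshold


# Example inputs: a 150 Hz tone (periodic) and uniform noise (not periodic).
fs = 8000
tone = [math.sin(2 * math.pi * 150 * n / fs) for n in range(240)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(240)]
```

Only frames for which `is_voiced` returns True would be passed on to the spectrum and formant steps of claim 1.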

Claim 3

Original Legal Text

3. A method according to claim 1 , comprising: obtaining the spectrum of the speaker's speech for a plurality of overlapping frames of the signal.

Plain English Translation

Claim 3 narrows the method of claim 1 by requiring that the spectrum of the speaker's speech is obtained for a plurality of overlapping frames of the signal. With overlapping frames, each frame shares some samples with its neighbor, so consecutive spectra change more gradually, and a speech event that straddles the boundary of one frame can still fall wholly inside an adjacent frame. The claim does not specify the amount of overlap.
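Splitting a signal into overlapping frames might look like the sketch below. The function name and the choice to drop a trailing partial frame are assumptions made for illustration; the claim itself fixes neither.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len samples, starting every hop samples.

    Frames overlap whenever hop < frame_len. A trailing partial frame is dropped.
    """
    if not 0 < hop <= frame_len:
        raise ValueError("hop must satisfy 0 < hop <= frame_len")
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]
```

With a hop of half the frame length, each frame shares its second half with the first half of the next frame.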

Claim 4

Original Legal Text

4. A method according to claim 1 , wherein each frame has a duration between 10 ms and 50 ms.

Plain English Translation

Claim 4 narrows the method of claim 1 by limiting the duration of each frame to between 10 ms and 50 ms. The frames are the short segments of the received speech signal for which a spectrum is obtained in claim 1; they are not data frames of a communication protocol. Frames of this length are short enough for the speech within a frame to be treated as approximately stationary, yet long enough to resolve its spectral structure, including the formants.
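The 10 ms to 50 ms limit of claim 4 translates into a frame length in samples once a sample rate is chosen. A small helper, with a hypothetical name and with the sample rate treated as an assumption (the claim does not state one), makes the arithmetic explicit:

```python
def frame_length_in_samples(sample_rate_hz, frame_ms):
    """Convert a frame duration in milliseconds to a whole number of samples,
    rejecting durations outside the 10-50 ms range recited in claim 4."""
    if not 10 <= frame_ms <= 50:
        raise ValueError("frame duration must be between 10 ms and 50 ms")
    return round(sample_rate_hz * frame_ms / 1000)
```

For instance, at an assumed 16 kHz sample rate a 25 ms frame is 16000 × 25 / 1000 = 400 samples long.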

Claim 5

Original Legal Text

5. A method according to claim 1 , comprising: generating a plurality of modified spectra, by applying effects related to respective vocal efforts.

Plain English Translation

Claim 5 narrows the method of claim 1 by generating a plurality of modified spectra rather than at least one. Each modified spectrum is produced by applying effects related to a respective vocal effort, so one analyzed frame can yield several spectra, each simulating a different level of vocal effort (for example, softer or louder speech than the speaker actually used). Features are then extracted from the original spectrum and from the modified spectra, as in claim 1, so the resulting speech model reflects several vocal efforts without requiring a separate recording at each one.
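One way to picture claim 5 is a table of vocal-effort profiles, each applied to the same set of formants to give one modified set per effort. The profile names and scale factors below are hypothetical placeholders, not values from the patent, and scaling every formant by the same factor is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formant:
    freq_hz: float
    bandwidth_hz: float


# Hypothetical (frequency scale, bandwidth scale) pairs, one per vocal effort.
EFFORT_PROFILES = {
    "soft": (0.95, 1.10),
    "loud": (1.10, 0.90),
}


def modified_formant_sets(formants, profiles=EFFORT_PROFILES):
    """Return one modified copy of the formant list per vocal-effort profile."""
    return {
        name: [
            Formant(f.freq_hz * freq_scale, f.bandwidth_hz * bw_scale)
            for f in formants
        ]
        for name, (freq_scale, bw_scale) in profiles.items()
    }
```

Each entry of the returned dictionary would then be turned into its own modified spectrum, giving the plurality of modified spectra that the claim recites.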

Claim 6

Original Legal Text

6. A method according to claim 1 , wherein the step of forming at least one speech model comprises forming a background model for the speaker recognition system, based in part on said speaker's speech.

Plain English Translation

Claim 6 narrows the method of claim 1 by specifying that forming at least one speech model comprises forming a background model for the speaker recognition system, based in part on the speaker's speech. In speaker recognition, a background model typically represents speech from a general population and serves as the reference against which a match to an individual speaker's model is scored. Under this claim, the features extracted from the speaker's original spectrum and from the vocal-effort-modified spectra contribute to that background model. The words "based in part" leave room for the background model to draw on other data as well, which the claim does not identify.

Claim 7

Original Legal Text

7. A method according to claim 1 , comprising determining a frequency and a bandwidth of a number of formant components of the speaker's speech in the range from 3-5.

Plain English Translation

Claim 7 narrows the method of claim 1 by fixing how many formant components are analyzed: the frequency and the bandwidth are determined for between three and five formant components of the speaker's speech. Formants are resonances of the vocal tract that appear as peaks in the speech spectrum. The frequency of a formant is the position of its peak, and the bandwidth is the width of that peak. The claim does not say how these values are estimated; linear predictive coding (LPC) is one common technique. The determined formant parameters are the ones modified in claim 1 to simulate a different vocal effort.
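If the formants are estimated with an all-pole (LPC-style) model, which is an assumption here since the claim names no technique, each complex pole of the model maps to a formant frequency and bandwidth through a standard relation: for a pole at radius r and angle θ and a sample rate fs, the frequency is θ·fs/(2π) and the bandwidth is −ln(r)·fs/π. The sketch below applies that relation and keeps at most three to five formants, following claim 7. Function names are hypothetical.

```python
import cmath
import math


def formant_to_pole(freq_hz, bandwidth_hz, sample_rate):
    """Pole (upper half-plane) of a resonator with the given frequency and bandwidth."""
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate)
    return cmath.rect(radius, 2 * math.pi * freq_hz / sample_rate)


def poles_to_formants(poles, sample_rate, max_formants=5):
    """Convert model poles to (frequency, bandwidth) pairs, lowest frequency first."""
    if not 3 <= max_formants <= 5:
        raise ValueError("claim 7 recites between 3 and 5 formant components")
    formants = []
    for pole in poles:
        if pole.imag <= 0:
            continue  # skip real poles and the lower half of each conjugate pair
        freq = cmath.phase(pole) * sample_rate / (2 * math.pi)
        bandwidth = -math.log(abs(pole)) * sample_rate / math.pi
        formants.append((freq, bandwidth))
    return sorted(formants)[:max_formants]
```

`formant_to_pole` is the inverse mapping; it is included so that modified formant values can be turned back into poles when a modified spectrum is generated.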

Claim 8

Original Legal Text

8. A method according to claim 1 , wherein generating modified formant components comprises: modifying the frequency and the bandwidth of the or each formant component.

Plain English Translation

Claim 8 narrows the method of claim 1. Claim 1 requires modifying at least one of the frequency and the bandwidth of each formant component; claim 8 requires modifying both. Changing the frequency shifts the position of the formant peak in the spectrum, and changing the bandwidth makes the peak narrower or wider. The modified spectrum is then generated from the modified formant components, as in claim 1, so that it simulates the speaker's speech at a different vocal effort.
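The effect of modifying both parameters can be seen by regenerating a spectrum from the modified formants. The sketch below assumes an all-pole model built from cascaded two-pole resonators, which is one possible way to generate a spectrum from formant values and is not stated in the claim. The function names and the scale factors used in the example are hypothetical.

```python
import cmath
import math


def modify_formants(formants, freq_scale, bw_scale):
    """Scale both the frequency and the bandwidth of every (freq, bw) pair."""
    return [(freq * freq_scale, bw * bw_scale) for freq, bw in formants]


def all_pole_spectrum(formants, sample_rate, n_bins=257):
    """Magnitude response, from 0 Hz to Nyquist, of cascaded two-pole resonators."""
    spectrum = []
    for k in range(n_bins):
        omega = math.pi * k / (n_bins - 1)
        z_inv = cmath.exp(-1j * omega)
        denom = 1.0 + 0j
        for freq, bw in formants:
            radius = math.exp(-math.pi * bw / sample_rate)
            pole = cmath.rect(radius, 2 * math.pi * freq / sample_rate)
            # Each formant contributes a conjugate pair of poles.
            denom *= (1 - pole * z_inv) * (1 - pole.conjugate() * z_inv)
        spectrum.append(1.0 / abs(denom))
    return spectrum


# Example: one formant, then the same formant raised in frequency and widened.
fs = 16000
original = [(500.0, 80.0)]
modified = modify_formants(original, freq_scale=1.1, bw_scale=1.2)
spec_original = all_pole_spectrum(original, fs, n_bins=801)  # 10 Hz per bin
spec_modified = all_pole_spectrum(modified, fs, n_bins=801)
```

In this example the spectral peak moves from about 500 Hz to about 550 Hz and becomes lower and broader, which is the kind of change the modified spectrum is meant to carry into feature extraction.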

Claim 9

Original Legal Text

9. A method according to claim 1 , wherein the features extracted from the spectrum of the user's speech comprise Mel Frequency Cepstral Coefficients.

Plain English Translation

Claim 9 narrows the method of claim 1 by specifying that the features extracted from the spectrum comprise Mel Frequency Cepstral Coefficients (MFCCs). The claim refers to "the user's speech", which corresponds to the speaker's speech of claim 1. MFCCs are a widely used feature in speech and speaker recognition. They are typically computed by passing the spectrum through a bank of filters spaced on the mel scale, which approximates the frequency resolution of human hearing, taking the logarithm of the filter outputs, and applying a discrete cosine transform. The resulting coefficients give a compact description of the spectral envelope.
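A textbook-style MFCC computation from a power spectrum is sketched below. The filter count, number of coefficients, mel formula, and unnormalized DCT are common conventions chosen here as assumptions; the claim does not specify any of them, and production code would normally rely on an established signal-processing library.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters evenly spaced on the mel scale over bins 0..Nyquist."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # Filter edges as fractional bin positions.
    edges = [
        mel_to_hz(top * i / (n_filters + 1)) / nyquist * (n_bins - 1)
        for i in range(n_filters + 2)
    ]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(values, n_out):
    """Unnormalized DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_out)
    ]


def mfcc(power_spectrum, sample_rate, n_filters=20, n_ceps=13):
    """MFCCs of one frame: mel filterbank energies, then log, then DCT."""
    bank = mel_filterbank(n_filters, len(power_spectrum), sample_rate)
    log_energies = [
        math.log(sum(w * p for w, p in zip(row, power_spectrum)) + 1e-10)
        for row in bank
    ]
    return dct_ii(log_energies, n_ceps)
```

In the method of claim 1, a function like `mfcc` would be applied to the original spectrum and to each modified spectrum of a frame. A useful property of this formulation is that scaling the whole spectrum by a constant changes only the first coefficient.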

Claim 10

Original Legal Text

10. A method according to claim 1 , wherein the step of forming at least one speech model comprises forming a model of the speaker's speech.

Plain English Translation

Claim 10 narrows the method of claim 1 by specifying that forming at least one speech model comprises forming a model of the speaker's speech. The model is built from the features extracted from the original spectrum and from the vocal-effort-modified spectra, so it represents the individual speaker's voice at more than one vocal effort. In a speaker recognition system, a model of this kind is what later speech is compared against to verify or identify the speaker.

Claim 11

Original Legal Text

11. A method according to claim 10 , wherein the method is performed on enrolling the speaker in the speaker recognition system.

Plain English Translation

Claim 11 narrows the method of claim 10 by specifying when it is performed: on enrolling the speaker in the speaker recognition system. Enrollment is the stage at which the system first obtains a speaker's speech and stores a model of it for later recognition. Under this claim, the steps of claims 1 and 10 (obtaining spectra, generating vocal-effort-modified spectra from modified formant components, extracting features, and forming the model of the speaker's speech) are carried out on the enrollment speech. The enrolled model therefore already reflects vocal efforts other than the one the speaker happened to use during enrollment.

Claim 12

Original Legal Text

12. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method comprising: receiving a signal comprising a speaker's speech; and for a plurality of frames of the signal: obtaining a spectrum of the speaker's speech; generating at least one modified spectrum, by applying effects related to a respective vocal effort, wherein the step of generating at least one modified spectrum comprises: determining a frequency and a bandwidth of at least one formant component of the speaker's speech; generating at least one modified formant component by modifying at least one of the frequency and the bandwidth of the or each formant component; and generating the modified spectrum from the or each modified formant component; extracting features from the spectrum of the speaker's speech and the at least one modified spectrum; and further comprising: forming at least one speech model based on the extracted features.

Plain English Translation

Claim 12 covers a non-transitory computer readable storage medium storing instructions that, when executed by processor circuitry, cause it to perform the same method as claim 1. The method receives a signal comprising a speaker's speech and, for a plurality of frames of the signal, obtains the spectrum of the speech and generates at least one modified spectrum by applying effects related to a respective vocal effort. To do so, it determines the frequency and bandwidth of at least one formant component (a resonance of the vocal tract), modifies at least one of those two values, and generates the modified spectrum from the modified formant components. Features are extracted from both the original and the modified spectra, and at least one speech model is formed from the extracted features.

Claim 13

Original Legal Text

13. A system for speaker modelling, the system comprising: an input, for receiving a signal comprising a speaker's speech; and, a processor, configured for, for a plurality of frames of the signal: obtaining a spectrum of the speaker's speech; generating at least one modified spectrum, by applying effects related to a respective vocal effort, wherein the step of generating at least one modified spectrum comprises: determining a frequency and a bandwidth of at least one formant component of the speaker's speech; generating at least one modified formant component by modifying at least one of the frequency and the bandwidth of the or each formant component; and generating the modified spectrum from the or each modified formant component; extracting features from the spectrum of the speaker's speech and the at least one modified spectrum; and forming at least one speech model based on the extracted features.

Plain English Translation

Claim 13 covers a system for speaker modelling that comprises an input for receiving a signal comprising a speaker's speech and a processor configured to carry out the steps of claim 1. For a plurality of frames of the signal, the processor obtains the spectrum of the speaker's speech and generates at least one modified spectrum by applying effects related to a respective vocal effort. It does this by determining the frequency and bandwidth of at least one formant component (a resonance of the vocal tract), modifying at least one of those two values to simulate a different vocal effort, such as louder or softer speech, and generating the modified spectrum from the modified formant components. The processor then extracts features from the original and modified spectra and forms at least one speech model from them, so that the model represents the speaker's voice under more than one vocal effort.

Patent Metadata

Filing Date

Unknown



Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “SPEAKER ENROLLMENT” (10839810). https://patentable.app/patents/10839810
