9842611

Estimating Pitch Using Peak-To-Peak Distances

Published: December 12, 2017
Assignee: not available in USPTO data we have

Patent Claims
18 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A computer-implemented method for automatic speaker recognition, the method comprising: obtaining a first portion of a speech signal; computing, using one or more processing devices, a first frequency representation of the first portion of the speech signal; obtaining a first threshold; identifying a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold; computing, using the one or more processing devices, a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks; obtaining a second threshold; identifying a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold; computing, using the one or more processing devices, a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks; computing, using the one or more processing devices, a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances; obtaining a second portion of the speech signal; computing, using the one or more processing devices, a second frequency representation of the second portion of the speech signal; identifying a third plurality of peaks in the second frequency representation; computing, using the one or more processing devices, a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks; computing, using the one or more processing devices, a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances; generating, using the one or more processing devices, a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and applying the sequence of pitch estimates to recognize a speaker as a source of the speech signal.

Plain English Translation

A computer-implemented method for automatic speaker recognition estimates pitch from a speech signal and uses it to identify the speaker. The method takes a first portion of the signal, computes its frequency representation, and identifies peaks twice: once as the values above a first threshold and once as the values above a second threshold. For each set of peaks it computes the peak-to-peak distances, that is, the spacings in frequency between peaks, and it uses both sets of distances to compute a pitch estimate for that portion. A second portion is processed in a similar way (the claim requires only one set of peaks there) to give a second pitch estimate. The pitch estimates form a sequence, which is then used to recognize the speaker.
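The steps of this claim can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the Hann-windowed DFT magnitude used as the frequency representation, the 5 Hz frequency grid, the thresholds expressed as fractions of the spectral maximum, and the median used to combine the distances are all choices made here for illustration, and every function name is hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame, sample_rate, max_hz=1200.0, step_hz=5.0):
    """Hann-windowed DFT magnitude on a uniform grid (assumed representation)."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    freqs = [k * step_hz for k in range(1, int(max_hz / step_hz) + 1)]
    mags = []
    for f in freqs:
        acc = 0j
        for i in range(n):
            acc += window[i] * frame[i] * cmath.exp(-2j * math.pi * f * i / sample_rate)
        mags.append(abs(acc))
    return freqs, mags


def peak_locations(freqs, values, threshold):
    """Frequencies of local maxima whose value is larger than the threshold."""
    return [freqs[i] for i in range(1, len(values) - 1)
            if values[i] > threshold
            and values[i] > values[i - 1] and values[i] >= values[i + 1]]


def peak_to_peak_distances(locations):
    """Spacing in frequency between each pair of neighbouring peaks."""
    return [b - a for a, b in zip(locations, locations[1:])]


def estimate_pitch(frame, sample_rate, threshold_fractions=(0.25, 0.55)):
    """Pool the distances found with each threshold and return their median."""
    freqs, mags = magnitude_spectrum(frame, sample_rate)
    top = max(mags)
    pooled = []
    for fraction in threshold_fractions:  # one pass per threshold (assumed values)
        peaks = peak_locations(freqs, mags, fraction * top)
        pooled.extend(peak_to_peak_distances(peaks))
    if not pooled:
        return None
    pooled.sort()
    return pooled[len(pooled) // 2]  # the median is an assumption, not from the claim


def pitch_track(signal, sample_rate, frame_len):
    """Sequence of pitch estimates, one per non-overlapping portion."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    return [estimate_pitch(signal[s:s + frame_len], sample_rate) for s in starts]
```

For a voiced sound the spectral peaks sit near multiples of the fundamental frequency, so the spacing between neighbouring peaks approximates the pitch. How the resulting sequence is applied to speaker recognition is not covered by this sketch.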

Claim 2

Original Legal Text

2. The method of claim 1, wherein computing the first pitch estimate of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.

Plain English Translation

The method for automatic speaker recognition described in claim 1 specifies how the first pitch estimate is computed. The method estimates a cumulative distribution function (CDF) of the combined set of peak-to-peak distances obtained using the two thresholds, and the pitch estimate for that portion of the speech signal is derived from this CDF.
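One way to picture this is an empirical CDF over the pooled distances. The sketch below is illustrative only: the claim does not say how the CDF is turned into a pitch value, so reading off the 0.5 quantile (the median) is an assumption, and the function names are hypothetical.

```python
def empirical_cdf(samples):
    """Return (value, cumulative probability) pairs for the sorted samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(value, (rank + 1) / n) for rank, value in enumerate(ordered)]


def pitch_from_cdf(first_distances, second_distances, quantile=0.5):
    """Smallest distance at which the empirical CDF reaches the quantile."""
    pooled = list(first_distances) + list(second_distances)
    for value, probability in empirical_cdf(pooled):
        if probability >= quantile:
            return value
    return None  # no distances were supplied
```

A quantile is less sensitive than a mean to outliers such as a doubled distance caused by a missed peak.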

Claim 3

Original Legal Text

3. The method of claim 1, further comprising computing a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and wherein computing the first pitch estimate of the first portion of the speech signal comprises computing the first pitch estimate using the histogram.

Plain English Translation

The method for automatic speaker recognition described in claim 1 enhances pitch estimation by creating a histogram of peak-to-peak distances. It computes a histogram using the combined peak-to-peak distances, collected using the two thresholds, in the frequency representation. This histogram is then used to compute the pitch estimate for the analyzed portion of the speech signal, potentially by identifying the most frequent distance.
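A histogram-based estimate might look like the following sketch. The 5 Hz bin width, the tie-breaking rule, and the use of the bin centre as the estimate are assumptions made for illustration; the claim only requires that the histogram be used.

```python
from collections import Counter


def pitch_from_histogram(first_distances, second_distances, bin_width_hz=5.0):
    """Centre of the most populated histogram bin of the pooled distances."""
    pooled = list(first_distances) + list(second_distances)
    if not pooled:
        return None
    # Map each distance to the index of the bin it falls into.
    counts = Counter(int(d // bin_width_hz) for d in pooled)
    # Highest count wins; ties go to the lower bin (an arbitrary choice).
    best_bin, _ = max(counts.items(), key=lambda item: (item[1], -item[0]))
    return (best_bin + 0.5) * bin_width_hz
```

Binning lets slightly different distances, for example 151 Hz and 153 Hz, vote for the same candidate pitch.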

Claim 4

Original Legal Text

4. The method of claim 1, wherein the first frequency representation is computed using an estimated fractional chirp rate of the first portion of the speech signal.

Plain English Translation

In the method for automatic speaker recognition described in claim 1, the frequency representation of the speech signal is calculated using an estimated fractional chirp rate. The chirp rate describes how quickly a frequency changes over time, and the fractional chirp rate is usually understood as the chirp rate divided by the frequency. Taking it into account when converting the speech signal into its frequency domain representation potentially improves accuracy for non-stationary signals, such as speech whose pitch rises or falls within the analyzed portion.
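As a rough sketch of the idea, the value of a frequency representation at one frequency can be computed by correlating the portion with a complex exponential whose frequency drifts at the estimated rate. The definition used here (fractional chirp rate equals chirp rate divided by frequency, so the instantaneous frequency is f(1 + χt)) and the function name are assumptions for illustration, not details taken from the claim.

```python
import cmath
import math


def chirped_spectrum_value(frame, sample_rate, freq_hz, frac_chirp_rate):
    """Magnitude of the correlation between the frame and a chirped exponential.

    Assumed model: instantaneous frequency freq_hz * (1 + frac_chirp_rate * t),
    with time t measured from the centre of the frame.
    """
    n = len(frame)
    acc = 0j
    for i, sample in enumerate(frame):
        t = (i - (n - 1) / 2) / sample_rate
        phase = 2 * math.pi * freq_hz * (t + 0.5 * frac_chirp_rate * t * t)
        acc += sample * cmath.exp(-1j * phase)
    return abs(acc)
```

With `frac_chirp_rate=0.0` this reduces to an ordinary DFT value at `freq_hz`. When the signal's frequency really does drift, the matched chirp rate concentrates more energy at the peak than the zero-chirp version does.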

Claim 5

Original Legal Text

5. The method of claim 1, wherein computing the first frequency representation comprises using a first smoothing kernel.

Plain English Translation

In the method for automatic speaker recognition described in claim 1, the process of computing the frequency representation involves applying a smoothing kernel. This smoothing kernel is used to reduce noise and improve the clarity of the frequency representation before peak detection, potentially by averaging frequency values within a small range.
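A smoothing step of this kind could be sketched as a convolution of the spectrum with a small normalized kernel. The Gaussian shape, the three-sigma radius, and the edge clamping are assumptions chosen for illustration; the claim does not specify the kernel.

```python
import math


def gaussian_kernel(sigma_bins):
    """Normalized Gaussian weights covering about three standard deviations."""
    radius = max(1, int(3 * sigma_bins))
    weights = [math.exp(-0.5 * (k / sigma_bins) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, kernel):
    """Convolve values with a symmetric kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed
```

A wider kernel suppresses more noise but can also merge closely spaced peaks, so the kernel width trades noise robustness against frequency resolution.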

Claim 6

Original Legal Text

6. The method of claim 1, wherein the first frequency representation comprises a log likelihood ratio (LLR) spectrum.

Plain English Translation

In the method for automatic speaker recognition described in claim 1, the frequency representation that is calculated and analyzed is a log-likelihood ratio (LLR) spectrum. This choice of frequency representation may improve the system's ability to differentiate between different speakers or to reduce the impact of noise in the input signal.

Claim 7

Original Legal Text

7. The method of claim 1, wherein the first frequency representation comprises a stationary spectrum.

Plain English Translation

In the method for automatic speaker recognition described in claim 1, the frequency representation that is calculated is a stationary spectrum. This suggests that the method assumes the speech signal is relatively stable within the analyzed portion, allowing for simpler frequency analysis techniques.

Claim 8

Original Legal Text

8. A system for automatic speech recognition, the system comprising one or more computing devices comprising at least one processor and at least one memory, the one or more computing devices configured to: obtain a first portion of a speech signal; compute a first frequency representation of the first portion of the speech signal; obtain a first threshold; identify a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold; compute a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks; obtain a second threshold; identify a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold; compute a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks; compute a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances; obtain a second portion of the speech signal; compute a second frequency representation of the second portion of the speech signal; identify a third plurality of peaks in the second frequency representation; compute a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks; compute a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances; generate a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and apply the sequence of pitch estimates to perform automatic speech recognition on the speech signal.

Plain English Translation

A system for automatic speech recognition includes one or more computing devices, with at least one processor and at least one memory, configured to estimate pitch from a speech signal. The system takes a first portion of the signal, computes its frequency representation, and identifies peaks twice, using a first and a second threshold. It computes peak-to-peak distances for each set of peaks and uses both sets of distances to compute a pitch estimate for that portion. A second portion is processed in a similar way to give a second pitch estimate, and the estimates are combined into a sequence. Unlike claim 1, which recognizes the speaker, this claim applies the sequence to perform automatic speech recognition on the speech signal.

Claim 9

Original Legal Text

9. The system of claim 8, wherein the one or more computing devices are further configured to compute the first pitch estimate of the first portion by estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.

Plain English Translation

The automatic speech recognition system described in claim 8 computes the first pitch estimate by estimating a cumulative distribution function (CDF) of the combined set of peak-to-peak distances obtained using the two thresholds. The pitch estimate for that portion of the speech signal is then derived from this CDF.

Claim 10

Original Legal Text

10. The system of claim 8, wherein the one or more computing devices are further configured to compute a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and generate the first pitch estimate of the first portion speech-signal using the histogram.

Plain English Translation

The automatic speech recognition system described in claim 8 computes a histogram from the combined peak-to-peak distances collected using the two thresholds. This histogram is then used to generate the pitch estimate for the first portion, potentially by identifying the most frequent distance.

Claim 11

Original Legal Text

11. The system of claim 8, wherein the one or more computing devices are further configured to compute the first frequency representation using a first smoothing kernel.

Plain English Translation

In the automatic speech recognition system described in claim 8, the frequency representation is computed by applying a smoothing kernel. This smoothing reduces noise and improves the clarity of the frequency representation prior to peak detection, potentially by averaging frequency values within a small range.

Claim 12

Original Legal Text

12. The system of claim 8, wherein the first frequency representation comprises a log-likelihood ratio (LLR) spectrum.

Plain English Translation

In the automatic speech recognition system described in claim 8, the frequency representation calculated and analyzed is a log-likelihood ratio (LLR) spectrum. This representation may reduce the impact of noise in the input signal.

Claim 13

Original Legal Text

13. The system of claim 8, wherein the one or more computing devices are further configured to: compute the first pitch estimate of the first portion of the speech signal by identifying a most frequently occurring peak-to-peak distance from the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.

Plain English Translation

The automatic speech recognition system described in claim 8 calculates the pitch estimate by identifying the most frequently occurring peak-to-peak distance in the combined set of distances generated using the two thresholds. This most frequent distance is then used as the basis for the pitch estimate.

Claim 14

Original Legal Text

14. The system of claim 11, wherein the one or more computing devices are further configured to: compute a third frequency representation of the first portion of the speech signal using a second smoothing kernel; identify a fourth plurality of peaks in the third frequency representation; compute a fourth plurality of peak-to-peak distances using locations in frequency of the fourth plurality of peaks; and compute a third pitch estimate of the first portion of the speech signal using the fourth plurality of peak-to-peak distances.

Plain English Translation

The automatic speech recognition system of claim 11, which computes the first frequency representation with a first smoothing kernel, additionally computes a third frequency representation of the same first portion of the speech signal using a second smoothing kernel. The system identifies a fourth set of peaks in this third representation, computes their peak-to-peak distances, and derives a third pitch estimate from those distances. Analyzing the same portion with more than one smoothing kernel potentially improves overall pitch estimation accuracy.
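The idea of analyzing one portion with two kernels can be sketched as follows. The moving-average kernels of width 3 and 9, the fixed threshold, the median, and the synthetic spectrum are all hypothetical choices for illustration; the claim does not specify the kernels or how the resulting estimates are used together.

```python
import math


def box_smooth(values, width):
    """Moving average with an odd window width; indices are clamped at the edges."""
    half = width // 2
    last = len(values) - 1
    return [sum(values[min(max(j, 0), last)] for j in range(i - half, i + half + 1)) / width
            for i in range(len(values))]


def peak_bins(values, threshold):
    """Indices of local maxima whose value is larger than the threshold."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > threshold
            and values[i] > values[i - 1] and values[i] >= values[i + 1]]


def pitch_for_kernel(spectrum, bin_hz, width, threshold):
    """Smooth, find peaks, and return the median peak-to-peak distance in Hz."""
    peaks = peak_bins(box_smooth(spectrum, width), threshold)
    distances = sorted((b - a) * bin_hz for a, b in zip(peaks, peaks[1:]))
    return distances[len(distances) // 2] if distances else None


# Synthetic spectrum with a bump every 30 bins (made-up data for illustration).
spectrum = [sum(math.exp(-0.5 * ((i - c) / 4.0) ** 2) for c in (30, 60, 90, 120))
            for i in range(150)]
# One pitch estimate per smoothing kernel, both for the same portion.
estimates = [pitch_for_kernel(spectrum, bin_hz=5.0, width=w, threshold=0.3)
             for w in (3, 9)]
```

Agreement between the per-kernel estimates could serve as a consistency check, although that use is a suggestion rather than something stated in the claim.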

Claim 15

Original Legal Text

15. One or more non-transitory computer-readable media comprising computer executable instructions that, when executed, cause at least one processor to perform actions comprising: obtaining a first portion of a speech signal; computing a first frequency representation of the first portion of the speech signal; obtaining a first threshold; identifying a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold; computing a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks; obtaining a second threshold; identifying a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold; computing a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks; computing a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances; obtaining a second portion of the speech signal; computing a second frequency representation of the second portion of the speech signal; identifying a third plurality of peaks in the second frequency representation; computing a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks; computing a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances; generating a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and applying the sequence of pitch estimates to recognize a speaker as a source of the speech signal.

Plain English Translation

Non-transitory computer-readable media store instructions that, when executed, perform automatic speaker recognition by estimating pitch from a speech signal. The instructions cause a processor to take a first portion of the signal, compute its frequency representation, and identify peaks twice, using a first and a second threshold. Peak-to-peak distances are computed for each set of peaks and used together to estimate the pitch of that portion. A second portion is processed in a similar way to give a second pitch estimate, and the estimates are combined into a sequence. The sequence is then used to recognize the speaker.

Claim 16

Original Legal Text

16. The one or more non-transitory computer-readable media of claim 15, wherein computing the first pitch estimate of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.

Plain English Translation

The computer-readable media of claim 15 compute the first pitch estimate by estimating a cumulative distribution function (CDF) of the combined set of peak-to-peak distances obtained using the two thresholds. The pitch estimate for that portion of the speech signal is then derived from this CDF.

Claim 17

Original Legal Text

17. The one or more non-transitory computer-readable media of claim 15, further comprising computing a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and wherein computing the first pitch estimate of the first portion of the speech signal comprises computing the first pitch estimate using the histogram.

Plain English Translation

The computer-readable media of claim 15 compute a histogram from the combined peak-to-peak distances collected using the two thresholds. This histogram is then used to compute the pitch estimate for the analyzed portion of the speech signal.

Claim 18

Original Legal Text

18. The one or more non-transitory computer-readable media of claim 15, wherein the first frequency representation comprises a log-likelihood ratio (LLR) spectrum.

Plain English Translation

The computer-readable media of claim 15 use a log-likelihood ratio (LLR) spectrum as the frequency representation. This choice of frequency representation may improve the system's ability to differentiate between different speakers or reduce the impact of noise in the input signal.

Patent Metadata

Filing Date

Unknown

Publication Date

December 12, 2017

Inventors

David C. Bradley
Yao Huang Morin
Ellisha Marongelli


Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “ESTIMATING PITCH USING PEAK-TO-PEAK DISTANCES” (9842611). https://patentable.app/patents/9842611
