Estimating Pitch Using Peak-To-Peak Distances

Published: December 12, 2017

Assignee: not available in USPTO data we have

Inventors: David C. Bradley, Yao Huang Morin, Ellisha Marongelli

Technical Abstract

Patent Claims

18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-implemented method for automatic speaker recognition, the method comprising: obtaining a first portion of a speech signal; computing, using one or more processing devices, a first frequency representation of the first portion of the speech signal; obtaining a first threshold; identifying a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold; computing, using the one or more processing devices, a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks; obtaining a second threshold; identifying a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold; computing, using the one or more processing devices, a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks; computing, using the one or more processing devices, a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances; obtaining a second portion of the speech signal; computing, using the one or more processing devices, a second frequency representation of the second portion of the speech signal; identifying a third plurality of peaks in the second frequency representation; computing, using the one or more processing devices, a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks; computing, using the one or more processing devices, a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances; generating, using the one or more processing devices, a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and applying the sequence of pitch estimates to recognize a speaker as a source of the speech signal.
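The pitch-estimation steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the naive DFT magnitude spectrum, the local-maximum test for a peak, the use of the median to combine the pooled distances, and all function names (`magnitude_spectrum`, `peak_locations`, `estimate_pitch`, `pitch_sequence`) are assumptions made for illustration. The claim itself specifies only that peaks are values larger than a threshold and that the estimate uses both sets of distances.

```python
import cmath
import math
import statistics


def magnitude_spectrum(frame, sample_rate):
    """Naive DFT magnitude for bins 0..N/2 (assumed frequency representation)."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


def peak_locations(freqs, values, threshold):
    """Frequencies of local maxima whose value is larger than the threshold."""
    return [freqs[i] for i in range(1, len(values) - 1)
            if values[i] > threshold
            and values[i] > values[i - 1]
            and values[i] >= values[i + 1]]


def peak_to_peak_distances(locations):
    """Differences in frequency between consecutive peaks."""
    return [b - a for a, b in zip(locations, locations[1:])]


def estimate_pitch(freqs, values, thresholds):
    """Pool distances found at every threshold; combine with the median (assumption)."""
    distances = []
    for threshold in thresholds:
        peaks = peak_locations(freqs, values, threshold)
        distances.extend(peak_to_peak_distances(peaks))
    return statistics.median(distances) if distances else None


def pitch_sequence(frames, sample_rate, thresholds):
    """One pitch estimate per portion (frame) of the speech signal."""
    return [estimate_pitch(*magnitude_spectrum(frame, sample_rate), thresholds)
            for frame in frames]
```

For a harmonic signal, neighbouring harmonics are spaced by the fundamental frequency, so the pooled peak-to-peak distances cluster around the pitch. Using more than one threshold lets weaker harmonics contribute at the lower threshold while the stronger ones are counted at both.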

2. The method of claim 1, wherein computing the first pitch estimate of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.
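One way to read the cumulative distribution function step is sketched below. The claim does not say how the pitch is derived from the CDF, so reading off the distance at which the empirical CDF first reaches a chosen quantile (0.5 by default) is an assumption, as are the names `empirical_cdf` and `pitch_from_cdf`.

```python
def empirical_cdf(distances):
    """Return (distance, cumulative probability) pairs in ascending order."""
    ordered = sorted(distances)
    n = len(ordered)
    return [(d, (i + 1) / n) for i, d in enumerate(ordered)]


def pitch_from_cdf(distances, quantile=0.5):
    """Smallest distance at which the empirical CDF reaches the quantile (assumption)."""
    for distance, probability in empirical_cdf(distances):
        if probability >= quantile:
            return distance
    return None  # no distances were supplied
```

A quantile read from the CDF is less sensitive than a mean to occasional doubled distances that arise when a harmonic falls below the threshold and is skipped.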

3. The method of claim 1, further comprising computing a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and wherein computing the first pitch estimate of the first portion of the speech signal comprises computing the first pitch estimate using the histogram.
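The histogram variant can be sketched as follows. The fixed bin width, the choice of the fullest bin, the tie-break toward the lower bin, and the name `pitch_from_histogram` are illustrative assumptions; the claim only requires that the estimate be computed using a histogram of the pooled distances.

```python
from collections import Counter


def pitch_from_histogram(distances, bin_width_hz=5.0):
    """Return the centre of the fullest histogram bin (assumed selection rule)."""
    if not distances:
        return None
    counts = Counter(int(d // bin_width_hz) for d in distances)
    # Highest count wins; ties go to the lower-frequency bin.
    best_bin, _ = max(counts.items(), key=lambda item: (item[1], -item[0]))
    return (best_bin + 0.5) * bin_width_hz
```

Binning groups distances that differ only by measurement jitter, so the fullest bin approximates the most common harmonic spacing.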

4. The method of claim 1, wherein the first frequency representation is computed using an estimated fractional chirp rate of the first portion of the speech signal.

5. The method of claim 1, wherein computing the first frequency representation comprises using a first smoothing kernel.
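As an illustration of a smoothing kernel, the sketch below convolves a spectrum with a normalized Gaussian before any peak picking. The claim does not name a kernel shape or say at which stage it is applied, so the Gaussian, the edge clamping, and the names `gaussian_kernel` and `smooth` are assumptions.

```python
import math


def gaussian_kernel(width_bins, sigma_bins):
    """Normalized Gaussian weights over an odd number of bins."""
    half = width_bins // 2
    weights = [math.exp(-0.5 * (k / sigma_bins) ** 2)
               for k in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, kernel):
    """Convolve values with the kernel, clamping indices at the edges."""
    half = len(kernel) // 2
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = min(max(i + j - half, 0), last)
            acc += weight * values[idx]
        smoothed.append(acc)
    return smoothed
```

A wider kernel suppresses spurious narrow peaks at the cost of merging closely spaced harmonics, which is one reason to compare estimates obtained with different kernels.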

6. The method of claim 1, wherein the first frequency representation comprises a log likelihood ratio (LLR) spectrum.

7. The method of claim 1, wherein the first frequency representation comprises a stationary spectrum.

8. A system for automatic speech recognition, the system comprising one or more computing devices comprising at least one processor and at least one memory, the one or more computing devices configured to: obtain a first portion of a speech signal; compute a first frequency representation of the first portion of the speech signal; obtain a first threshold; identify a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold; compute a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks; obtain a second threshold; identify a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold; compute a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks; compute a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances; obtain a second portion of the speech signal; compute a second frequency representation of the second portion of the speech signal; identify a third plurality of peaks in the second frequency representation; compute a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks; compute a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances; generate a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and apply the sequence of pitch estimates to perform automatic speech recognition on the speech signal.

9. The system of claim 8, wherein the one or more computing devices are further configured to compute the first pitch estimate of the first portion by estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.

10. The system of claim 8, wherein the one or more computing devices are further configured to compute a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and generate the first pitch estimate of the first portion speech-signal using the histogram.

11. The system of claim 8, wherein the one or more computing devices are further configured to compute the first frequency representation using a first smoothing kernel.

12. The system of claim 8, wherein the first frequency representation comprises a log-likelihood ratio (LLR) spectrum.

13. The system of claim 8, wherein the one or more computing devices are further configured to: compute the first pitch estimate of the first portion of the speech signal by identifying a most frequently occurring peak-to-peak distance from the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.

14. The system of claim 11, wherein the one or more computing devices are further configured to: compute a third frequency representation of the first portion of the speech signal using a second smoothing kernel; identify a fourth plurality of peaks in the third frequency representation; compute a fourth plurality of peak-to-peak distances using locations in frequency of the fourth plurality of peaks; and compute a third pitch estimate of the first portion of the speech signal using the fourth plurality of peak-to-peak distances.

15. One or more non-transitory computer-readable media comprising computer executable instructions that, when executed, cause at least one processor to perform actions comprising: obtaining a first portion of a speech signal; computing a first frequency representation of the first portion of the speech signal; obtaining a first threshold; identifying a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold; computing a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks; obtaining a second threshold; identifying a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold; computing a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks; computing a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances; obtaining a second portion of the speech signal; computing a second frequency representation of the second portion of the speech signal; identifying a third plurality of peaks in the second frequency representation; computing a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks; computing a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances; generating a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and applying the sequence of pitch estimates to recognize a speaker as a source of the speech signal.

16. The one or more non-transitory computer-readable media of claim 15, wherein computing the first pitch estimate of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.

17. The one or more non-transitory computer-readable media of claim 15, further comprising computing a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and wherein computing the first pitch estimate of the first portion of the speech signal comprises computing the first pitch estimate using the histogram.

18. The one or more non-transitory computer-readable media of claim 15, wherein the first frequency representation comprises a log-likelihood ratio (LLR) spectrum.

Patent Metadata

Filing Date: Unknown

Publication Date: December 12, 2017

Inventors: David C. Bradley, Yao Huang Morin, Ellisha Marongelli
