US-10460749

Voice activity detection using vocal tract area information

Published: October 29, 2019

Assignee: not available in the USPTO data

Inventors: not available in the USPTO data

Technical Abstract

A voice activity detection (VAD) system includes an input processing module configured to receive an acoustic signal and convert it into an analog signal and, subsequently, a digital signal; an energy-based detection module configured to receive either the analog signal or the digital signal and determine a sound activity decision; an area-function-based detection module configured to derive an area-related function from the digital signal and use a machine learning method to output an area-based decision according to the area-related function; and a VAD decision module configured to make a final VAD decision based on the sound activity decision from the energy-based detection module and the area-based decision from the area-function-based detection module.

Patent Claims

27 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A voice activity detection (VAD) system, comprising: a microphone interface circuit configured for coupling to a microphone to receive an acoustic signal and to convert the acoustic signal to an analog signal; an analog-to-digital converter configured to receive the analog signal to generate a digital signal; and a signal processing circuit configured to receive the digital signal and to determine if the digital signal represents a human voice, wherein the signal processing circuit comprises: an acoustic-energy-based detection module configured to receive one of the analog signal or the digital signal and to provide a sound activity decision that indicates if the acoustic signal is in an audible energy range; an area-function-based detection module configured to extract features of the acoustic signal from the digital signal based on area-related functions, and to use a machine-learning method to determine an area-based decision that indicates if the acoustic signal represents a human voice, wherein the machine-learning method comprises a plurality of coefficients trained by a plurality of labeled area-related functions; and a voice activity detection (VAD) decision module configured to make a final VAD decision based on the sound activity decision from the acoustic-energy-based detection module and the area-based decision from the area-function-based detection module; and a resource-limited device configured to receive the final VAD decision to change an operating mode of the resource-limited device.

2. The system of claim 1, wherein the area-related function comprises one of a plurality of log-area-ratios, a log area function, an area function, and a sagittal distance function.

3. The system of claim 1, wherein the area-function-based detection module is configured to perform: filtering the digital signal with a pre-emphasis factor to obtain a pre-emphasized signal; weighting a frame of the pre-emphasized signal to a windowed signal by a window function; converting the windowed signal to a plurality of reflection coefficients; converting the plurality of reflection coefficients to the area-related function; feeding the area-related function to a trained classifier to identify onsets of voice; and issuing the area-based decision.
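The feature-extraction steps recited in claim 3 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the pre-emphasis factor of 0.97, the Hamming window, the predictor order of 10, the autocorrelation plus Levinson-Durbin route to the reflection coefficients, and the sign convention used for the log-area-ratios are all assumptions chosen for illustration. For brevity the sketch also applies pre-emphasis per frame rather than to the whole digital signal.

```python
import math

def pre_emphasize(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is an assumed value.
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(length):
    # Symmetric Hamming window (one of the window families the claims list).
    if length < 2:
        return [1.0] * length
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def autocorrelation(x, max_lag):
    # Autocorrelation r[0..max_lag] of one windowed frame.
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def reflection_coefficients(r, order):
    # Levinson-Durbin recursion for the predictor A(z) = 1 + sum a_j z^-j.
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        if err <= 1e-12:          # silent or degenerate frame: stop updating
            ks.append(0.0)
            continue
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
        ks.append(k)
    return ks

def log_area_ratios(ks, eps=1e-6):
    # LAR_i = log((1 - k_i) / (1 + k_i)); the sign convention is an assumption.
    out = []
    for k in ks:
        k = max(-1.0 + eps, min(1.0 - eps, k))  # keep the logarithm defined
        out.append(math.log((1.0 - k) / (1.0 + k)))
    return out

def lar_features(frame, order=10, alpha=0.97):
    # Hypothetical helper: one frame of samples in, `order` log-area-ratios out.
    emphasized = pre_emphasize(frame, alpha)
    window = hamming(len(emphasized))
    windowed = [s * w for s, w in zip(emphasized, window)]
    r = autocorrelation(windowed, order)
    return log_area_ratios(reflection_coefficients(r, order))
```

Calling `lar_features(frame)` on, say, a 20 ms frame yields one feature vector per frame, which would then be passed to the trained classifier.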

4. The system of claim 3, wherein the trained classifier is trained offline by a neural network or a logistic regression.
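As an illustration of the logistic-regression option in claim 4, the sketch below trains a small set of coefficients offline on labeled feature vectors and then uses them to produce both a soft and a hard area-based decision. The training loop (plain batch gradient descent), the learning rate, the epoch count, the 0.5 threshold, and all function names are assumptions for illustration; the claims do not specify them.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(features, labels, lr=0.1, epochs=500):
    # Offline training: batch gradient descent on the cross-entropy loss.
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            d = p - y
            for j in range(dim):
                grad_w[j] += d * x[j]
            grad_b += d
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def area_based_decision(x, w, b, threshold=0.5):
    # Returns (soft decision in [0, 1], hard decision as bool).
    soft = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return soft, soft >= threshold
```

Only the trained `w` and `b` would need to be stored on the device; inference is a single dot product and one sigmoid per frame.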

5. A voice activity detection (VAD) system, comprising: an input processing module configured to receive an acoustic signal via a microphone, the input processing module configured to convert the acoustic signal into an analog signal, and subsequently, a digital signal; an energy-based detection module configured to receive one of the analog signal or the digital signal and determine a sound activity decision; an area-function-based detection module configured to derive an area-related function from the digital signal and use a machine learning method to output an area-based decision according to the area-related function, wherein the machine learning method comprises a plurality of coefficients trained by a plurality of labeled area related functions; and a VAD decision module configured to make a final VAD decision based on the sound activity decision from the energy-based detection module and the area-based decision from the area-function-based detection module, wherein the final VAD decision is subsequently sent to a resource-limited device to change an operating mode of the resource-limited device.
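The claims do not state how the energy-based and area-based decisions are combined, so the following Python sketch shows only one plausible arrangement: a frame-energy threshold acts as a gate, and the area-based soft decision is consulted only when the gate is open. The -40 dB threshold, the full-scale reference of 1.0, the 0.5 area threshold, and the function names are hypothetical.

```python
import math

def frame_energy_db(frame, full_scale=1.0):
    # Mean-square energy of one frame in dB relative to full scale.
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square <= 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))

def sound_activity_decision(frame, threshold_db=-40.0):
    # Hard decision; threshold_db is an assumed placeholder value.
    return frame_energy_db(frame) >= threshold_db

def final_vad_decision(sound_active, area_soft, area_threshold=0.5):
    # Assumed fusion rule: both detectors must agree that voice is present.
    return bool(sound_active) and area_soft >= area_threshold
```

Gating on the cheap energy check first means the costlier area-function stage can stay idle during silence, which suits a resource-limited device.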

6. The system of claim 5, wherein the energy-based detection module is a software module receiving the digital signal.

7. The system of claim 5, wherein the energy-based detection module is a digital hardware block receiving the digital signal.

8. The system of claim 5, wherein the energy-based detection module is an analog hardware block receiving the analog signal.

9. The system of claim 5, wherein the area-related function is a plurality of log-area-ratios.

10. The system of claim 5, wherein the area-related function comprises one of a plurality of log-area-ratios, a log area function, an area function, and a sagittal distance function.

11. The system of claim 5, wherein the sound activity decision is a soft decision value.

12. The system of claim 5, wherein the sound activity decision is a hard decision value.

13. The system of claim 5, wherein the area-function-based detection module is configured to perform the steps of: filtering the digital signal with a pre-emphasis factor to obtain a pre-emphasized signal; weighting a frame of the pre-emphasized signal to a windowed signal by a window function; converting the windowed signal to a plurality of reflection coefficients; converting the plurality of reflection coefficients to the area-related function; feeding the area-related function to a trained classifier to identify onsets of voice; and issuing the area-based decision.

14. The system of claim 13, wherein the pre-emphasis factor ranges from 0.5 to 0.99.

15. The system of claim 13, wherein a frame shift ranges from 1 millisecond to 20 milliseconds.

16. The system of claim 13, wherein the window function is one of Blackman, Blackman-Harris, Bohman, Chebyshev, Gaussian, Hamming, Hanning, Kaiser, Nuttall, Parzen, Taylor, and Tukey.

17. The system of claim 13, wherein the trained classifier is trained by a neural network offline.

18. The system of claim 13, wherein the trained classifier is trained by a logistic regression offline.

19. The system of claim 13, wherein the area-based decision is a soft decision value.

20. The system of claim 13, wherein the area-based decision is a hard decision value.

21. The system of claim 13, wherein the area-function-based detection module is configured to further generate a linear predictive error and include this error value to be a feature in the area-based decision.
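One way to obtain a linear predictive error without extra signal processing is to compute it from the reflection coefficients themselves: in the Levinson-Durbin recursion, the residual energy normalized by the frame energy equals the product of (1 - k_i^2) over all stages. The sketch below uses that identity and appends the value to the area-related feature vector. Treating the normalized residual energy as the claimed "error value", and the helper names, are assumptions.

```python
def normalized_prediction_error(ks):
    # Residual energy divided by frame energy: product of (1 - k_i^2).
    err = 1.0
    for k in ks:
        err *= 1.0 - k * k
    return err

def append_error_feature(area_features, ks):
    # Hypothetical helper: extend the feature vector with the error value.
    return list(area_features) + [normalized_prediction_error(ks)]
```

A value near 0 means the frame is well modeled by the predictor, while a value near 1 means the frame is close to white noise.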

22. The system of claim 5, further comprises a zero-crossing-based detection module configured to generate a second decision based on a zero crossing rate, wherein the VAD decision module includes the second decision in a final decision process.
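A zero crossing rate of the kind referred to in claim 22 can be computed per frame as the fraction of adjacent sample pairs whose signs differ. The Python sketch below shows that computation together with a hard second decision; the lower and upper thresholds are hypothetical placeholders, not values from the patent.

```python
def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs that change sign (0.0 to 1.0).
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)

def zero_crossing_decision(frame, low=0.02, high=0.35):
    # Hard second decision; low and high are assumed placeholder thresholds.
    return low <= zero_crossing_rate(frame) <= high
```

Returning the rate itself instead of the thresholded boolean would give the soft-decision variant.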

23. The system of claim 22, wherein the second decision is a soft decision value.

24. The system of claim 22, wherein the second decision is a hard decision value.

25. The system of claim 5, wherein the resource-limited device is a low power device and the operating mode comprises an idle mode and a wake up mode.

26. The system of claim 5, wherein the resource-limited device is a voice storage device and the operating mode comprises an idle mode and a recording mode.

27. The system of claim 5, wherein the resource-limited device is a voice transmitting device and the operating mode comprises an idle mode and a transmitting mode.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L

Patent Metadata

Filing Date

June 28, 2018

Publication Date

October 29, 2019
