A voice activity detection (VAD) system includes an input processing module configured to receive an acoustic signal and convert the acoustic signal into an analog signal and, subsequently, a digital signal; an energy-based detection module configured to receive either the analog signal or the digital signal and determine a sound activity decision; an area-function-based detection module configured to derive an area-related function from the digital signal and use a machine learning method to output an area-based decision according to the area-related function; and a VAD decision module configured to make a final VAD decision based on the sound activity decision from the energy-based detection module and the area-based decision from the area-function-based detection module.
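The two-detector decision structure summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the -40 dB energy threshold, the 0.5 score threshold, and the rule that both detectors must agree are all hypothetical. The source only states that the final decision is based on both the sound activity decision and the area-based decision.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB, relative to a full scale of 1.0."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    # A small floor avoids log10(0) on an all-zero frame.
    return 10.0 * math.log10(mean_sq + 1e-12)


def energy_decision(frame, threshold_db=-40.0):
    """Hard sound activity decision (threshold value is an assumption)."""
    return frame_energy_db(frame) > threshold_db


def final_vad_decision(sound_active, area_score, area_threshold=0.5):
    """Hypothetical fusion rule: report voice only if both detectors agree.

    sound_active: hard decision from the energy-based detector.
    area_score:   soft decision in [0, 1] from the area-function-based detector.
    """
    return sound_active and area_score >= area_threshold


# Example: a 20 ms frame at 8 kHz containing a 200 Hz tone, and a silent frame.
tone = [0.5 * math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(160)]
silence = [0.0] * 160

print(final_vad_decision(energy_decision(tone), area_score=0.8))
print(final_vad_decision(energy_decision(silence), area_score=0.8))
```

Gating on the energy decision first reflects one plausible reading of the design: a cheap detector can suppress work in the more expensive classifier when the input is silent. Other combination rules, such as a weighted sum of soft decisions, would fit the same description.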
Legal claims defining the scope of protection, as filed with the USPTO.
1. A voice activity detection (VAD) system, comprising: a microphone interface circuit configured for coupling to a microphone to receive an acoustic signal and to convert the acoustic signal to an analog signal; an analog-to-digital converter configured to receive the analog signal to generate a digital signal; and a signal processing circuit configured to receive the digital signal and to determine if the digital signal represents a human voice, wherein the signal processing circuit comprises: an acoustic-energy-based detection module configured to receive one of the analog signal or the digital signal and to provide a sound activity decision that indicates if the acoustic signal is in an audible energy range; an area-function-based detection module configured to extract features of the acoustic signal from the digital signal based on area-related functions, and to use a machine-learning method to determine an area-based decision that indicates if the acoustic signal represents a human voice, wherein the machine-learning method comprises a plurality of coefficients trained by a plurality of labeled area-related functions; and a voice activity detection (VAD) decision module configured to make a final VAD decision based on the sound activity decision from the acoustic-energy-based detection module and the area-based decision from the area-function-based detection module; and a resource-limited device configured to receive the final VAD decision to change an operating mode of the resource-limited device.
2. The system of claim 1, wherein the area-related function comprises one of a plurality of log-area-ratios, a log area function, an area function, and a sagittal distance function.
3. The system of claim 1, wherein the area-function-based detection module is configured to perform: filtering the digital signal with a pre-emphasis factor to obtain a pre-emphasized signal; weighting a frame of the pre-emphasized signal to a windowed signal by a window function; converting the windowed signal to a plurality of reflection coefficients; converting the plurality of reflection coefficients to the area-related function; feeding the area-related function to a trained classifier to identify onsets of voice; and issuing the area-based decision.
4. The system of claim 3, wherein the trained classifier is trained offline by a neural network or a logistic regression.
5. A voice activity detection (VAD) system, comprising: an input processing module configured to receive an acoustic signal via a microphone, the input processing module configured to convert the acoustic signal into an analog signal, and subsequently, a digital signal; an energy-based detection module configured to receive one of the analog signal or the digital signal and determine a sound activity decision; an area-function-based detection module configured to derive an area-related function from the digital signal and use a machine learning method to output an area-based decision according to the area-related function, wherein the machine learning method comprises a plurality of coefficients trained by a plurality of labeled area-related functions; and a VAD decision module configured to make a final VAD decision based on the sound activity decision from the energy-based detection module and the area-based decision from the area-function-based detection module, wherein the final VAD decision is subsequently sent to a resource-limited device to change an operating mode of the resource-limited device.
6. The system of claim 5, wherein the energy-based detection module is a software module receiving the digital signal.
7. The system of claim 5, wherein the energy-based detection module is a digital hardware block receiving the digital signal.
8. The system of claim 5, wherein the energy-based detection module is an analog hardware block receiving the analog signal.
9. The system of claim 5, wherein the area-related function is a plurality of log-area-ratios.
10. The system of claim 5, wherein the area-related function comprises one of a plurality of log-area-ratios, a log area function, an area function, and a sagittal distance function.
11. The system of claim 5, wherein the sound activity decision is a soft decision value.
12. The system of claim 5, wherein the sound activity decision is a hard decision value.
13. The system of claim 5, wherein the area-function-based detection module is configured to perform the steps of: filtering the digital signal with a pre-emphasis factor to obtain a pre-emphasized signal; weighting a frame of the pre-emphasized signal to a windowed signal by a window function; converting the windowed signal to a plurality of reflection coefficients; converting the plurality of reflection coefficients to the area-related function; feeding the area-related function to a trained classifier to identify onsets of voice; and issuing the area-based decision.
14. The system of claim 13, wherein the pre-emphasis factor ranges from 0.5 to 0.99.
15. The system of claim 13, wherein a frame shift ranges from 1 millisecond to 20 milliseconds.
16. The system of claim 13, wherein the window function is one of Blackman, Blackman-Harris, Bohman, Chebyshev, Gaussian, Hamming, Hanning, Kaiser, Nuttall, Parzen, Taylor, and Tukey.
17. The system of claim 13, wherein the trained classifier is trained by a neural network offline.
18. The system of claim 13, wherein the trained classifier is trained by a logistic regression offline.
19. The system of claim 13, wherein the area-based decision is a soft decision value.
20. The system of claim 13, wherein the area-based decision is a hard decision value.
21. The system of claim 13, wherein the area-function-based detection module is configured to further generate a linear predictive error and include this error value as a feature in the area-based decision.
22. The system of claim 5, further comprising a zero-crossing-based detection module configured to generate a second decision based on a zero crossing rate, wherein the VAD decision module includes the second decision in a final decision process.
23. The system of claim 22, wherein the second decision is a soft decision value.
24. The system of claim 22, wherein the second decision is a hard decision value.
25. The system of claim 5, wherein the resource-limited device is a low power device and the operating mode comprises an idle mode and a wake up mode.
26. The system of claim 5, wherein the resource-limited device is a voice storage device and the operating mode comprises an idle mode and a recording mode.
27. The system of claim 5, wherein the resource-limited device is a voice transmitting device and the operating mode comprises an idle mode and a transmitting mode.
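The processing steps recited in claims 3 and 13 (pre-emphasis, windowing, reflection coefficients, area-related function, trained classifier) can be illustrated with the following sketch. It rests on several assumptions that the claims do not fix: a Hamming window (one of the listed options), the Levinson-Durbin recursion on the frame autocorrelation as the route to reflection coefficients, log-area-ratios as the area-related function (their sign convention varies between texts), and a logistic-regression classifier. All function names are hypothetical, and the weights shown are placeholders rather than trained coefficients.

```python
import math


def pre_emphasis(frame, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; claim 14 gives alpha in 0.5 to 0.99."""
    return [frame[0]] + [frame[n] - alpha * frame[n - 1] for n in range(1, len(frame))]


def hamming_window(frame):
    """Weight a frame by a Hamming window."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def reflection_coefficients(frame, order):
    """Levinson-Durbin recursion; returns (reflection coefficients, prediction error)."""
    n = len(frame)
    r = [sum(frame[i] * frame[i + lag] for i in range(n - lag)) for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:
            # Degenerate (e.g. silent) frame: pad the remaining coefficients with zeros.
            ks.extend([0.0] * (order - m + 1))
            break
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return ks, err


def log_area_ratios(ks, limit=0.9999):
    """LAR_i = log((1 - k_i) / (1 + k_i)), with k clamped away from +/-1."""
    lars = []
    for k in ks:
        k = max(-limit, min(limit, k))
        lars.append(math.log((1.0 - k) / (1.0 + k)))
    return lars


def area_based_decision(frame, weights, bias, alpha=0.97):
    """Soft area-based decision in (0, 1) from a logistic model over the LARs."""
    emphasized = pre_emphasis(frame, alpha)
    windowed = hamming_window(emphasized)
    ks, _ = reflection_coefficients(windowed, order=len(weights))
    lars = log_area_ratios(ks)
    z = bias + sum(w * x for w, x in zip(weights, lars))
    return 1.0 / (1.0 + math.exp(-z))


# Example: a 20 ms frame at 8 kHz with two harmonics; the weights are placeholders.
frame = [math.sin(2 * math.pi * 200 * n / 8000) + 0.5 * math.sin(2 * math.pi * 400 * n / 8000)
         for n in range(160)]
score = area_based_decision(frame, weights=[0.1] * 10, bias=0.0)
print(score)
```

The returned score corresponds to a soft decision value (claim 19); comparing it to a threshold would give a hard decision value (claim 20). The prediction error returned by `reflection_coefficients` is discarded here, but it could be appended to the feature vector in the manner of claim 21.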
June 28, 2018
October 29, 2019