Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A speech processing apparatus comprising: a frame extraction unit configured to extract a signal portion per frame having a specific duration from an input signal that includes periodic non-speech segments, thus generating a per-frame input signal; a spectrum generation unit configured to convert the per-frame input signal in a time domain into a per-frame input signal in a frequency domain, thereby generating a spectral pattern of spectra; a peak detection unit configured to detect peak spectra having peaks in the spectral pattern by determining at least one spectrum of a first spectrum group of a predetermined number of spectra as the peak spectrum based on a predetermined criterion if an energy ratio of total energy of the first spectrum group to total energy of a second group of the predetermined number of spectra, next to the first spectrum group in the spectral pattern, is equal to or higher than a predetermined threshold level; and a harmonic-overtone determination unit configured to determine a harmonic spectrum, in the peak spectra, having a harmonic structure showing a relationship between a fundamental pitch and a harmonic overtone based on a barycentric frequency weighted by energy of each of the peak spectra; and a noise attenuation unit configured to attenuate energy corresponding to spectra obtained by removing the harmonic spectrum from the peak spectra in the spectral pattern.
A speech processing system attenuates noise in audio that contains periodic non-speech segments. It divides the input into frames of a fixed duration and converts each frame from the time domain to the frequency domain, producing a spectral pattern. To find peaks, it compares the total energy of one group of a predetermined number of spectra with the total energy of the adjacent group of the same size. If the ratio is equal to or higher than a threshold, at least one spectrum in the first group is flagged as a peak according to a predetermined criterion. Among the peaks, the system identifies "harmonic" spectra, those that fit a fundamental pitch and its overtones, using a barycentric frequency weighted by the energy of each peak spectrum. Finally, it reduces the energy of the peaks that remain after the harmonic spectra are removed, which attenuates the noise.
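The energy-ratio test for peak detection can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `detect_peak_groups`, the toy energies, and the threshold value are hypothetical, and it assumes that the group "next to" the first group is the adjacent group at higher frequency.

```python
def detect_peak_groups(energies, group_size, threshold):
    """Return start indices of groups whose energy dominates the adjacent group.

    energies   : per-bin spectral energies for one frame
    group_size : number of bins per group (the claim's "predetermined number")
    threshold  : minimum ratio of first-group energy to second-group energy
    """
    peaks = []
    # Stop early enough that the neighboring group still fits in the spectrum.
    for start in range(0, len(energies) - 2 * group_size + 1):
        first = sum(energies[start:start + group_size])
        # Assumption: the second group is the adjacent higher-frequency group.
        second = sum(energies[start + group_size:start + 2 * group_size])
        # Compare by multiplication so a zero-energy neighbor cannot cause
        # a division by zero.
        if first > 0 and first >= threshold * second:
            peaks.append(start)
    return peaks


# Toy spectrum: a strong component in bins 2-3, weak energy elsewhere.
frame_energies = [1.0, 1.0, 9.0, 8.0, 1.0, 1.0, 1.0, 1.0]
print(detect_peak_groups(frame_energies, group_size=2, threshold=5.0))  # [2]
```

Only the group starting at bin 2 passes: its energy (17.0) is at least five times that of the next group (2.0). Which spectrum inside a passing group becomes the peak depends on the predetermined criterion, which this sketch does not model.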
2. The speech processing apparatus according to claim 1, wherein a frequency bandwidth that covers the first spectrum group is narrower than 100 Hz.
In the speech processing system of claim 1, the group of spectra used for identifying peak spectra covers a frequency bandwidth narrower than 100 Hz. Keeping the group this narrow lets the energy comparison isolate individual peaks rather than broad regions of the spectrum, which supports more accurate identification of harmonic peaks and the noise attenuation that follows.
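As a rough arithmetic check, the bandwidth limit translates into an upper bound on the number of bins per group. The sketch below assumes that a group of n bins covers n times the bin width, which the claim does not state; the helper `max_group_size` and the 31.25 Hz bin width are hypothetical.

```python
import math


def max_group_size(bin_width_hz, bandwidth_limit_hz=100.0):
    """Largest bin count whose total width stays strictly below the limit.

    Assumption: a group of n bins covers n * bin_width_hz.
    """
    return math.ceil(bandwidth_limit_hz / bin_width_hz) - 1


# With a hypothetical 31.25 Hz bin width, three bins cover 93.75 Hz,
# while four bins would cover 125 Hz and exceed the limit.
print(max_group_size(31.25))  # 3
```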
3. The speech processing apparatus according to claim 1, wherein the spectrum generation unit generates the spectral pattern at frequency resolution lower than 33 Hz.
In the speech processing system of claim 1, the frequency-based spectrum is generated at a resolution finer than 33 Hz, meaning adjacent frequency "bins" in the spectrum are less than 33 Hz apart. The finer spacing lets the system distinguish closely spaced frequency components, which supports more accurate peak detection.
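For a discrete Fourier transform, the bin spacing equals the sample rate divided by the transform length, so the resolution limit constrains the transform size. The sketch below is illustrative only: the 8 kHz sample rate, the power-of-two sizing, and the helper names are assumptions, not details from the patent.

```python
def bin_width_hz(sample_rate_hz, fft_size):
    """Spacing between adjacent DFT bins, in Hz."""
    return sample_rate_hz / fft_size


def min_fft_size(sample_rate_hz, max_bin_width_hz=33.0):
    """Smallest power-of-two transform size with bin width below the limit."""
    size = 1
    while sample_rate_hz / size >= max_bin_width_hz:
        size *= 2
    return size


size = min_fft_size(8000)  # 8 kHz is an assumed sample rate
print(size, bin_width_hz(8000, size))  # 256 31.25
```

At 8 kHz, a 256-point transform corresponds to a 32 ms frame and gives 31.25 Hz bins, which satisfies the limit; a 128-point transform would give 62.5 Hz bins and would not.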
4. The speech processing apparatus according to claim 1, wherein the spectrum generation unit generates the spectral pattern in a range from 200 Hz to 2000 Hz.
The speech processing system described in claim 1 generates the spectral pattern within the frequency range of 200 Hz to 2000 Hz. This band is particularly relevant for human speech, so restricting the analysis to it focuses the noise reduction on the most important part of the audio signal and saves computation by ignoring frequencies outside the range.
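Restricting the analysis to this band amounts to selecting a subset of frequency bins. A minimal sketch, assuming a hypothetical 8 kHz sample rate and a 256-point transform (neither is specified in the claim), with an illustrative helper `band_bins`:

```python
def band_bins(sample_rate_hz, fft_size, low_hz=200.0, high_hz=2000.0):
    """Indices of bins whose center frequency lies in [low_hz, high_hz]."""
    width = sample_rate_hz / fft_size
    # Only bins 0 .. fft_size // 2 carry distinct content for a real signal.
    return [k for k in range(fft_size // 2 + 1)
            if low_hz <= k * width <= high_hz]


bins = band_bins(8000, 256)
print(bins[0], bins[-1], len(bins))  # 7 64 58
```

Under these assumptions only 58 of the 129 non-redundant bins need to be examined, from bin 7 (218.75 Hz) to bin 64 (2000 Hz).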
5. The speech processing apparatus according to claim 1 further comprising: a speech determination unit configured to determine whether the per-frame input signal is a speech segment based on the energy-attenuated spectral pattern.
The speech processing system of claim 1 also determines whether an audio frame is a speech segment, based on the spectral pattern after noise attenuation. Working from the noise-reduced spectrum helps the system distinguish speech segments from background noise and classify frames accordingly.
6. The speech processing apparatus according to claim 1 further comprising: a noise reduction unit configured to reduce a noise component in the per-frame input signal.
The speech processing system described initially further includes a module to reduce noise in the input audio frame. This noise reduction can involve filtering techniques to suppress unwanted sounds or statistical methods to estimate and subtract noise profiles from the audio signal. Reducing the initial noise improves the overall quality of the processed speech signal and enhances the performance of subsequent processing stages.
7. The speech processing apparatus according to claim 1, wherein the predetermined criterion is that, if there are an odd number of spectra in the spectral pattern, determined as the peak spectrum is a specific spectrum having a barycentric frequency in the spectra in the spectral pattern or a spectrum next to the specific spectrum in the spectral pattern.
In the speech processing system described in the first claim, the system identifies peak spectra based on a "barycentric frequency" when the spectral pattern contains an odd number of spectra. In that case, the peak spectrum is either the specific spectrum at the barycentric frequency or a spectrum adjacent to it.
8. The speech processing apparatus according to claim 1, wherein the predetermined criterion is that if there are an even number of spectra in the spectral pattern, determined as the peak spectrum is either or both of two specific spectra having a frequency closest to the barycentric frequency in the spectra in the spectral pattern or spectra next to the two spectra in the spectral pattern.
In the speech processing system described in the first claim, the system identifies peak spectra based on a "barycentric frequency" when the spectral pattern contains an even number of spectra. In that case, the peak spectra are either or both of the two spectra whose frequencies are closest to the barycentric frequency, or spectra adjacent to those two.
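One possible reading of the odd and even criteria can be sketched as follows. The sketch interprets the barycentric frequency as the energy-weighted mean frequency of the spectra, picks the single closest bin for an odd count and the two closest bins for an even count, and does not model the "next to" alternatives. The function names and sample values are hypothetical, and the patent description may define the criterion differently.

```python
def barycentric_frequency(freqs_hz, energies):
    """Energy-weighted mean frequency of a set of spectral bins."""
    total = sum(energies)
    if total <= 0:
        raise ValueError("spectra carry no energy")
    return sum(f * e for f, e in zip(freqs_hz, energies)) / total


def pick_peak_bins(freqs_hz, energies):
    """Pick one bin (odd count) or two bins (even count) closest to the
    barycentric frequency. Returns bin positions in ascending order."""
    center = barycentric_frequency(freqs_hz, energies)
    # Rank bin positions by distance from the barycentric frequency.
    ranked = sorted(range(len(freqs_hz)),
                    key=lambda i: abs(freqs_hz[i] - center))
    count = 1 if len(freqs_hz) % 2 == 1 else 2
    return sorted(ranked[:count])


odd = pick_peak_bins([500.0, 531.25, 562.5], [1.0, 6.0, 1.0])
even = pick_peak_bins([500.0, 531.25, 562.5, 593.75], [1.0, 4.0, 4.0, 1.0])
print(odd, even)  # [1] [1, 2]
```

In the odd example the barycentric frequency is 531.25 Hz, which coincides with the middle bin. In the even example it is 546.875 Hz, halfway between the two middle bins, so both are selected.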
August 26, 2014