Provided is a signal processing apparatus that includes a reference signal generating section that generates a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound. The signal processing apparatus further includes a sound source extracting section that extracts, from the mixed sound signal of one frame or multiple frames, a signal of one frame which is similar to the reference signal and in which the target sound is more enhanced.
Legal claims defining the scope of protection, as filed with the USPTO.
. A signal processing apparatus, comprising:
. The signal processing apparatus according to, wherein
. The signal processing apparatus according to, wherein
. A signal processing method, comprising:
. A non-transitory computer-readable medium having stored thereon, computer-executable instructions which, when executed by a computer, cause the computer to execute operations, the operations comprising:
. A signal processing apparatus, comprising:
. The signal processing apparatus according to, wherein
. The signal processing apparatus according to, wherein the sound source extracting section is further configured to extract the final signal of one frame of multiple frames from the mixed sound signal of one of the one frame or the multiple frames.
. The signal processing apparatus according to, wherein
. A signal processing method, comprising:
. A non-transitory computer-readable medium having stored thereon, computer-executable instructions which, when executed by a computer, cause the computer to execute operations, the operations comprising:
. A signal processing apparatus, comprising:
. The signal processing apparatus according to, wherein a process of each of the estimation of the extraction filter and the extraction of the specific signal from the mixed sound signal is executed iteratively.
. The signal processing apparatus according to, wherein the sound source extracting section is further configured to update the adjustable parameter and update the extraction filter alternately.
. The signal processing apparatus according to, wherein in a case where the process of the generation of the reference signal and the process of the estimation of the extraction filter and the extraction of the specific signal from the mixed sound signal are executed iteratively:
. The signal processing apparatus according to, wherein the sound source model is one of:
. A signal processing method, comprising:
. A non-transitory computer-readable medium having stored thereon, computer-executable instructions which, when executed by a computer, cause the computer to execute operations, the operations comprising:
Complete technical specification and implementation details from the patent document.
This application is a U.S. National Phase of International Patent Application No. PCT/JP2022/000834 filed on Jan. 13, 2022, which claims priority benefit of Japanese Patent Application No. JP 2021-038488 filed in the Japan Patent Office on Mar. 10, 2021. Each of the above-referenced applications is hereby incorporated herein by reference in its entirety.
The present technology relates to a signal processing apparatus, a signal processing method, and a program and, in particular, relates to a signal processing apparatus, a signal processing method, and a program that make it possible to improve precision of target sound extraction.
Technologies have been proposed to extract a sound that is desired to be extracted (hereinafter referred to as a target sound as appropriate) from a mixed sound signal which is a mixture of the target sound and a sound that is desired to be removed (hereinafter referred to as an interfering sound as appropriate) (e.g., see PTL 1 to PTL 3 described below).
[PTL 1]
Japanese Patent Laid-open No. 2006-72163
[PTL 2]
Japanese Patent No. 4449871
[PTL 3]
Japanese Patent Laid-open No. 2014-219467
It is desired in such a field to improve precision of target sound extraction.
The present technology has been made in view of such a situation, and an object thereof is to make it possible to improve precision of target sound extraction.
A signal processing apparatus according to a first aspect of the present technology includes a reference signal generating section that generates a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound, and a sound source extracting section that extracts, from the mixed sound signal of one frame or multiple frames, a signal of one frame which is similar to the reference signal and in which the target sound is more enhanced.
A signal processing method or program according to the first aspect of the present technology includes steps of generating a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound, and extracting, from the mixed sound signal of one frame or multiple frames, a signal of one frame which is similar to the reference signal and in which the target sound is more enhanced.
In the first aspect of the present technology, a reference signal corresponding to a target sound is generated on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound, and a signal of one frame which is similar to the reference signal and in which the target sound is more enhanced is extracted from the mixed sound signal of one frame or multiple frames.
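The extraction of a signal of one frame from the mixed sound signal of multiple frames can be pictured as a multi-tap linear filter. The following is a minimal sketch under assumptions that are not stated in this section: it handles a single frequency bin, stacks the current frame and the preceding frames into one vector, and the names stack_frames, apply_filter, and taps are hypothetical.

```python
import numpy as np

def stack_frames(X, taps):
    """Stack each frame with its (taps - 1) predecessors.

    X: complex array of shape (channels, frames) for one frequency bin.
    Returns an array of shape (channels * taps, frames). Column t holds
    frames t, t-1, ..., t-taps+1, with zeros before the first frame.
    Assumes taps <= frames.
    """
    n_ch, n_fr = X.shape
    stacked = np.zeros((n_ch * taps, n_fr), dtype=X.dtype)
    for k in range(taps):
        # Block k holds the signal delayed by k frames.
        stacked[k * n_ch:(k + 1) * n_ch, k:] = X[:, :n_fr - k]
    return stacked

def apply_filter(w, X_stacked):
    """Compute y(t) = w^H x(t): one output frame per stacked input column."""
    return w.conj() @ X_stacked
```

With taps set to 1 the stacked signal equals the input, which corresponds to extraction from the mixed sound signal of one frame.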
A signal processing apparatus according to a second aspect of the present technology includes a reference signal generating section that generates a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound, and a sound source extracting section that extracts, from the mixed sound signal, a signal which is similar to the reference signal and in which the target sound is more enhanced. In a case where a process of generating the reference signal and a process of extracting the signal from the mixed sound signal are performed iteratively, the reference signal generating section generates a new reference signal on the basis of the signal extracted from the mixed sound signal, and the sound source extracting section extracts the signal from the mixed sound signal on the basis of the new reference signal.
A signal processing method or program according to the second aspect of the present technology includes performing a process of generating a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound and a process of extracting, from the mixed sound signal, a signal which is similar to the reference signal and in which the target sound is more enhanced. In a case where the process of generating the reference signal and the process of extracting the signal from the mixed sound signal are performed iteratively, the signal processing method or program includes steps of generating a new reference signal on the basis of the signal extracted from the mixed sound signal, and extracting the signal from the mixed sound signal on the basis of the new reference signal.
In the second aspect of the present technology, a process of generating a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound and a process of extracting, from the mixed sound signal, a signal which is similar to the reference signal and in which the target sound is more enhanced are performed. In a case where the process of generating the reference signal and the process of extracting the signal from the mixed sound signal are performed iteratively, a new reference signal is generated on the basis of the signal extracted from the mixed sound signal, and the signal is extracted from the mixed sound signal on the basis of the new reference signal.
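The control flow of the iterative case can be sketched as follows. This is only an illustration: the function name iterate_extraction and its parameters are hypothetical, the extraction step is passed in as a callable, and using the amplitude of the previous extraction result as the new reference signal is an assumption, not something this section specifies.

```python
import numpy as np

def iterate_extraction(X, initial_reference, extract,
                       make_reference=np.abs, n_iter=3):
    """Alternate between reference generation and extraction.

    X: mixed sound signal (any array the extract callable accepts).
    extract(X, reference) -> extraction result.
    make_reference(result) -> new reference signal (assumed: amplitude).
    Returns the last extraction result and the reference that produced it.
    """
    reference = initial_reference
    y = None
    for i in range(n_iter):
        if i > 0:
            # From the second pass on, the reference is regenerated
            # from the signal extracted in the previous pass.
            reference = make_reference(y)
        y = extract(X, reference)
    return y, reference
```

Passing the extraction step as a callable keeps the loop independent of how the extraction filter itself is estimated.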
A signal processing apparatus according to a third aspect of the present technology includes a reference signal generating section that generates a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound, and a sound source extracting section that estimates an extraction filter as a solution that optimizes an objective function that includes an extraction result being a signal which is similar to the reference signal and in which the target sound is more enhanced by the extraction filter, and an adjustable parameter of a sound source model representing similarity between the extraction result and the reference signal, the objective function reflecting independence and the similarity between the extraction result and a separation result of another imaginary sound source, and extracts the signal from the mixed sound signal on the basis of the estimated extraction filter.
A signal processing method or program according to the third aspect of the present technology includes steps of generating a reference signal corresponding to a target sound on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound, estimating an extraction filter as a solution that optimizes an objective function that includes an extraction result being a signal which is similar to the reference signal and in which the target sound is more enhanced by the extraction filter, and an adjustable parameter of a sound source model representing similarity between the extraction result and the reference signal, the objective function reflecting independence and the similarity between the extraction result and a separation result of another imaginary sound source, and extracting the signal from the mixed sound signal on the basis of the estimated extraction filter.
In the third aspect of the present technology, a reference signal corresponding to a target sound is generated on the basis of a mixed sound signal which is recorded with multiple microphones arranged at different positions and is a mixture of the target sound and a non-target sound. An extraction filter is estimated as a solution that optimizes an objective function that includes an extraction result being a signal which is similar to the reference signal and in which the target sound is more enhanced by the extraction filter, and an adjustable parameter of a sound source model representing similarity between the extraction result and the reference signal, the objective function reflecting independence and the similarity between the extraction result and a separation result of another imaginary sound source. The signal is extracted from the mixed sound signal on the basis of the estimated extraction filter.
[Notation in Present Specification]
(Notation of Formulae)
Note that formulae are explained hereinbelow in accordance with the notation described below.
In the present specification, “sounds (sound signals)” and “voices (voice signals)” are used with different meanings. “Sounds” are used as a term having a typical meaning such as sounds or audio, and “voices” are used as a term representing voices or speeches.
In addition, “separation” and “extraction” are used with different meanings as follows. “Separation” is the opposite of mixing, and is used as a term meaning dividing a signal which is a mixture of multiple raw signals into the respective raw signals (there are both multiple inputs and multiple outputs). “Extraction” is used as a term meaning taking out one raw signal from a signal which is a mixture of multiple raw signals (there are multiple inputs, but there is one output).
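The difference between the two terms shows up in the shapes of the filters and outputs. The sketch below assumes an instantaneous mixture and hypothetical names (separate, extract, W, w); it only illustrates the input/output counts and is not the estimation method of the present disclosure.

```python
import numpy as np

def separate(W, X):
    """Separation: a matrix W of shape (sources, channels) maps the
    multi-channel mixture X (channels, frames) to all sources at once,
    giving an output of shape (sources, frames)."""
    return W @ X

def extract(w, X):
    """Extraction: a single filter w of shape (channels,) maps the
    multi-channel mixture X (channels, frames) to one output signal
    of shape (frames,)."""
    return w @ X
```

An extraction filter corresponds to one row of a separation matrix, which is why extraction has multiple inputs but only one output.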
“Applying a filter” and “filtering” have the same meaning, and similarly, “applying a mask” and “masking” have the same meaning.
To start with, in order to facilitate understanding of the present disclosure, an overview and the background of the present disclosure and problems that should be considered in the present disclosure are explained.
The present disclosure relates to sound source extraction using reference signals (references). In addition to recording, with multiple microphones, of a signal which is a mixture of a sound that is desired to be extracted (target sound) and a sound that is desired to be eliminated (interfering sound), a signal processing apparatus generates a “rough” amplitude spectrogram corresponding to the target sound, and uses the amplitude spectrogram as a reference signal to thereby generate an extraction result which is similar to and more precise than the reference signal. That is, an embodiment of the present disclosure is a signal processing apparatus that extracts, from the mixed sound signal, a signal which is similar to the reference signal and in which the target sound is more enhanced.
In a process performed at the signal processing apparatus, an objective function reflecting both the similarity between the reference signal and the extraction result and the independence between the extraction result and another imaginary separation result is prepared, and an extraction filter is determined as a solution that optimizes the objective function. With use of a deflation method used in blind sound source separation, a signal to be output can be a signal of only one sound source corresponding to the reference signal. Since this approach can be regarded as a beamformer considering both the similarity and the independence, it is referred to as a Similarity-and-Independence-aware Beamformer (SIBF) as appropriate hereinbelow.
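To make the idea of a reference-guided extraction filter concrete, the following minimal sketch handles one frequency bin. It rests on assumptions that this section does not state: the observation is first decorrelated (whitened), the sound source model is taken to be a time-varying Gaussian whose variance is the squared reference, and under that model the filter is the eigenvector with the smallest eigenvalue of a reference-weighted covariance matrix. The function names and the floor parameter are hypothetical.

```python
import numpy as np

def whiten(X):
    """Decorrelate X (channels, frames) so its covariance is the identity."""
    cov = X @ X.conj().T / X.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    P = vecs @ np.diag(vals ** -0.5) @ vecs.conj().T
    return P @ X

def extract_with_reference(X, reference, floor=1e-3):
    """Extract one signal from X guided by a rough amplitude reference.

    X: complex array (channels, frames) for one frequency bin.
    reference: non-negative array (frames,), rough target amplitude.
    Returns the extraction result, shape (frames,), up to scale and phase.
    """
    U = whiten(X)
    r = np.maximum(reference, floor)        # avoid division by zero
    weights = 1.0 / r ** 2                  # assumed time-varying Gaussian model
    C = (U * weights) @ U.conj().T / U.shape[1]
    _, vecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    w = vecs[:, 0]                          # unit-norm filter with minimum weighted power
    return w.conj() @ U
```

Frames where the reference is small are weighted heavily, so the unit-norm filter that minimizes the weighted output power favors a signal whose energy is concentrated where the reference is large. The scale and phase of the output are left undetermined by this step.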
The present disclosure relates to sound source extraction using reference signals (references). In addition to recording, with multiple microphones, of a signal which is a mixture of a sound that is desired to be extracted (target sound) and a sound that is desired to be eliminated (interfering sound), a “rough” amplitude spectrogram corresponding to the target sound is acquired or generated, and the amplitude spectrogram is used as a reference signal to thereby generate an extraction result which is similar to and more precise than the reference signal.
Situations in which the present disclosure is used are assumed to satisfy, for example, all of conditions (1) to (3) described below.
(1) An observation signal is recorded synchronously with multiple microphones.
(2) It is assumed that a zone, that is, a time range, in which a target sound is being produced is known, and the observation signal described before includes at least the zone.
(3) It is assumed that, as a reference signal, a rough amplitude spectrogram (rough target sound spectrogram) corresponding to the target sound has been acquired or can be generated from the observation signal described before.
Each of the conditions described above is explained supplementarily.
Regarding the condition (1) described above, the microphones may be or may not be fixed, and in either case, the positions of the microphones and sound sources may be unknown. Examples of fixed microphones include a microphone array, and conceivable examples of unfixed microphones include pin microphones or the like that are worn by speakers.
Regarding the condition (2) described above, the zone in which the target sound is being produced is an utterance zone in a case where, for example, a voice of a particular speaker is to be extracted. It is assumed that, while the zone is known, it is unknown whether or not the target sound is being produced outside the zone. That is, the hypothesis that there is no target sound outside the zone does not hold true in some cases.
Regarding the condition (3) described above, the rough target sound spectrogram means a spectrogram that has deteriorated as compared with the true target sound spectrogram because it meets one or more of the following conditions a) to f).
a) It is data of real numbers not including phase information.
b) The target sound is dominant, but an interfering sound is also included.
c) The interfering sound has mostly been removed, but the sound is distorted as a side effect of the removal.
d) In the time direction and/or frequency direction, the resolution has deteriorated as compared with the true target sound spectrogram.
e) Unlike the observation signal, magnitude comparison of the scales of amplitude of the spectrograms is meaningless. For example, even if the amplitude of the rough target sound spectrogram is half the amplitude of an observation signal spectrogram, this does not necessarily mean that the target sound and the interfering sound are included in the observation signal at equal magnitude.
f) It is an amplitude spectrogram generated from a non-sound signal.
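Conditions a), d), and e) can be made concrete by simulating a rough spectrogram from a clean one, for example to prepare test data. The sketch below is only an illustration under assumptions: the function roughen and its parameters f_pool, t_pool, and gain are hypothetical, and block averaging is just one way to reduce resolution.

```python
import numpy as np

def roughen(spec, f_pool=4, t_pool=4, gain=0.5):
    """Degrade a complex spectrogram (freq, frames) into a rough reference.

    a) phase is discarded, leaving real-valued amplitudes;
    d) resolution is reduced by averaging f_pool x t_pool blocks;
    e) the result is multiplied by an arbitrary gain, so its scale
       is not comparable with that of the observation signal.
    Trailing rows/columns that do not fill a block are dropped.
    """
    amp = np.abs(spec)
    n_f, n_t = amp.shape
    f_trim = n_f - n_f % f_pool
    t_trim = n_t - n_t % t_pool
    blocks = amp[:f_trim, :t_trim].reshape(
        f_trim // f_pool, f_pool, t_trim // t_pool, t_pool)
    coarse = blocks.mean(axis=(1, 3))
    # Expand back so every bin in a block shares the block average.
    rough = np.repeat(np.repeat(coarse, f_pool, axis=0), t_pool, axis=1)
    return gain * rough
```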
The rough target sound spectrogram described above is acquired or generated by a method like the ones below, for example.
One object of the present disclosure is to use, as a reference signal, a rough target sound spectrogram acquired/generated in the manner described above and generate an extraction result which is more precise than the reference signal (and is closer to a true target sound). More specifically, in a sound source extraction process in which a linear filter is applied to a multi-channel observation signal to generate an extraction result, a linear filter to generate an extraction result which is more precise than a reference signal (closer to a true target sound) is estimated.
In the present disclosure, a linear filter is estimated for the sound source extraction process in order to obtain the following merits that a linear filter provides.
Merit 1: It provides a less distorted extraction result as compared with a non-linear extraction process. Accordingly, in a case where it is combined with voice recognition or the like, deterioration of the recognition precision due to distortions can be avoided.
Merit 2: The phase of the extraction result can be estimated appropriately by a rescaling process described later. Accordingly, in a case where it is combined with post-processing dependent on the phase (also including cases where the extraction result is reproduced as a sound and humans listen to it), it is possible to avoid problems attributable to an inappropriate phase.
Merit 3: The extraction precision can be improved easily by increasing the number of microphones.
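The rescaling process mentioned in Merit 2 is not described in this section. As an illustration only, one common choice is to scale the extraction result by a single complex coefficient so that it best matches, in the least-squares sense, the observation at one microphone; the sketch below assumes that choice, and the name rescale and the argument x_ref_mic are hypothetical.

```python
import numpy as np

def rescale(y, x_ref_mic):
    """Fix the scale and phase of an extraction result.

    y: extraction result for one frequency bin, shape (frames,).
    x_ref_mic: observation at one chosen microphone, shape (frames,).
    Returns gamma * y, where the complex scalar gamma minimizes
    sum_t |x_ref_mic(t) - gamma * y(t)|^2.
    """
    # np.vdot conjugates its first argument: sum(conj(y) * x_ref_mic).
    gamma = np.vdot(y, x_ref_mic) / np.vdot(y, y).real
    return gamma * y
```

Because a linear filter leaves only one complex scalar per frequency bin undetermined, a single coefficient is enough to settle both amplitude and phase.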
March 31, 2026