Video-Based Sound Source Separation

Published: September 18, 2018

Assignee: not available in USPTO data we have

Inventors: Johann Citerin, Gérald Kergourlay

Technical Abstract

Patent Claims

19 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A sound source separation method comprising: determining at least one location of a sound source based on video data, determining at least one time-independent parameter characterizing an audio signal emitted by the sound source in the video data, the at least one time-independent parameter being determined based on the at least one location and on the audio signal, determining at least one time-dependent parameter characterizing the audio signal emitted by the sound source based on the at least one time-independent parameter and on the audio signal, and separating the audio signal from a combination of audio signals based on the at least one time-independent parameter and on the at least one time-dependent parameter.

2. The method according to claim 1, wherein determining the at least one time-independent parameter comprises determining an initial estimate of the at least one time-independent parameter, and comprises determining a final estimate of the at least one time-independent parameter using an expectation-maximization method, wherein determining the at least one time-dependent parameter comprises determining an initial estimate of the at least one time-dependent parameter and comprises determining a final estimate of the at least one time-dependent parameter using the expectation-maximization method, and wherein separating the audio signal from the combination of audio signals is based on the final estimate of the at least one time-independent parameter and on the final estimate of the at least one time-dependent parameter.

3. The method according to claim 1, wherein the at least one time-independent parameter is a spatial covariance matrix, and wherein the at least one time-dependent parameter is a power spectrum.

4. A sound source separation device comprising: one or more processors; and memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for: determining at least one location of at least one sound source based on video data, determining initial estimates of at least two parameters characterizing an audio signal emitted by the sound source, the initial estimates of the at least two parameters being determined based on the at least one location, determining final estimates of the at least two parameters characterizing the audio signal using an expectation-maximization method; detecting a noise signal based on the video data, determining at least one parameter characterizing the noise signal, and separating the audio signal from a combination of audio signals based on the final estimates of the at least two parameters characterizing the audio signal and on the at least one parameter characterizing the noise signal.

5. The device according to claim 4, wherein determining the initial estimates of the at least two parameters characterizing the audio signal comprises determining time-independent spatial parameters.

6. The device according to claim 4, wherein determining the initial estimates of the at least two parameters characterizing the audio signal is part of an initialization of a sound propagation model.

7. The device according to claim 4, wherein determining the initial estimates of the at least two parameters characterizing the audio signal comprises determining power spectra parameters.

8. The device according to claim 4, wherein the video data comprises video surveillance data.

9. The device according to claim 4, wherein determining the at least one parameter characterizing the noise signal includes determining initial estimates of at least two parameters characterizing the noise signal, wherein the expectation-maximization method is also performed for determining final estimates of the at least two parameters characterizing the noise signal, and wherein separating the audio signal from the combination of audio signals is based on the final estimates of the at least two parameters characterizing the noise signal.

10. The device according to claim 4, wherein determining the at least one location of the at least one sound source based on the video data is performed using binary masking.

11. The device according to claim 4, wherein the one or more programs include further instructions for: determining a first frequency spectrum and a first activity parameter for a first separated signal of the combination of audio signals, determining a second frequency spectrum and a second activity parameter for a second separated signal of the combination of audio signals, and removing interferences from the second separated signal based on the first and second frequency spectra and activity parameters, thereby obtaining an enhanced separated signal.

12. The device according to claim 11, wherein the second separated signal is phase independent.

13. The device according to claim 11, wherein the second separated signal is obtained from a processing of stereo signals from a microphone array.

14. The device according to claim 11, wherein the second separated signal is obtained from an averaging of stereo signals.

15. The device according to claim 11, wherein the one or more programs include further instructions for normalizing the first separated signal and the second separated signal.

16. The device according to claim 11, wherein determining the first and second frequency spectra and activity parameters comprises applying a filter corresponding to the sound perception of a human ear.

17. The device according to claim 11, wherein removing the interferences comprises an anomaly detection.

18. A non-transitory information storage means readable by a computer or a microprocessor storing a computer program that includes instructions for: determining at least one location of a sound source based on video data, determining at least one time-independent parameter characterizing an audio signal emitted by the sound source in the video data, the at least one time-independent parameter being determined based on the at least one location and on the audio signal, determining at least one time-dependent parameter characterizing the audio signal emitted by the sound source based on the at least one time-independent parameter and on the audio signal, and separating the audio signal from a combination of audio signals based on the at least one time-independent parameter and on the at least one time-dependent parameter.

19. The non-transitory information storage means of claim 18, wherein the at least one time-independent parameter is a spatial covariance matrix, and wherein the at least one time-dependent parameter is a power spectrum.
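
The pipeline recited in claims 1 to 3 (location from video, time-independent spatial parameter, time-dependent power spectrum, expectation-maximization refinement, separation) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the widely used local Gaussian model with a multichannel Wiener filter, a two-microphone far-field array, and source directions that some video analysis has already produced. All function names, the microphone spacing, and the angles below are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def init_spatial_cov(angle_rad, freqs_hz, mic_spacing_m=0.1, eps=1e-2):
    """Initial time-independent parameter: one spatial covariance per frequency.

    Assumption: far-field source and a two-microphone array, so the
    video-derived direction maps to a steering vector (rank-1 model),
    regularized by eps so the matrix is invertible.
    """
    delay = mic_spacing_m * np.sin(angle_rad) / SPEED_OF_SOUND
    steer = np.stack([np.ones_like(freqs_hz, dtype=complex),
                      np.exp(-2j * np.pi * freqs_hz * delay)], axis=-1)  # (F, 2)
    R = steer[:, :, None] * steer[:, None, :].conj()                     # (F, 2, 2)
    return R + eps * np.eye(2)

def wiener_filters(v, R):
    """v: (J, F, N) power spectra, R: (J, F, I, I) spatial covariances.

    Returns per-source covariances S and Wiener filters W, both (J, F, N, I, I).
    """
    S = v[..., None, None] * R[:, :, None, :, :]
    W = S @ np.linalg.inv(S.sum(axis=0))
    return S, W

def em_step(X, v, R):
    """One EM iteration for the local Gaussian model. X: (F, N, I) mixture STFT."""
    S, W = wiener_filters(v, R)
    # E-step: posterior mean and second moment of each source image.
    c_hat = np.einsum('jfnab,fnb->jfna', W, X)
    C_hat = (c_hat[..., :, None] * c_hat[..., None, :].conj()
             + (np.eye(X.shape[-1]) - W) @ S)
    # M-step: update the time-dependent power spectra, then the
    # time-independent spatial covariances (averaged over frames).
    R_inv = np.linalg.inv(R)[:, :, None]
    v_new = np.real(np.trace(R_inv @ C_hat, axis1=-2, axis2=-1)) / X.shape[-1]
    v_new = np.maximum(v_new, 1e-12)
    R_new = (C_hat / v_new[..., None, None]).mean(axis=2)
    return v_new, R_new

def separate(X, v, R):
    """Apply the Wiener filters; returns source images of shape (J, F, N, I)."""
    _, W = wiener_filters(v, R)
    return np.einsum('jfnab,fnb->jfna', W, X)

# Synthetic stand-in for a stereo mixture STFT (F bins, N frames, 2 channels).
rng = np.random.default_rng(0)
F, N = 8, 20
freqs = np.linspace(100.0, 4000.0, F)
X = rng.standard_normal((F, N, 2)) + 1j * rng.standard_normal((F, N, 2))

# Hypothetical source directions, as if obtained from video analysis.
angles = [np.deg2rad(-30.0), np.deg2rad(40.0)]
R = np.stack([init_spatial_cov(a, freqs) for a in angles])  # initial estimates
v = np.ones((len(angles), F, N))                            # initial estimates

for _ in range(3):                                          # final estimates
    v, R = em_step(X, v, R)

sources = separate(X, v, R)
```

Because the Wiener filters of all modeled sources sum to the identity at every time-frequency bin, the separated source images add back up to the mixture. A noise component, as in claims 4 and 9, could be modeled in this sketch as one more entry in `v` and `R`.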

Patent Metadata

Filing Date: Unknown

Publication Date: September 18, 2018

Inventors: Johann Citerin, Gérald Kergourlay
