Hybrid Approach in Voice Conversion

Published: July 17, 2012

Assignee: not available in the USPTO data we have

Inventors: Jilei Tian, Victor Popa, Jani Kristian Nurminen

Patent Claims

42 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising: processing a set of source sounds to generate a source feature vector and processing a set of target sounds to generate a target feature vector; aligning the source feature vector with the target feature vector to generate a joint variable; estimating a probability density function for the joint variable, the probability density function including a mean vector; and training, by one or more processors, a mixture model based on the joint variable by a process that includes: selecting a mixture mean pair from the mean vector, deriving a source spectral envelope and a target spectral envelope for the selected mixture mean pair, and generating a mixture specific warping function for the selected mixture mean pair based on the target and source spectral envelopes.

2. The method of claim 1, further comprising: receiving a source sound; applying linear prediction to the source sound to generate a second source feature vector; calculating a mixture weight for the second source feature vector; and generating a warped feature vector by applying a function to the second source feature vector, the function including the mixture weight, the mixture specific warping function for the mixture mean pair, and other mixture specific warping functions for other mixture mean pairs selected from the mean vector.

3. The method of claim 1, wherein the set of source sounds is divided into a plurality of source segments and the set of target sounds is divided into a plurality of target segments, wherein aligning the source feature vector with the target feature vector comprises aligning source parameters derived from a first source segment with target parameters derived from a target segment of a corresponding acoustic event.

4. The method of claim 1, wherein generating the mixture specific warping function for the mixture mean pair includes: identifying one or more first peaks from the source spectral envelope; identifying one or more second peaks from the target spectral envelope; identifying a set of nodes representing possible aligned formant pairings of the source spectral envelope with the target spectral envelope, each node of the set of nodes being located at an intersection between a peak from the one or more first peaks and a peak from the one or more second peaks.

5. The method of claim 4, wherein generating the mixture specific warping function for the mixture mean pair includes: identifying one or more paths based on the set of nodes; calculating a node cost for each node in the set of nodes; for each of the one or more paths, calculating a path cost based on a sum of node costs that correspond to nodes along the path; and selecting a particular path from the one or more paths based on path costs of the one or more paths.

6. The method of claim 5, wherein generating the mixture specific warping function for the mixture mean pair includes applying curve fitting to the nodes along the particular path to derive the mixture specific warping function for the mixture mean pair.

7. The method of claim 1, wherein each of the source feature vector and the target feature vector comprise at least one of a line spectral frequency coefficient, energy information, amplitude information, pitch information, and voicing information.

8. The method of claim 1, wherein processing the set of source sounds and processing the set of target sounds generates a line spectral frequency representation of the set of source sounds and the set of target sounds.

9. The method of claim 1, wherein training the mixture model based on the joint variable includes generating a plurality of mixture specific warping functions, each warping function in the plurality of mixture specific warping functions corresponding to a specific mixture mean pair from the mean vector, and wherein one of the warping functions in the plurality of mixture specific warping functions is the mixture specific warping function for the mixture mean pair.

10. An apparatus comprising: one or more processors; and one or more non-transitory computer readable media storing computer readable instructions configured to, with the one or more processors, cause the apparatus to at least: process a set of source sounds to generate a source feature vector and process a set of target sounds to generate a target feature vector; align the source feature vector with the target feature vector to generate a joint variable; estimate a probability density function for the joint variable, the probability density function including a mean vector; and train a mixture model based on the joint variable by a process that includes: selecting a mixture mean pair from the mean vector, deriving a source spectral envelope and a target spectral envelope for the selected mixture mean pair, and generating a mixture specific warping function for the selected mixture mean pair based on the target and source spectral envelopes.

11. The apparatus of claim 10, wherein the one or more computer readable media further store computer readable instructions configured to, with the one or more processors, cause the apparatus to: receive a source sound; apply linear prediction to the source sound to generate a second source feature vector; calculate a mixture weight for the second source feature vector; and generate a warped feature vector by applying a function to the second source feature vector, the function including the mixture weight, the mixture specific warping function for the mixture mean pair, and other mixture specific warping functions for other mixture mean pairs selected from the mean vector.

12. The apparatus of claim 10, wherein the set of source sounds is divided into a plurality of source segments and the set of target sounds is divided into a plurality of target segments, wherein aligning the source feature vector with the target feature vector comprises aligning source parameters derived from a first source segment with target parameters derived from a target segment of a corresponding acoustic event.

13. The apparatus of claim 10, wherein generating the mixture specific warping function for the mixture mean pair includes: identifying one or more first peaks from the source spectral envelope; identifying one or more second peaks from the target spectral envelope; identifying a set of nodes representing possible aligned formant pairings of the source spectral envelope with the target spectral envelope, each node of the set of nodes being located at an intersection between a peak from the one or more first peaks and a peak from the one or more second peaks.

14. The apparatus of claim 13, wherein generating the mixture specific warping function for the mixture mean pair includes: identifying one or more paths based on the set of nodes; calculating a node cost for each node in the set of nodes; for each of the one or more paths, calculating a path cost based on a sum of node costs that correspond to nodes along the path; and selecting a particular path from the one or more paths based on path costs of the one or more paths.

15. The apparatus of claim 14, wherein generating the mixture specific warping function for the mixture mean pair includes applying curve fitting to the nodes along the particular path to derive the mixture specific warping function for the mixture mean pair.

16. The apparatus of claim 10, wherein each of the source feature vector and the target feature vector comprise at least one of a line spectral frequency coefficient, energy information, amplitude information, pitch information, and voicing information.

17. The apparatus of claim 10, wherein processing the set of source sounds and processing the set of target sounds generates a line spectral frequency representation of the set of source sounds and the set of target sounds.

18. The apparatus of claim 10, wherein training the mixture model based on the joint variable includes generating a plurality of mixture specific warping functions, each warping function in the plurality of mixture specific warping functions corresponding to a specific mixture mean pair from the mean vector, and wherein one of the warping functions in the plurality of mixture specific warping functions is the mixture specific warping function for the mixture mean pair.

19. One or more non-transitory computer readable media storing computer readable instructions configured to, when executed, cause a processor to at least: process a set of source sounds to generate a source feature vector and process a set of target sounds to generate a target feature vector; align the source feature vector with the target feature vector to generate a joint variable; estimate a probability density function for the joint variable, the probability density function including a mean vector; and train a mixture model based on the joint variable by a process that includes: selecting a mixture mean pair from the mean vector, deriving a source spectral envelope and a target spectral envelope for the selected mixture mean pair, and generating a mixture specific warping function for the selected mixture mean pair based on the target and source spectral envelopes.

20. The one or more computer readable media of claim 19, further storing computer executable instructions configured to, when executed, cause the processor to: receive a source sound; apply linear prediction to the source sound to generate a second source feature vector; calculate a mixture weight for the second source feature vector; and generate a warped feature vector by applying a function to the second source feature vector, the function including the mixture weight, the mixture specific warping function for the mixture mean pair, and the other mixture specific warping functions for other mixture mean pairs selected from the mean vector.

21. The one or more computer readable media of claim 19, wherein the set of source sounds is divided into a plurality of source segments and the set of target sounds is divided into a plurality of target segments, wherein aligning the source feature vector with the target feature vector comprises aligning source parameters derived from a first source segment with target parameters derived from a target segment of a corresponding acoustic event.

22. The one or more computer readable media of claim 19, wherein generating the mixture specific warping function for the mixture mean pair includes: identifying one or more first peaks from the source spectral envelope; identifying one or more second peaks from the target spectral envelope; identifying a set of nodes representing possible aligned formant pairings of the source spectral envelope with the target spectral envelope, each node of the set of nodes being located at an intersection between a peak from the one or more first peaks and a peak from the one or more second peaks.

23. The one or more computer readable media of claim 22, wherein generating the mixture specific warping function for the mixture mean pair includes: identifying one or more paths based on the set of nodes; calculating a node cost for each node in the set of nodes; for each of the one or more paths, calculating a path cost based on a sum of node costs that correspond to nodes along the path; and selecting a particular path from the one or more paths based on path costs of the one or more paths.

24. The one or more computer readable media of claim 23, wherein generating the mixture specific warping function for the mixture mean pair includes applying curve fitting to the nodes along the particular path to derive the mixture specific warping function for the mixture mean pair.

25. The one or more computer readable media of claim 19, wherein each of the source feature vector and the target feature vector comprise at least one of a line spectral frequency coefficient, energy information, amplitude information, pitch information, and voicing information.

26. The one or more computer readable media of claim 19, wherein processing the set of source sounds and processing the set of target sounds generates a line spectral frequency representation of the set of source sounds and the set of target sounds.

27. The one or more computer readable media of claim 19, wherein training the mixture model based on the joint variable includes generating a plurality of mixture specific warping functions, each warping function in the plurality of mixture specific warping functions corresponding to a specific mixture mean pair from the mean vector, and wherein one of the warping functions in the plurality of mixture specific warping functions is the mixture specific warping function for the mixture mean pair.

28. A method comprising: receiving a sound; applying linear prediction to the sound to generate a feature vector; providing a plurality of mixture specific warping functions, each warping function in the plurality of mixture specific warping functions being specific to one mixture mean pair from a mean vector of a probability density function and being generated based on target and source spectral envelopes derived from the specific mixture mean pair, wherein the probability density function is for a source speaker and a target speaker; calculating a mixture weight for the feature vector; and generating, by one or more processors, a warped feature vector by applying a function to the feature vector, the function including the mixture weight and the plurality of mixture specific functions, wherein a second sound generated based on the warped feature vector approximates a target sound from the target speaker.

29. The method of claim 28, wherein the method further comprises: creating a linear prediction coefficient vector based on the feature vector; and calculating a spectral envelope of the linear prediction coefficient vector.

30. The method of claim 29, wherein the warping function is applied to the spectral envelope to generate a warped spectral envelope.

31. The method of claim 30, further comprising: deriving a warped linear prediction coefficient vector from the warped spectral envelope; converting the warped linear prediction coefficient vector to the warped feature vector; and generating sound based on the warped feature vector.

32. The method of claim 31, further comprising: generating a warped spectral envelope estimate based on the warped linear prediction coefficient vector; and calculating a residual spectrum based on a difference between the warped spectral envelope and the warped spectral envelope estimate.

33. An apparatus comprising: one or more processors; and one or more non-transitory computer readable media storing computer readable instructions configured to, with the one or more processors, cause the apparatus to at least: receive a sound; apply linear prediction to the sound to generate a feature vector; provide a plurality of mixture specific warping functions, each warping function in the plurality of mixture specific warping functions being specific to one mixture mean pair from a mean vector of a probability density function and being generated based on target and source spectral envelopes derived from the specific mixture mean pair, wherein the probability density function is for a source speaker and a target speaker; calculate a mixture weight for the feature vector; and generate a warped feature vector by applying a function to the feature vector, the function including the mixture weight and the plurality of mixture specific warping functions, wherein a second sound generated based on the warped feature vector approximates a target sound from the target speaker.

34. The apparatus of claim 33, wherein the one or more computer readable media further store computer readable instructions configured to, with the one or more processors, cause the apparatus to: create a linear prediction coefficient vector based on the feature vector; and calculate a spectral envelope of the linear prediction coefficient vector.

35. The apparatus of claim 34, wherein the warping function is applied to the spectral envelope to generate a warped spectral envelope.

36. The apparatus of claim 35, wherein the one or more computer readable media further store computer readable instructions configured to, with the one or more processors, cause the apparatus to: derive a warped linear prediction coefficient vector from the warped spectral envelope; convert the warped linear prediction coefficient vector to the warped feature vector; and generate sound based on the warped feature vector.

37. The apparatus of claim 36, wherein the one or more computer readable media further store computer readable instructions configured to, with the one or more processors, cause the apparatus to: generate a warped spectral envelope estimate based on the warped linear prediction coefficient vector; and calculate a residual spectrum based on a difference between the warped spectral envelope and the warped spectral envelope estimate.

38. One or more non-transitory computer readable media storing computer readable instructions configured to, when executed, cause a processor to at least: receive a sound; apply linear prediction to the sound to generate a feature vector; provide a mixture model comprising a plurality of mixture specific warping functions, each warping function in the plurality of mixture specific warping functions being specific to one mixture mean pair from a mean vector of a probability density function and being generated based on target and source spectral envelopes derived from the specific mixture mean pair, wherein the probability density function is for a source speaker and a target speaker; calculate a mixture weight for the feature vector; and generate a warped feature vector by applying a function to the feature vector, the function including the mixture weight and the plurality of mixture specific warping functions, wherein a second sound generated based on the warped feature vector approximates a target sound from the target speaker.

39. The one or more computer readable media of claim 38, further storing computer readable instructions configured to, when executed, cause the processor to: create a linear prediction coefficient vector based on the feature vector; and calculate a spectral envelope of the linear prediction coefficient vector.

40. The one or more computer readable media of claim 39, wherein the warping function is applied to the spectral envelope to generate a warped spectral envelope.

41. The one or more computer readable media of claim 40, further storing computer readable instructions configured to, when executed, cause the processor to: derive a warped linear prediction coefficient vector from the warped spectral envelope; convert the warped linear prediction coefficient vector to the warped feature vector; and generate sound based on the warped feature vector.

42. The one or more computer readable media of claim 41, further storing computer readable instructions configured to, when executed, cause the processor to: generate a warped spectral envelope estimate based on the warped linear prediction coefficient vector; and calculate a residual spectrum based on a difference between the warped spectral envelope and the warped spectral envelope estimate.
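
Illustrative sketch of the training-side steps in claims 4 to 6 (mirrored in claims 13 to 15 and 22 to 24): spectral peaks are paired into nodes, paths through the nodes are scored by summed node cost, and a warping function is fitted through the nodes of the selected path. The claims do not define the node cost, the set of admissible paths, or the curve-fitting method, so everything specific here is an assumption made for illustration: the node cost is the absolute frequency distance, a path pairs as many peaks as the shorter peak list allows while preserving frequency order, and the fit is piecewise linear. The function names `local_peaks`, `select_path`, and `fit_warping` are hypothetical and do not come from the patent.

```python
from itertools import combinations

import numpy as np


def local_peaks(envelope):
    """Return indices of strict local maxima of a sampled spectral envelope."""
    e = np.asarray(envelope, dtype=float)
    inner = (e[1:-1] > e[:-2]) & (e[1:-1] > e[2:])
    return np.flatnonzero(inner) + 1


def select_path(src_peaks, tgt_peaks):
    """Choose the order-preserving set of (source, target) nodes with the
    lowest path cost, where path cost is the sum of node costs.

    Both peak lists are assumed to be sorted ascending (in Hz).
    """
    k = min(len(src_peaks), len(tgt_peaks))
    best_cost, best_path = float("inf"), []
    for s_idx in combinations(range(len(src_peaks)), k):
        for t_idx in combinations(range(len(tgt_peaks)), k):
            path = [(src_peaks[i], tgt_peaks[j]) for i, j in zip(s_idx, t_idx)]
            # Assumed node cost: absolute frequency distance between the peaks.
            cost = sum(abs(s - t) for s, t in path)
            if cost < best_cost:
                best_cost, best_path = cost, path
    return best_path, best_cost


def fit_warping(path, f_max):
    """Fit a piecewise-linear warping function through the path nodes,
    anchored at (0, 0) and (f_max, f_max). Peaks must lie strictly
    between 0 and f_max."""
    xs = [0.0] + [s for s, _ in path] + [f_max]
    ys = [0.0] + [t for _, t in path] + [f_max]
    return lambda f: np.interp(f, xs, ys)


# Example with made-up formant-like peak frequencies in Hz.
path, cost = select_path([500.0, 1500.0, 2500.0],
                         [600.0, 1200.0, 1700.0, 2700.0])
warp = fit_warping(path, 4000.0)
```

Exhaustive enumeration is used only because a spectral envelope has a handful of peaks; a dynamic-programming search over the same node grid would be the natural replacement for larger inputs.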
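
Illustrative sketch of the conversion-side steps in claims 28 and 30 (mirrored in claims 33, 35, 38, and 40): a mixture weight is calculated for the incoming feature vector, the mixture specific warping functions are combined using those weights, and the result is applied to a spectral envelope. The claims do not fix the form of the weight or of the combining function, so this sketch assumes a Gaussian mixture with diagonal covariances over source features, takes the mixture weight to be the posterior probability of each component, and combines the warping functions as a weighted sum. The names `mixture_weights`, `combined_warp`, and `warp_envelope` are hypothetical.

```python
import numpy as np


def mixture_weights(x, priors, means, variances):
    """Posterior probability of each mixture component given feature
    vector x, assuming diagonal-covariance Gaussians."""
    x = np.asarray(x, dtype=float)
    priors = np.asarray(priors, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    log_p = (np.log(priors)
             - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
             - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    log_p -= log_p.max()  # subtract the maximum for numerical stability
    p = np.exp(log_p)
    return p / p.sum()


def combined_warp(freqs, weights, warps):
    """Weighted sum of the mixture specific warping functions (assumed form)."""
    freqs = np.asarray(freqs, dtype=float)
    return sum(w * warp(freqs) for w, warp in zip(weights, warps))


def warp_envelope(envelope, freqs, warped_freqs):
    """Move the envelope value at each frequency f to its warped position,
    then resample onto the original grid. Requires warped_freqs to be
    increasing, which holds when every warping function is monotonic."""
    return np.interp(freqs, warped_freqs, envelope)


# Example with two made-up components and two made-up linear warps.
freqs = np.linspace(0.0, 4000.0, 5)
weights = mixture_weights([2.0, 2.0], [0.5, 0.5],
                          [[0.0, 0.0], [4.0, 4.0]],
                          [[1.0, 1.0], [1.0, 1.0]])
warps = [lambda f: 1.1 * f, lambda f: 0.9 * f]
warped_freqs = combined_warp(freqs, weights, warps)
```

Because the weights are recomputed for every frame, the effective warping function varies smoothly between the per-mixture functions instead of switching abruptly from one to another.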

Patent Metadata

Filing Date

Unknown
