Systems and methods for adapting human speaker embeddings in speech synthesis

Published: March 12, 2024

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

Novel methods and systems for adapting a voice cloning synthesizer for a new speaker using real speech data are disclosed. Utterances from one or more target speakers are parameterized and are used to initialize an embedding vector for use with a voice synthesizer, by means of clustering the utterance data and determining the centroid of the data, using a speaker identification neural network, and/or by finding the closest stored embedded vector to the utterance data.
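Two of the initialization routes named in the abstract, taking the centroid of the utterance data and finding the closest stored embedding vector, can be illustrated with a minimal Python sketch. The function names, the two-dimensional toy vectors, the `stored` table, and the choice of Euclidean distance are all assumptions made for illustration; the abstract does not specify a distance metric or a data layout.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def closest_stored(query, table):
    """Return the (speaker_id, embedding) entry of `table` nearest to `query`.

    Euclidean distance is an assumption; another metric could be swapped in.
    """
    return min(table.items(), key=lambda item: math.dist(query, item[1]))


# Toy utterance embeddings for one target speaker (hypothetical values).
utterances = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]

# Hypothetical table of embeddings the synthesizer already stores.
stored = {"speaker_a": [1.0, 0.2], "speaker_b": [-1.0, 0.5]}

initial_embedding = centroid(utterances)
nearest_id, nearest_vector = closest_stored(initial_embedding, stored)
```

Either `initial_embedding` or `nearest_vector` could then serve as the starting point for adapting the synthesizer to the new speaker; which one works better in practice is not something this sketch can say.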

Patent Claims

14 claims

Legal claims defining the scope of protection, as filed with the USPTO.

3. The method of claim 2, wherein the voice identification system is a neural network.

6. The method of claim 4, wherein each cluster has a threshold distance from its centroid and the adapting further comprises fine-tuning based on the at least one embedding vector of the target style in the threshold distance.

7. The method of claim 4, wherein the speech synthesizer is a neural network.

8. The method of claim 4, wherein extracting features further comprises combining sample embedding vectors extracted from window samples of a waveform of the at least one waveform to produce an embedding vector for the waveform.

9. The method of claim 8, wherein the combining comprises averaging the sample embedding vectors.

10. The method of claim 4, wherein the input is from a film or video source.

11. The method of claim 4, wherein the target style comprises a speaking style of a target person.

12. The method of claim 11, wherein the target style further comprises at least one of age, accent, emotion, and acting role.

13. The method of claim 11, wherein the target person is an actor and the target style is the target person at an age younger than their current age.

15. The method of claim 14, further comprising determining an expected number of clusters prior to the clustering, wherein the clustering is based on the expected number of clusters.

16. The method of claim 15, wherein the determining an expected number of clusters uses a statistical analysis of the input.

17. The method of claim 4, further comprising updating a voice synthesizer table with the initial embedding vector.

18. A non-transitory computer readable medium configured to perform on a computer the method of claim 4.

19. A device configured to perform the method of claim 4.
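Some of the operations recited in claims 6, 8, 9, and 17 can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed implementation: `toy_embed` is a hypothetical stand-in for a learned embedding extractor, the window size and hop are arbitrary, Euclidean distance is assumed for the threshold test, and `synth_table` is a hypothetical dictionary standing in for the voice synthesizer table.

```python
import math


def windows(waveform, size, hop):
    """Yield fixed-size windows taken from a waveform (a list of samples)."""
    for start in range(0, len(waveform) - size + 1, hop):
        yield waveform[start:start + size]


def toy_embed(window):
    """Hypothetical extractor: returns [mean, mean energy] of the window."""
    mean = sum(window) / len(window)
    energy = sum(s * s for s in window) / len(window)
    return [mean, energy]


def average(vectors):
    """Element-wise average of equal-length vectors (claim 9 style combining)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def waveform_embedding(waveform, size=4, hop=2):
    """Combine per-window embeddings into one embedding for the waveform."""
    return average([toy_embed(w) for w in windows(waveform, size, hop)])


def within_threshold(vectors, center, threshold):
    """Keep vectors no farther than `threshold` from `center` (assumed metric)."""
    return [v for v in vectors if math.dist(v, center) <= threshold]


waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
embedding = waveform_embedding(waveform)

candidates = [[0.0, 0.5], [0.1, 0.5], [2.0, 2.0]]
kept = within_threshold(candidates, center=embedding, threshold=0.5)

# Hypothetical synthesizer table keyed by speaker name.
synth_table = {}
synth_table["new_speaker"] = embedding
```

Here `kept` holds the candidate vectors that fall inside the threshold distance and could be used for fine-tuning, while the far-away vector `[2.0, 2.0]` is excluded.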

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L

Patent Metadata

Filing Date

August 18, 2020

Publication Date

March 12, 2024
