US-10964308

Speech processing apparatus, and program

Published: March 30, 2021

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

A speech processing apparatus is provided in which, while face feature points are extracted from moving image data obtained by imaging a speaker's face, for each frame, a first generation network for generating face feature points of the corresponding frame based on speech feature data extracted from uttered speech of the speaker for each frame is generated, and whether the first generation network is appropriate is evaluated using an identification network, then, a second generation network for generating the uttered speech from a plurality of uncertain settings including at least text representing utterance content of the uttered speech and information indicating emotions included in the uttered speech, a plurality of types of fixed settings which define speech quality, and the face feature points generated by the first generation network evaluated as appropriate, is generated, and whether the second generation network is appropriate is evaluated using the identification network.
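The two-stage ordering in the abstract (generate a network, have an identification network judge it, and only then build the second network on top of the accepted first one) can be sketched as plain control flow. This is a minimal illustration, not the patented implementation: the function names, the callable interfaces, and the retry limit `max_attempts` are all hypothetical, and the abstract does not say whether a rejected network is regenerated.

```python
from typing import Callable, Tuple, TypeVar

Network = TypeVar("Network")


def generate_until_appropriate(
    generate: Callable[[int], Network],
    is_appropriate: Callable[[Network], bool],
    max_attempts: int = 5,  # hypothetical limit; not stated in the source
) -> Network:
    """Return the first candidate that the evaluator accepts."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if is_appropriate(candidate):
            return candidate
    raise RuntimeError("no candidate was evaluated as appropriate")


def build_two_stage(
    generate_first: Callable[[int], Network],
    judge_first: Callable[[Network], bool],
    generate_second: Callable[[Network, int], Network],
    judge_second: Callable[[Network], bool],
) -> Tuple[Network, Network]:
    """Build the second network only from an accepted first network."""
    first = generate_until_appropriate(generate_first, judge_first)
    second = generate_until_appropriate(
        lambda attempt: generate_second(first, attempt), judge_second
    )
    return first, second
```

In a real system the callables would wrap training of a generator and scoring by a discriminator-style identification network; here they are left abstract so the gating order is the only thing shown.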

Patent Claims

12 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A speech processing apparatus comprising: an extracting means configured to separate moving image data obtained by imaging a face of a speaker in an utterance period into frames having a predetermined time length, and extract face feature point data indicating positions of face feature points determined in advance, for each frame; a first generating means configured to separate speech data representing uttered speech of the speaker in the utterance period into the frames and generate a first generation network for generating face feature points of each frame from speech feature data of the corresponding frame; a first evaluating means configured to evaluate whether or not the first generation network is appropriate using the face feature point data extracted from each frame using a first identification network; a second generating means configured to cause a user to designate a plurality of types of uncertain settings including at least text representing utterance content of the uttered speech and information indicating emotions included in the uttered speech, cause the user to designate a plurality of types of fixed settings which define speech quality of the speaker, and generate a second generation network for generating the uttered speech from the face feature points generated by the first generation network evaluated as appropriate by the first evaluating means, the plurality of types of fixed settings and the plurality of types of uncertain settings designated by the user; and a second evaluating means configured to evaluate whether or not the second generation network is appropriate using the speech data using a second identification network.
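Claim 1 separates both the moving image data and the speech data into frames of the same predetermined time length so that face feature points and speech feature data can be matched frame by frame. The sketch below shows one way that alignment might look for the speech side. The names `split_into_frames` and `pair_frames`, the choice to drop a trailing partial frame, and the truncation to the shorter sequence are assumptions made for illustration; the claim does not specify them.

```python
from typing import List, Sequence, Tuple


def split_into_frames(
    samples: Sequence[float], sample_rate: int, frame_seconds: float
) -> List[List[float]]:
    """Split samples into consecutive frames of a fixed time length.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate * frame_seconds))
    if frame_len <= 0:
        raise ValueError("a frame must contain at least one sample")
    n_frames = len(samples) // frame_len
    return [
        list(samples[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]


def pair_frames(
    speech_frames: Sequence[Sequence[float]],
    face_point_frames: Sequence[Sequence[Tuple[float, float]]],
) -> List[Tuple[Sequence[float], Sequence[Tuple[float, float]]]]:
    """Pair speech frame i with the face feature points of frame i.

    zip() stops at the shorter input, so unmatched frames are ignored.
    """
    return list(zip(speech_frames, face_point_frames))
```

For example, 10 samples at a hypothetical rate of 4 samples per second with 0.5-second frames give five frames of two samples each, which can then be paired with five frames of extracted face feature points.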

2. The speech processing apparatus according to claim 1, further comprising: a designation accepting means configured to encourage the user to designate fixed settings and uncertain settings for speech to be synthesized; and a speech synthesizing means configured to synthesize speech corresponding to the fixed settings and the uncertain settings designated for the designation accepting means using the second generation network evaluated as appropriate by the second evaluating means.

3. The speech processing apparatus according to claim 2, wherein the designation accepting means displays a color map in which different colors are associated with respective emotions, on a display apparatus, and causes the user to designate emotions to be included in speech to be synthesized through designation of colors.
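The color map of claim 3 associates colors with emotions and lets the user pick an emotion by picking a color. A minimal sketch of that lookup follows. The emotion labels, the RGB values, and the nearest-color rule are all hypothetical; the claim only requires that different colors be associated with respective emotions.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]

# Hypothetical color map: the claim does not name emotions or colors.
EMOTION_COLORS: Dict[str, RGB] = {
    "joy": (255, 215, 0),
    "anger": (220, 20, 60),
    "sadness": (30, 144, 255),
    "neutral": (128, 128, 128),
}


def emotion_for_color(rgb: RGB) -> str:
    """Return the emotion whose map color is closest to the chosen color."""

    def squared_distance(color: RGB) -> int:
        return sum((a - b) ** 2 for a, b in zip(rgb, color))

    return min(
        EMOTION_COLORS,
        key=lambda emotion: squared_distance(EMOTION_COLORS[emotion]),
    )
```

Snapping to the nearest map color is one design choice that tolerates imprecise selection on a display; an implementation could equally restrict the user to the exact colors shown.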

4. The speech processing apparatus according to claim 2, wherein the designation accepting means accepts more designations of information indicating emotions as a character string length of text is longer.
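Claim 4 ties the number of emotion designations the apparatus accepts to the character string length of the text. The claim gives no formula, so the sketch below assumes a simple rule: one designation per block of `chars_per_designation` characters, with at least one designation allowed. Both the function name and the default block size are hypothetical.

```python
import math


def max_emotion_designations(text: str, chars_per_designation: int = 10) -> int:
    """Return how many emotion designations to accept for this text.

    Assumed rule: one designation per started block of characters,
    and never fewer than one. Longer text never yields a smaller count.
    """
    if chars_per_designation <= 0:
        raise ValueError("chars_per_designation must be positive")
    return max(1, math.ceil(len(text) / chars_per_designation))
```

Under this assumed rule a 5-character text accepts one designation and a 25-character text accepts three, which matches the claim's requirement that longer text accepts more designations.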

5. The speech processing apparatus according to claim 1, wherein the second generating means comprises: a single network generating means configured to generate the second generation network for each setting of the plurality of types of fixed settings and the plurality of types of uncertain settings; a multi-network generating means configured to generate the second generation network so that, for each combination of a plurality of settings except at least one setting among the plurality of types of fixed settings and the plurality of types of uncertain settings, each of the plurality of settings does not affect other settings; and an all-network generating means configured to generate the second generation network so that each of the plurality of types of fixed settings and the plurality of types of uncertain settings does not affect other settings.
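The three generating means of claim 5 operate on different groupings of the settings: each setting alone, each combination of several settings that leaves out at least one, and all settings together. One reading of those groupings can be enumerated as below. The setting names are hypothetical, and interpreting "a plurality of settings except at least one setting" as every subset of size 2 to n-1 is an assumption of this sketch, not something the claim states outright.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def setting_groups(settings: Sequence[str]) -> Dict[str, List[Tuple[str, ...]]]:
    """Enumerate the setting groupings for the three generating stages."""
    n = len(settings)
    # Single stage: one group per individual setting.
    single = [(s,) for s in settings]
    # Multi stage: every combination of 2..n-1 settings, so each group
    # contains several settings but omits at least one.
    multi = [
        combo
        for size in range(2, n)
        for combo in combinations(settings, size)
    ]
    # All stage: the complete set of fixed and uncertain settings.
    everything = [tuple(settings)]
    return {"single": single, "multi": multi, "all": everything}
```

With four hypothetical settings such as `["speaker_quality", "language", "text", "emotion"]`, this yields 4 single groups, 10 multi groups (6 pairs and 4 triples), and 1 group containing everything.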

6. A computer program product embodying computer readable instructions stored on a non-transitory computer-readable medium for causing a computer to execute the computer readable instructions by a processor so as to perform the steps of: separating moving image data obtained by imaging a face of a speaker in an utterance period into frames having a predetermined time length and extracting face feature point data indicating positions of face feature points determined in advance, for each of the frames; separating speech data representing uttered speech of the speaker in the utterance period into the frames and generating a first generation network for generating the face feature points of each of the frames from speech feature data of a corresponding frame of the frames; first evaluating whether or not the first generation network is appropriate using the face feature point data extracted from each of the frames using a first identification network; causing a user to designate a plurality of types of uncertain settings including at least text representing utterance content of the uttered speech and information indicating emotions included in the uttered speech, causing the user to designate a plurality of types of fixed settings which define speech quality of the speaker, and generating a second generation network for generating the uttered speech from the face feature points generated by the first generation network evaluated as appropriate in the first evaluation, the plurality of types of fixed settings and the plurality of types of uncertain settings designated by the user; and second evaluating whether or not the second generation network is appropriate using the speech data using a second identification network.

7. The computer program product according to claim 6, for causing the computer to further execute the computer readable instructions by the processor so as to perform the steps of: encouraging the user to designate fixed settings and uncertain settings for speech to be synthesized; and synthesizing speech corresponding to the designated fixed settings and the designated uncertain settings by the user using the second generation network evaluated as appropriate in the second evaluation.

8. The speech processing apparatus according to claim 3, wherein the designation accepting means accepts more designations of information indicating emotions as a character string length of text is longer.

9. The speech processing apparatus according to claim 2, wherein the second generating means comprises: a single network generating means configured to generate the second generation network for each setting of the plurality of types of fixed settings and the plurality of types of uncertain settings; a multi-network generating means configured to generate the second generation network so that, for each combination of a plurality of settings except at least one setting among the plurality of types of fixed settings and the plurality of types of uncertain settings, each of the plurality of settings does not affect other settings; and an all-network generating means configured to generate the second generation network so that each of the plurality of types of fixed settings and the plurality of types of uncertain settings does not affect other settings.

10. The speech processing apparatus according to claim 3, wherein the second generating means comprises: a single network generating means configured to generate the second generation network for each setting of the plurality of types of fixed settings and the plurality of types of uncertain settings; a multi-network generating means configured to generate the second generation network so that, for each combination of a plurality of settings except at least one setting among the plurality of types of fixed settings and the plurality of types of uncertain settings, each of the plurality of settings does not affect other settings; and an all-network generating means configured to generate the second generation network so that each of the plurality of types of fixed settings and the plurality of types of uncertain settings does not affect other settings.

11. The speech processing apparatus according to claim 4, wherein the second generating means comprises: a single network generating means configured to generate the second generation network for each setting of the plurality of types of fixed settings and the plurality of types of uncertain settings; a multi-network generating means configured to generate the second generation network so that, for each combination of a plurality of settings except at least one setting among the plurality of types of fixed settings and the plurality of types of uncertain settings, each of the plurality of settings does not affect other settings; and an all-network generating means configured to generate the second generation network so that each of the plurality of types of fixed settings and the plurality of types of uncertain settings does not affect other settings.

12. The speech processing apparatus according to claim 8, wherein the second generating means comprises: a single network generating means configured to generate the second generation network for each setting of the plurality of types of fixed settings and the plurality of types of uncertain settings; a multi-network generating means configured to generate the second generation network so that, for each combination of a plurality of settings except at least one setting among the plurality of types of fixed settings and the plurality of types of uncertain settings, each of the plurality of settings does not affect other settings; and an all-network generating means configured to generate the second generation network so that each of the plurality of types of fixed settings and the plurality of types of uncertain settings does not affect other settings.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G10L G06V G06N

Patent Metadata

Filing Date

October 29, 2018

Publication Date

March 30, 2021
