This specification describes a computer-implemented method of generating context-dependent speech audio in a video game. The method comprises obtaining contextual information relating to a state of the video game. The contextual information is inputted into a prosody prediction module. The prosody prediction module comprises a trained machine learning model which is configured to generate predicted prosodic features based on the contextual information. Input data comprising the predicted prosodic features and speech content data associated with the state of the video game is inputted into a speech audio generation module. An encoded representation of the speech content data dependent on the predicted prosodic features is generated using one or more encoders of the speech audio generation module. Context-dependent speech audio is generated, based on the encoded representation, using a decoder of the speech audio generation module.
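The data flow described above (contextual information, then predicted prosodic features, then an encoded representation, then decoded audio) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the trained machine learning model, the encoders, and the decoder are replaced by toy arithmetic stand-ins, and every name here (`ProsodicFeatures`, `predict_prosody`, `encode`, `decode`, `generate_speech_audio`, the `excitement` key) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProsodicFeatures:
    """Hypothetical prosodic features; the source does not enumerate them."""
    pitch: float   # relative pitch multiplier
    energy: float  # relative loudness multiplier
    rate: float    # relative speaking-rate multiplier


def predict_prosody(context: Dict[str, float]) -> ProsodicFeatures:
    """Toy stand-in for the trained prosody prediction model.

    Assumes the context carries an 'excitement' score in [0, 1].
    """
    excitement = float(context.get("excitement", 0.0))
    return ProsodicFeatures(
        pitch=1.0 + 0.5 * excitement,
        energy=1.0 + excitement,
        rate=1.0 + 0.25 * excitement,
    )


def encode(speech_content: str, prosody: ProsodicFeatures) -> List[List[float]]:
    """Toy encoder: one vector per character, dependent on the prosody."""
    return [
        [ord(ch) / 128.0, prosody.pitch, prosody.energy, prosody.rate]
        for ch in speech_content
    ]


def decode(encoded: List[List[float]]) -> List[float]:
    """Toy decoder: emits one 'sample' per step, scaled by the energy value."""
    return [vec[0] * vec[2] for vec in encoded]


def generate_speech_audio(context: Dict[str, float], speech_content: str) -> List[float]:
    """Chains the three stages in the order given in the abstract."""
    prosody = predict_prosody(context)
    encoded = encode(speech_content, prosody)
    return decode(encoded)
```

Calling `generate_speech_audio({"excitement": 1.0}, "Goal")` and `generate_speech_audio({"excitement": 0.0}, "Goal")` yields outputs of the same length with different values, which is the sense in which the same speech content produces context-dependent output in this sketch.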
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method of generating context-dependent speech audio in a video game, the method comprising: enabling, by at least one processor of a computing device, gameplay of the video game; determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game, wherein the in-game event includes an action performed by a character of the video game; obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay; requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game; generating, by the speech audio generator responsive to the request, the context-dependent speech audio by: inputting the contextual information relating to the current state of the gameplay into a prosody prediction model, wherein the prosody prediction model comprises a trained machine learning model which is configured to generate predicted prosodic features based on the contextual information; generating, by the prosody prediction model, predicted prosodic features from the input contextual information; inputting, into a speech audio generation model, input data comprising: at least the predicted prosodic features; and the speech content data relating to the current state of the gameplay; generating, using one or more encoders of the speech audio generation model, an encoded representation of the speech content data dependent on the predicted prosodic features; decoding, using a decoder of the speech audio generation model, the encoded representation to generate the context-dependent speech audio; and causing, by the video game engine of the video game, the context-dependent speech audio that matches the current state of the video game to be played among the gameplay of the in-game event.
2. The computer-implemented method of claim 1, wherein the one or more encoders comprise a prosody encoder configured to generate an encoded representation of the predicted prosodic features, and a speech content encoder configured to generate the encoded representation of the speech content data based on the encoded representation of the predicted prosodic features.
3. The computer-implemented method of claim 1, wherein the video game is a sports video game, wherein obtaining the contextual information relating to the current state of the video game comprises determining contextual information relating to an in-progress match of the sports video game.
4. The computer-implemented method of claim 3, wherein the contextual information relating to the in-progress match of the sports video game comprises determining one or more of: statistics relating to one or more teams playing in the match; statistics relating to one or more players playing in the match; statistics relating to a current status of the match; and the type of sport being played in the match.
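As an illustration of the kinds of contextual information listed in claims 3 and 4, the sketch below groups team statistics, player statistics, match status, and the type of sport into one record and flattens it into a numeric vector that a prosody prediction model could consume. The claims do not specify any data layout; `MatchContext`, `SPORTS`, `to_features`, and the example statistic names are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical list of supported sports, used for one-hot encoding.
SPORTS = ["football", "basketball", "ice_hockey"]


@dataclass
class MatchContext:
    """Contextual information for an in-progress match (illustrative only)."""
    sport: str
    team_stats: Dict[str, float] = field(default_factory=dict)
    player_stats: Dict[str, float] = field(default_factory=dict)
    match_status: Dict[str, float] = field(default_factory=dict)

    def to_features(self) -> List[float]:
        """Flatten into a vector: one-hot sport, then stats in sorted key order."""
        one_hot = [1.0 if self.sport == s else 0.0 for s in SPORTS]
        stats: List[float] = []
        for group in (self.team_stats, self.player_stats, self.match_status):
            stats.extend(group[key] for key in sorted(group))
        return one_hot + stats
```

Sorting the keys keeps the feature order stable between calls, which matters if the vector is fed to a model that expects each position to carry the same statistic.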
5. The computer-implemented method of claim 1, wherein the contextual information includes the speech content data associated with the current state of the video game.
6. The computer-implemented method of claim 1, wherein the input data further comprises speaker identifier data for a speaker of the generated speech audio.
7. A non-transitory computer-readable medium containing instructions, which when executed by one or more processors, causes the one or more processors to perform a method comprising: enabling, by at least one processor of a computing device, gameplay of a video game; determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game, wherein the in-game event includes an action performed by a character of the video game; obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay; requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game; generating, by the speech audio generator responsive to the request, the context-dependent speech audio by: inputting the contextual information relating to the current state of the gameplay into a prosody prediction model, wherein the prosody prediction model comprises a trained machine learning model which is configured to generate predicted prosodic features based on the contextual information; generating, by the prosody prediction model, predicted prosodic features from the input contextual information; inputting, into a speech audio generation model, input data comprising: at least the predicted prosodic features; and the speech content data relating to the current state of the gameplay; generating, using one or more encoders of the speech audio generation model, an encoded representation of the speech content data dependent on the predicted prosodic features; decoding, using a decoder of the speech audio generation model, the encoded representation to generate the context-dependent speech audio; and causing, by the video game engine of the video game, the context-dependent speech audio that matches the current state of the video game to be played among the gameplay of the in-game event.
8. The non-transitory computer-readable medium of claim 7, wherein the speech audio generation model includes a synthesizer.
9. The non-transitory computer-readable medium of claim 8, wherein the speech content data comprises a plurality of speech content segments at a plurality of respective time steps and wherein inputting, into the speech audio generation model, the input data comprising the predicted prosodic features and the speech content data comprises generating, as output of a speech content encoder of the synthesizer, a speech content encoding for each time step of one or more time steps of the speech content data.
10. The non-transitory computer-readable medium of claim 9, wherein generating predicted prosodic features comprises generating predicted prosodic features for each time step of the one or more time steps of the speech content data.
11. The non-transitory computer-readable medium of claim 10, wherein inputting, into the speech audio generator, the input data comprising the predicted prosodic features and the speech content data comprises combining, for each time step of the one or more time steps, the speech content encoding and the predicted prosodic features of the time step.
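Claims 9 to 11 describe a speech content encoding and predicted prosodic features at each time step, combined per time step. The claims do not say how the two are combined; the sketch below assumes simple concatenation, which is only one possible choice (element-wise addition of same-sized vectors would be another). The function name `combine_per_time_step` is hypothetical.

```python
from typing import List, Sequence


def combine_per_time_step(
    content_encodings: Sequence[Sequence[float]],
    prosodic_features: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate, for each time step, the content encoding and prosodic features.

    Assumes both sequences are aligned, with one entry per time step.
    """
    if len(content_encodings) != len(prosodic_features):
        raise ValueError("expected one prosodic feature vector per time step")
    return [
        list(content) + list(prosody)
        for content, prosody in zip(content_encodings, prosodic_features)
    ]
```

For two time steps with two-dimensional content encodings and one prosodic value each, the result is two three-dimensional vectors, one per time step.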
12. A computer-implemented method of generating context-dependent speech audio in a video game, the method comprising: enabling, by at least one processor of a computing device, gameplay of the video game comprising requesting, by the at least one processor of the computing device, video game content from a video game server while a user is playing the video game; determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game; obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay; requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game; generating, by the speech audio generator responsive to the request, the context-dependent speech audio based upon processing the contextual information and speech content data relating to the current state of the gameplay by one or more machine learning models; and causing, by the video game engine of the video game, the context-dependent speech audio that matches the current state of the video game to be played among the gameplay of the in-game event.
13. The computer-implemented method of claim 12, wherein generating, by the speech audio generator responsive to the request, the context-dependent speech audio based upon processing the contextual information and speech content data relating to the current state of the gameplay by the one or more machine learning models comprises: generating, using the one or more machine learning models, predicted prosodic features based upon the contextual information.
14. The computer-implemented method of claim 13, wherein generating, by the speech audio generator responsive to the request, the context-dependent speech audio based upon processing the contextual information and speech content data relating to the current state of the gameplay by the one or more machine learning models comprises: generating, using the one or more machine learning models, the speech content data based upon the predicted prosodic features.
15. The computer-implemented method of claim 12, wherein the video game is a sports video game, wherein obtaining the contextual information relating to the current state of the video game comprises determining contextual information relating to an in-progress match of the sports video game.
16. The computer-implemented method of claim 15, wherein the contextual information relating to the in-progress match of the sports video game comprises determining one or more of: statistics relating to one or more teams playing in the match; statistics relating to one or more players playing in the match; statistics relating to a current status of the match; and the type of sport being played in the match.
17. The computer-implemented method of claim 12, wherein the contextual information includes the speech content data associated with the current state of the video game.
18. The computer-implemented method of claim 12, wherein an input data further comprises speaker identifier data for a speaker of the generated speech audio.
19. A non-transitory computer-readable medium containing instructions, which when executed by one or more processors, causes the one or more processors to perform a method comprising: enabling, by at least one processor of a computing device, gameplay of a video game comprising requesting, by the at least one processor of the computing device, video game content from a video game server while a user is playing the video game; determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game; obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay; requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game; generating, by the speech audio generator responsive to the request, the context-dependent speech audio based upon processing the contextual information and speech content data relating to the current state of the gameplay by one or more machine learning models; and causing, by the video game engine of the video game, the context-dependent speech audio that matches the current state of the video game to be played among the gameplay of the in-game event.
20. The non-transitory computer-readable medium of claim 19, wherein generating, by the speech audio generator responsive to the request, the context-dependent speech audio based upon processing the contextual information and speech content data relating to the current state of the gameplay by the one or more machine learning models comprises: generating, using the one or more machine learning models, predicted prosodic features based upon the contextual information.
January 9, 2024
May 13, 2025