US-7844463

Method and system for aligning natural and synthetic video to speech synthesis

Published: November 30, 2010

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

According to MPEG-4's TTS architecture, facial animation can be driven by two streams simultaneously: text and Facial Animation Parameters. A Text-to-Speech converter drives the mouth shapes of the face, while an encoder sends Facial Animation Parameters to the face. The text input can include codes, or bookmarks, placed between and inside words and transmitted to the Text-to-Speech converter. Each bookmark carries an encoder time stamp. Due to the nature of text-to-speech conversion, the encoder time stamp does not relate to real-world time and should be interpreted as a counter. The Facial Animation Parameter stream carries the same encoder time stamp found in the bookmark of the text. The system reads the bookmark and provides the encoder time stamp and a real-time time stamp. The facial animation system associates the correct facial animation parameter with the real-time time stamp using the encoder time stamp of the bookmark as a reference.
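The alignment described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent or from MPEG-4: the `<ETS:n>` bookmark syntax, the fixed per-character speaking rate that stands in for a real Text-to-Speech engine, and the function names `synthesize` and `schedule_faps`. The sketch only shows the idea that the encoder time stamp acts as a counter linking a bookmark in the text to an entry in the Facial Animation Parameter stream, while the real-time time stamp becomes known only during synthesis.

```python
import re

# Hypothetical bookmark syntax: "<ETS:n>", where n is the encoder time stamp.
# The encoder time stamp is treated purely as a counter, not as a clock value.
BOOKMARK = re.compile(r"<ETS:(\d+)>")


def synthesize(text, ms_per_char=60):
    """Stand-in for a TTS engine (assumed constant speaking rate).

    Returns the text with bookmarks removed and a mapping from each
    encoder time stamp to the real-time stamp (in milliseconds) at which
    the bookmark is reached during playback.
    """
    clean_parts = []
    ets_to_rts = {}
    clock_ms = 0
    pos = 0
    for match in BOOKMARK.finditer(text):
        chunk = text[pos:match.start()]
        clean_parts.append(chunk)
        clock_ms += len(chunk) * ms_per_char
        ets_to_rts[int(match.group(1))] = clock_ms
        pos = match.end()
    clean_parts.append(text[pos:])
    return "".join(clean_parts), ets_to_rts


def schedule_faps(fap_stream, ets_to_rts):
    """Attach a real-time stamp to each (encoder_time_stamp, fap) pair.

    Entries whose encoder time stamp has no matching bookmark are skipped.
    """
    return sorted(
        (ets_to_rts[ets], fap) for ets, fap in fap_stream if ets in ets_to_rts
    )


spoken, ets_to_rts = synthesize("Hello <ETS:1>world, <ETS:2>again")
timeline = schedule_faps([(2, "smile"), (1, "raise_eyebrows")], ets_to_rts)
```

With these assumed inputs, `timeline` orders the facial animation parameters by the playback time of their bookmarks, regardless of the order in which they arrived in the parameter stream.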

Patent Claims

15 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of aligning video with audio, the method comprising: identifying a predetermined code associated with an animation mimic in a first stream, wherein the predetermined code comprises an escape sequence followed by a plurality of bits, which define one of a set of possible animation mimics; and transmitting the predetermined code within a second stream to thereby synchronize the second stream with the first stream.
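As an illustration of a code built from an escape sequence followed by a fixed number of bits, the following sketch encodes one of a small set of animation mimics and recovers it from a text stream. The escape character, the 4-bit field width, the mimic names, and the choice to write the bits as `0`/`1` characters are all hypothetical choices for readability; the claim does not specify any of them.

```python
ESCAPE = "\x1b"      # hypothetical escape sequence (ASCII ESC)
CODE_BITS = 4        # hypothetical field width: up to 16 mimics
MIMICS = ["neutral", "smile", "frown", "surprise"]  # hypothetical mimic set


def encode_mimic(name):
    """Build the code: escape sequence followed by CODE_BITS bits."""
    index = MIMICS.index(name)
    return ESCAPE + format(index, f"0{CODE_BITS}b")


def decode_stream(text):
    """Split a text stream into plain text and mimic events.

    Each event is (character offset in the plain text, mimic name).
    """
    plain = []
    events = []
    i = 0
    while i < len(text):
        if text[i] == ESCAPE:
            bits = text[i + 1:i + 1 + CODE_BITS]
            events.append((len(plain), MIMICS[int(bits, 2)]))
            i += 1 + CODE_BITS
        else:
            plain.append(text[i])
            i += 1
    return "".join(plain), events


stream = "Good " + encode_mimic("smile") + "morning"
plain_text, events = decode_stream(stream)
```

A fixed-width bit field keeps decoding trivial, since the reader always knows how many symbols follow the escape sequence.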

2. The method of claim 1, wherein the first stream is an animation mimics stream and the second stream is a text stream.

3. The method of claim 2, further comprising encoding the first stream containing the animation mimic and the text stream containing the predetermined code.

4. The method of claim 1, wherein the animation mimic is a facial mimic.

5. The method of claim 1, further comprising placing the predetermined code in between words in the second stream.
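Placing a code between words can be sketched as a simple token insertion. The function name `place_code_between_words`, the space-delimited notion of a word, and the example code string are assumptions for illustration only.

```python
def place_code_between_words(text, code, after_word):
    """Insert `code` as its own token after the first `after_word` words.

    Words are assumed to be separated by single spaces. The code must land
    strictly between two words, so positions 0 and len(words) are rejected.
    """
    words = text.split(" ")
    if not 0 < after_word < len(words):
        raise ValueError("code must be placed between two words")
    return " ".join(words[:after_word] + [code] + words[after_word:])


# "\x1b0001" is a hypothetical code: an escape character followed by four bits.
marked = place_code_between_words("Good morning everyone", "\x1b0001", 1)
```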

6. A system for aligning video with audio, the system comprising: a processor; a module configured to control the processor to identify a predetermined code associated with an animation mimic in a first stream, wherein the predetermined code comprises an escape sequence followed by a plurality of bits, which define one of a set of possible animation mimics; and a module configured to control the processor to transmit the predetermined code within a second stream to thereby synchronize the second stream with the first stream.

7. The system for claim 6, wherein the first stream is an animation mimics stream and the second stream is a text stream.

8. The system for claim 7, further comprising a module configured to control the processor to encode the first stream containing the animation mimic and the text stream containing the predetermined code.

9. The system for claim 6, wherein the animation mimic is a facial mimic.

10. The system for claim 6, further comprising a module configured to control the processor to place the predetermined code in between words in the second stream.

11. A computer-readable medium storing instructions for controlling a computing device to align a video with audio, the instructions comprising: identifying a predetermined code associated with an animation mimic in a first stream, wherein the predetermined code comprises an escape sequence followed by a plurality of bits, which define one of a set of possible animation mimics; and transmitting the predetermined code within a second stream to thereby synchronize the second stream with the first stream.

12. The computer-readable medium of claim 11, wherein the first stream is an animation mimics stream and the second stream is a text stream.

13. The computer-readable medium of claim 12, further comprising encoding the first stream containing the animation mimic and the text stream containing the predetermined code.

14. The computer-readable medium of claim 11, wherein the animation mimic is a facial mimic.

15. The computer-readable medium of claim 11, further comprising placing the predetermined code in between words in the second stream.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06T G10L H04N

Patent Metadata

Filing Date

August 18, 2008

Publication Date

November 30, 2010
