Synchronization of Lip Movement Images to Audio Voice Signal

Published: April 8, 2025

Assignee: not available in USPTO data

Inventors: Kyrylo Sydorchuk, Volodymyr Cherniavskyi, Stanislav Mihailevschii, Oleh Vallas, Ivan Shuhaienko, Daniil Krasylnikov, Yurii Astafiev

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising: acquiring, by a computing device, a source video; dividing, by the computing device, the source video into a set of image frames and a set of audio frames; generating, by the computing device, a vector database based on the set of image frames and the set of audio frames, wherein a vector of the vector database includes a face vector and an audio vector, the face vector being determined based on an image frame of the set of image frames and the audio vector being determined based [on] an audio frame of the set of audio frames, the audio frame corresponding to the image frame; receiving, by the computing device, a target image frame and a target audio frame, the target audio frame being selected from a target audio record, wherein the source video is a first record and the target audio record is a second record, the second record being different from the first record; determining, by the computing device, a target image vector based on the target image frame and a target audio vector based on the target audio frame; searching, by the computing device, the vector database to select a pre-determined number of vectors corresponding to the target image vector and the target audio frame; and generating, by the computing device and based on the pre-determined number of vectors, an output image frame of an output video, the output video being a third record, the third record being different from the first record.
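As an illustration only, the database-and-search steps of claim 1 might be sketched as follows. This is a minimal sketch, not the claimed implementation: it assumes the face and audio vectors have already been produced by some encoder (omitted here), and the names `Entry`, `build_database`, and `search`, as well as the placeholder score (a sum of Euclidean distances), are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    """One database vector: a face vector paired with its audio vector."""
    frame_index: int
    face: list
    audio: list


def build_database(face_vectors, audio_vectors):
    """Pair the i-th face vector with the i-th (time-aligned) audio vector."""
    if len(face_vectors) != len(audio_vectors):
        raise ValueError("each image frame needs a corresponding audio frame")
    return [
        Entry(i, list(face), list(audio))
        for i, (face, audio) in enumerate(zip(face_vectors, audio_vectors))
    ]


def search(database, target_face, target_audio, k):
    """Return the k entries with the lowest score against the target.

    Assumption: the score is a plain sum of Euclidean distances; the claims
    do not fix a particular scoring function at this step.
    """
    def score(entry):
        return math.dist(entry.face, target_face) + math.dist(entry.audio, target_audio)

    return sorted(database, key=score)[:k]


# Toy data standing in for encoder outputs of three source-video frames.
db = build_database(
    face_vectors=[[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    audio_vectors=[[0.0], [1.0], [5.0]],
)
nearest = search(db, target_face=[0.9, 0.1], target_audio=[1.0], k=2)
print([entry.frame_index for entry in nearest])  # [1, 0]
```

The selected entries would then be passed to whatever generator produces the output image frame; that step is outside this sketch.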

2. The method of claim 1, wherein the acquiring the source video includes capturing, by the computing device, a video featuring a user.

3. The method of claim 1, wherein the face vector and the target image vector are generated by a vocabulary encoder including a pre-trained neural network.

4. The method of claim 1, wherein the face vector includes an angle of a rotation of a face in the image frame around an axis.

5. The method of claim 1, wherein the audio vector and the target audio vector are generated by a speech encoder including a pre-trained neural network.

6. The method of claim 1, wherein the selection of the pre-determined number of vectors includes: determining a first metric based on the target image vector and the face vector; determining a second metric based on the target audio vector and the audio vector; and combining the first metric and the second metric into a third metric; and determining that the third metric is below a predetermined threshold.

7. The method of claim 6, wherein: the first metric includes a distance between the target image vector and the face vector; and the second metric includes a scaled dot product of the target audio vector and the audio vector.
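One way to read claims 6 and 7 together is sketched below, under stated assumptions. The claims name a distance and a scaled dot product but do not say how they are combined or scaled. Here the dot product is scaled by 1/sqrt(dimension), and the similarity is subtracted from the distance so that a lower combined value means a closer match. Both choices, along with the function names and the `audio_weight` parameter, are hypothetical.

```python
import math


def face_distance(target_face, face):
    """First metric: Euclidean distance between the two face vectors."""
    return math.dist(target_face, face)


def scaled_dot(target_audio, audio):
    """Second metric: dot product scaled by 1/sqrt(dimension) (assumed scaling)."""
    dot = sum(t * a for t, a in zip(target_audio, audio))
    return dot / math.sqrt(len(audio))


def combined_metric(target_face, face, target_audio, audio, audio_weight=1.0):
    """Third metric (assumed form): distance minus weighted similarity.

    A dot product grows as vectors align while a distance shrinks, so the
    similarity is subtracted to keep "lower is better" for the threshold test.
    """
    return face_distance(target_face, face) - audio_weight * scaled_dot(target_audio, audio)


def is_match(metric, threshold):
    """Claim 6's final step: the combined metric is below a predetermined threshold."""
    return metric < threshold


metric = combined_metric(
    target_face=[0.0, 0.0], face=[3.0, 4.0],                      # distance 5.0
    target_audio=[1.0, 1.0, 1.0, 1.0], audio=[2.0, 2.0, 2.0, 2.0],  # 8 / sqrt(4) = 4.0
)
print(metric)                 # 1.0
print(is_match(metric, 2.0))  # True
```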

8. The method of claim 1, further comprising, prior to generating the output image frame, extracting style information from the set of image frames, wherein: the style information indicates a presence or an absence of an emotional expression in a face in the image frames of the set of image frames; and the output image frame is generated based on the style information.

9. The method of claim 8, wherein the output image frame is generated by a decoder including a pre-trained neural network.

10. The method of claim 8, wherein the target image frame is selected from the set of image frames.

11. A computing device comprising: a processor; and a memory storing instructions that, when executed by the processor, configure the computing device to: acquire, by the computing device, a source video; divide, by the computing device, the source video into a set of image frames and a set of audio frames; generate, by the computing device, a vector database based on the set of image frames and the set of audio frames, wherein a vector of the vector database includes a face vector and an audio vector, the face vector being determined based on an image frame of the set of image frames and the audio vector being determined based [on] an audio frame of the set of audio frames, the audio frame corresponding to the image frame; receive, by the computing device, a target image frame and a target audio frame, the target audio frame being selected from a target audio record, wherein the source video is a first record and the target audio record is a second record, the second record being different from the first record; determine, by the computing device, a target image vector based on the target image frame and a target audio vector based on the target audio frame; search, by the computing device, the vector database to select a pre-determined number of vectors corresponding to the target image vector and the target audio frame; and generate, by the computing device and based on the pre-determined number of vectors, an output image frame of an output video, the output video being a third record, the third record being different from the first record.

12. The computing device of claim 11, wherein the acquiring the source video includes capturing, by the computing device, a video featuring a user.

13. The computing device of claim 11, wherein the face vector and the target image vector are generated by a vocabulary encoder including a pre-trained neural network.

14. The computing device of claim 11, wherein the face vector includes an angle of a rotation of a face in the image frame around an axis.

15. The computing device of claim 11, wherein the audio vector and the target audio vector are generated by a speech encoder including a pre-trained neural network.

16. The computing device of claim 11, wherein the selection of the pre-determined number of vectors includes: determining a first metric based on the target image vector and the face vector; determining a second metric based on the target audio vector and the audio vector; and combining the first metric and the second metric into a third metric; and determining that the third metric is below a predetermined threshold.

17. The computing device of claim 16, wherein: the first metric includes a distance between the target image vector and the face vector; and the second metric includes a scaled dot product of the target audio vector and the audio vector.

18. The computing device of claim 11, wherein the instructions further configure the computing device to, prior to generating the output image frame, extract style information from the set of image frames, wherein: the style information indicates a presence or an absence of an emotional expression in a face in the image frames of the set of image frames; and the output image frame is generated based on the style information.

19. The computing device of claim 18, wherein the output image frame is generated by a decoder including a pre-trained neural network.

20. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that, when executed by a computing device, cause the computing device to: acquire a source video; divide the source video into a set of image frames and a set of audio frames; generate a vector database based on the set of image frames and the set of audio frames, wherein a vector of the vector database includes a face vector and an audio vector, the face vector being determined based on an image frame of the set of image frames and the audio vector being determined based [on] an audio frame of the set of audio frames, the audio frame corresponding to the image frame; receive a target image frame and a target audio frame, the target audio frame being selected from a target audio record, wherein the source video is a first record and the target audio record is a second record, the second record being different from the first record; determine a target image vector based on the target image frame and a target audio vector based on the target audio frame; search the vector database to select a pre-determined number of vectors corresponding to the target image vector and the target audio frame; and generate, based on the pre-determined number of vectors, an output image frame of an output video, the output video being a third record, the third record being different from the first record.

Patent Metadata

Filing Date: Unknown

Publication Date: April 8, 2025

Inventors: Kyrylo Sydorchuk, Volodymyr Cherniavskyi, Stanislav Mihailevschii, Oleh Vallas, Ivan Shuhaienko, Daniil Krasylnikov, Yurii Astafiev
