System and Method for Detecting Fabricated Videos

Published: June 3, 2025

Assignee: not available in the USPTO data we have

Inventors: Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

Technical Abstract

Patent Claims

20 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. An apparatus comprising: a first feature extraction module to receive visual content of a video and produce facial features therefrom, the facial features including facial modalities and facial affective cues, including facial emotions; a second feature extraction module to receive audio content of the video and produce speech features therefrom, the speech features including speech modalities and speech affective cues, including speech emotions; a neural network including: a first network responsive to the facial modalities to produce a facial modality embedding of the facial modalities; a second network responsive to the speech modalities to produce a speech modality embedding of the speech modalities; a third network responsive to the facial affective cues to produce an embedding of the facial affective cues, including a facial emotion embedding of the facial emotions; and a fourth network responsive to the speech affective cues to produce an embedding of the speech affective cues, including a speech emotion embedding of the speech emotions; a comparison module to determine a first measure of a similarity between the facial modality embedding and the speech modality embedding and further to determine a second measure of a similarity between the embedding of the facial affective cues and the embedding of the speech affective cues, where the first measure of the similarity comprises a first distance between the facial modality embedding and the speech modality embedding and the second measure of the similarity comprises a second distance between the embedding of the facial affective cues, including the facial emotion embedding of the facial emotions, and the embedding of the speech affective cues, including the speech emotion embedding of the speech emotions; and a classification module to determine the video to be real or fake dependent upon the first and second measures of similarity, where the classification module is configured to classify the video as fake when a sum of the first distance and the second distance exceeds a threshold distance and to classify the video as real when the sum of the first distance and the second distance does not exceed the threshold distance.
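The decision rule at the end of claim 1 can be illustrated with a minimal sketch. It assumes the four embeddings have already been computed and that "distance" means Euclidean distance; the claim does not name a metric, and the function name `classify_from_embeddings` is hypothetical.

```python
import math
from typing import Sequence


def classify_from_embeddings(
    face_modality: Sequence[float],
    speech_modality: Sequence[float],
    face_affect: Sequence[float],
    speech_affect: Sequence[float],
    threshold: float,
) -> str:
    """Return "fake" when the summed distance exceeds the threshold, else "real".

    Euclidean distance is an assumption; the claim only requires "a distance".
    """
    first_distance = math.dist(face_modality, speech_modality)
    second_distance = math.dist(face_affect, speech_affect)
    # A sum equal to the threshold does not "exceed" it, so it counts as real.
    return "fake" if first_distance + second_distance > threshold else "real"


if __name__ == "__main__":
    print(classify_from_embeddings([0, 0], [3, 4], [0], [1], threshold=5.0))
```

In this example the first distance is 5 and the second is 1, so the sum of 6 exceeds the threshold of 5 and the video is labelled fake.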

2. The apparatus of claim 1, where the third network includes a memory fusion network and the fourth network includes a memory fusion network.

3. The apparatus of claim 1, where the facial and speech affective cues are determined emotions, including one or more of ‘happy’, ‘sad’, ‘angry’, ‘fearful’, ‘surprise’, ‘disgust’, and ‘neutral’ emotions.

4. The apparatus of claim 1, where the first and second networks include one or more of: a two-dimensional convolution layer; a max-pooling layer; a fully-connected layer; and a normalization layer.

5. The apparatus of claim 1, where the first and second networks further include rectified linear unit (ReLU) activation functions implemented between layers of the networks.
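The layer types named in claims 4 and 5 can be sketched in plain Python as a single forward pass. This is only an illustration under stated assumptions: one input channel, one convolution kernel, non-overlapping pooling, and L2 normalization standing in for the "normalization layer" (the claims do not say which normalization is used). All function names, kernels, and weights here are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def conv2d(x: Matrix, k: Matrix) -> Matrix:
    """Valid (no padding), stride-1 2D cross-correlation of x with kernel k."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]


def relu(x: Matrix) -> Matrix:
    """Rectified linear unit applied element-wise."""
    return [[max(v, 0.0) for v in row] for row in x]


def max_pool(x: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; partial windows at the edges are dropped."""
    return [[max(x[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(x[0]) - size + 1, size)]
            for i in range(0, len(x) - size + 1, size)]


def fully_connected(v: List[float], w: Matrix, bias: List[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + b for row, b in zip(w, bias)]


def l2_normalize(v: List[float], eps: float = 1e-12) -> List[float]:
    """Scale v to unit length (assumed reading of "normalization layer")."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]


def embed(x: Matrix, kernel: Matrix, w: Matrix, bias: List[float]) -> List[float]:
    """Convolution, ReLU, max-pooling, fully connected layer, then normalization."""
    feature_map = max_pool(relu(conv2d(x, kernel)))
    flat = [v for row in feature_map for v in row]
    return l2_normalize(fully_connected(flat, w, bias))
```

A real modality network would stack several such layers with learned parameters; the sketch only shows how the listed layer types compose into an embedding.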

6. The apparatus of claim 1, where the facial features include one or more of two-dimensional landmark positions, head pose orientation, and gaze.

7. The apparatus of claim 1, where the speech features include Mel frequency cepstral coefficients.

8. A computer-implemented method for classifying a video, the method comprising: obtaining facial features extracted from visual content of the video, the facial features including facial modalities and facial emotions; obtaining speech features extracted from audio content of the video, the speech features including speech modalities and speech emotions; passing the facial modalities through a first neural network (F1) to generate a facial modality embedding (mf) of the facial modalities; passing the speech modalities through a second neural network (S1) to generate a speech modality embedding (ms) of the speech modalities; passing the facial emotions through a third neural network (F2) to generate a facial emotion embedding (ef) of the facial emotions; passing the speech emotions through a fourth neural network (S2) to generate a speech emotion embedding (es) of the speech emotions; generating a first distance d1(mf, ms) between the facial modality embedding and the speech modality embedding; generating a second distance d2(ef, es) between the facial emotion embedding and the speech emotion embedding; classifying the video as fake when a sum of the first distance and the second distance exceeds a threshold distance; and classifying the video as real when the sum of the first distance and the second distance does not exceed the threshold distance.
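The steps of claim 8 can be wired together as in the sketch below. The four networks F1, S1, F2, and S2 are represented as arbitrary callables, since the claim does not fix their internals; the class name `FakeVideoDetector`, its field names, and the use of Euclidean distance for d1 and d2 are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]
Network = Callable[[Vector], List[float]]


@dataclass
class FakeVideoDetector:
    f1: Network  # facial modalities -> facial modality embedding (mf)
    s1: Network  # speech modalities -> speech modality embedding (ms)
    f2: Network  # facial emotions   -> facial emotion embedding (ef)
    s2: Network  # speech emotions   -> speech emotion embedding (es)
    threshold: float

    def score(self, face_mod: Vector, speech_mod: Vector,
              face_emo: Vector, speech_emo: Vector) -> float:
        """Return d1(mf, ms) + d2(ef, es), with Euclidean distance assumed."""
        d1 = math.dist(self.f1(face_mod), self.s1(speech_mod))
        d2 = math.dist(self.f2(face_emo), self.s2(speech_emo))
        return d1 + d2

    def classify(self, face_mod: Vector, speech_mod: Vector,
                 face_emo: Vector, speech_emo: Vector) -> str:
        """Label the video fake only when the summed distance exceeds the threshold."""
        total = self.score(face_mod, speech_mod, face_emo, speech_emo)
        return "fake" if total > self.threshold else "real"


if __name__ == "__main__":
    # Stand-in networks: identity mappings instead of trained models.
    detector = FakeVideoDetector(f1=list, s1=list, f2=list, s2=list, threshold=1.0)
    print(detector.classify([0.0, 0.0], [3.0, 4.0], [1.0], [1.0]))
```

Passing the networks in as callables keeps the distance-and-threshold logic separate from however the embeddings are produced.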

9. The computer-implemented method of claim 8, where the facial modalities include at least one of two-dimensional landmark positions, head pose orientation, and gaze.

10. The computer-implemented method of claim 8, where the speech modalities include Mel frequency cepstral coefficients.

11. The computer-implemented method of claim 8, further comprising training the first, second, third, and fourth neural networks using real and fake videos to maximize the distance between a facial and a speech embedding of real and fake videos and minimize the distance between a facial and a speech embedding of real videos.

12. The computer-implemented method of claim 8, where the facial and speech emotions include one or more of ‘happy’, ‘sad’, ‘angry’, ‘fearful’, ‘surprise’, ‘disgust’, and ‘neutral’ emotions.

13. The computer-implemented method of claim 8, further comprising training the first, second, third, and fourth neural networks, including: generating, from a real video of a first subject: a real facial modality embedding (mrealf); a real speech modality embedding (mreals); a real facial emotion embedding (erealf); and a real speech emotion embedding (ereals); generating, from a fake video of the first subject: a fake facial modality embedding (mfakef); a fake facial emotion embedding (efakef); and a fake speech emotion embedding (efakes); determining a first similarity loss ρ1=max(L1+m1, 0), where: ‘max’ denotes a maximum value; m1 is a margin value; L1=d(mreals,mrealf)−d(mreals,mfakef) is a first similarity score; d(mreals,mrealf) is a distance between the real speech embedding (mreals) and the real facial embedding (mrealf); and d(mreals,mfakef) is a distance between the real speech embedding (mreals) and the fake facial embedding (mfakef); determining a second similarity loss ρ2=max(L2+m2, 0), where: m2 is a margin value; L2=d(ereals,efakes)−d(ereals,efakef) is a second similarity score; d(ereals,efakes) is a distance between the real speech emotion embedding (ereals) and the fake speech emotion embedding (efakes); and d(ereals,efakef) is a distance between the real speech emotion embedding (ereals) and the fake facial emotion embedding (efakef); and adjusting the first, second, third, and fourth neural networks dependent upon a sum (L) of the first similarity loss and the second similarity loss.
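The loss recited in claim 13 can be worked through numerically. The sketch below assumes Euclidean distance for d and uses hypothetical argument names that mirror the claim's symbols (for example, `m_real_s` for mreals); it computes only the loss value, not the network update.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def similarity_loss(score: float, margin: float) -> float:
    """rho = max(L + m, 0): zero once the score is below the negative margin."""
    return max(score + margin, 0.0)


def claim13_loss(m_real_s: Vector, m_real_f: Vector, m_fake_f: Vector,
                 e_real_s: Vector, e_fake_s: Vector, e_fake_f: Vector,
                 m1: float, m2: float) -> float:
    """Sum of the two similarity losses, with Euclidean distance assumed for d."""
    # L1 = d(mreals, mrealf) - d(mreals, mfakef)
    l1 = math.dist(m_real_s, m_real_f) - math.dist(m_real_s, m_fake_f)
    # L2 = d(ereals, efakes) - d(ereals, efakef)
    l2 = math.dist(e_real_s, e_fake_s) - math.dist(e_real_s, e_fake_f)
    return similarity_loss(l1, m1) + similarity_loss(l2, m2)


if __name__ == "__main__":
    loss = claim13_loss(
        m_real_s=[0.0, 0.0], m_real_f=[0.0, 1.0], m_fake_f=[3.0, 4.0],
        e_real_s=[0.0], e_fake_s=[2.0], e_fake_f=[1.0],
        m1=1.0, m2=0.5,
    )
    print(loss)
```

With these inputs L1 = 1 − 5 = −4, so ρ1 = max(−4 + 1, 0) = 0, and L2 = 2 − 1 = 1, so ρ2 = max(1 + 0.5, 0) = 1.5; the summed loss is 1.5.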

14. The computer-implemented method of claim 8, further comprising training the first, second, third, and fourth neural networks, including: generating, from a real video of a first subject: a real facial modality embedding (mrealf); a real speech modality embedding (mreals); a real facial emotion embedding (erealf); and a real speech emotion embedding (ereals); generating, from a fake video of the first subject: a fake speech modality embedding (mfakes); a fake facial emotion embedding (efakef); and a fake speech emotion embedding (efakes); determining a first similarity loss ρ1=max(L1+m1, 0), where: ‘max’ denotes a maximum value; m1 is a margin value; L1=d(mrealf,mreals)−d(mrealf,mfakes) is a first similarity score; d(mrealf,mreals) is a distance between the real facial embedding (mrealf) and the real speech embedding (mreals); and d(mrealf,mfakes) is a distance between the real facial embedding (mrealf) and the fake speech embedding (mfakes); determining a second similarity loss ρ2=max(L2+m2, 0), where: m2 is a margin value; L2=d(erealf,efakef)−d(erealf,efakes) is a second similarity score; d(erealf,efakef) is a distance between the real facial emotion embedding (erealf) and the fake facial emotion embedding (efakef); and d(erealf,efakes) is a distance between the real facial emotion embedding (erealf) and the fake speech emotion embedding (efakes); and adjusting the first, second, third, and fourth neural networks dependent upon a sum (L) of the first similarity loss and the second similarity loss.

15. The computer-implemented method of claim 8, where: passing the facial features through the first neural network (F1) to generate the facial embedding of the facial features includes passing the facial features through one or more of: a two-dimensional convolution layer, a max-pooling layer, a fully-connected layer, and a normalization layer; and passing the speech features through the second neural network (S1) to generate the speech embedding of the speech features includes passing the speech features through one or more of: a two-dimensional convolution layer, a max-pooling layer, a fully-connected layer, and a normalization layer.

16. A computer-implemented method for training a neural network to classify a video as real or fake, the method comprising: for a real video of a first subject: obtaining facial features extracted from visual content of the real video, the facial features including facial modalities and facial emotions; obtaining speech features extracted from audio content of the real video, the speech features including speech modalities and speech emotions; passing the facial modalities through a first network of the neural network to produce a real facial modality embedding (mrealf); passing the speech modalities through a second network of the neural network to produce a real speech modality embedding (mreals); passing the facial emotions through a third network of the neural network to produce a real facial emotion embedding (erealf); and passing the speech emotions through a fourth network of the neural network to produce a real speech emotion embedding (ereals); for a fake video of the first subject: extracting facial features from visual content of the fake video, the facial features including facial modalities and facial emotions; extracting speech features from audio content of the fake video, the speech features including speech modalities and speech emotions; passing the facial modalities through the first network to produce a fake facial modality embedding (mfakef); passing the speech modalities through the second network to produce a fake speech modality embedding (mfakes); passing the facial emotions through the third network to produce a fake facial emotion embedding (efakef); and passing the speech emotions through the fourth network to produce a fake speech emotion embedding (efakes); determining a first similarity loss ρ1=max(L1+m1, 0), where: ‘max’ denotes a maximum value; m1 is a margin value; and L1 is a first similarity score for the speech and facial modality embeddings; determining a second similarity loss ρ2=max(L2+m2, 0), where: m2 is a margin value; and L2 is a second similarity score for the speech and facial emotion embeddings; and adjusting the first, second, third, and fourth neural networks dependent upon a sum (L) of the first similarity loss and the second similarity loss, where, when the audio content has been modified more than the visual content: L1=d(mrealf,mreals)−d(mrealf,mfakes), where d(x, y) denotes a distance between arguments x and y; and L2=d(erealf,efakef)−d(erealf,efakes), and when the visual content has been modified more than the audio content: L1=d(mrealf,mreals)−d(mreals,mfakef); and L2=d(ereals,efakes)−d(ereals,efakef).
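Claim 16 switches between two pairs of similarity scores depending on which modality was modified more. A minimal sketch of that branching follows, assuming Euclidean distance for d(x, y); the `Embeddings` container, its field names, and the boolean flag `audio_modified_more` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Embeddings:
    m_f: Sequence[float]  # facial modality embedding
    m_s: Sequence[float]  # speech modality embedding
    e_f: Sequence[float]  # facial emotion embedding
    e_s: Sequence[float]  # speech emotion embedding


def similarity_scores(real: Embeddings, fake: Embeddings,
                      audio_modified_more: bool) -> Tuple[float, float]:
    """Return (L1, L2) using the formula pair that claim 16 assigns to each case."""
    d = math.dist  # Euclidean distance is an assumption, not stated in the claim
    if audio_modified_more:
        l1 = d(real.m_f, real.m_s) - d(real.m_f, fake.m_s)
        l2 = d(real.e_f, fake.e_f) - d(real.e_f, fake.e_s)
    else:
        l1 = d(real.m_f, real.m_s) - d(real.m_s, fake.m_f)
        l2 = d(real.e_s, fake.e_s) - d(real.e_s, fake.e_f)
    return l1, l2


def total_loss(real: Embeddings, fake: Embeddings, audio_modified_more: bool,
               m1: float, m2: float) -> float:
    """L = max(L1 + m1, 0) + max(L2 + m2, 0)."""
    l1, l2 = similarity_scores(real, fake, audio_modified_more)
    return max(l1 + m1, 0.0) + max(l2 + m2, 0.0)
```

In an actual training loop this scalar would be minimized by whatever optimizer adjusts the four networks; the claim does not specify one, so none is shown.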

17. The computer-implemented method of claim 16, further comprising determining whether the audio content has been modified more than the visual content or the visual content has been modified more than the audio content, including: comparing the facial features of the real video to the facial features of the fake video; and comparing the speech features of the real video to the speech features of the fake video.
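Claim 17 only says that the real and fake features are compared; it does not say how. One possible reading, offered purely as an assumption, is to measure the relative change of each feature vector and pick the modality that changed more. The function names, the relative-change measure, and the tie-breaking rule (ties count as "visual") are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def relative_change(real: Vector, fake: Vector, eps: float = 1e-12) -> float:
    """Euclidean change from real to fake, scaled by the size of the real vector."""
    origin = [0.0] * len(real)
    return math.dist(real, fake) / max(math.dist(real, origin), eps)


def more_modified_modality(real_face: Vector, fake_face: Vector,
                           real_speech: Vector, fake_speech: Vector) -> str:
    """Return "audio" or "visual" depending on which features changed more."""
    face_change = relative_change(real_face, fake_face)
    speech_change = relative_change(real_speech, fake_speech)
    # Tie-breaking toward "visual" is an arbitrary choice for this sketch.
    return "audio" if speech_change > face_change else "visual"


if __name__ == "__main__":
    print(more_modified_modality([1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 0.0]))
```

Scaling by the size of the real vector is meant to keep facial and speech features, which may live on different numeric ranges, roughly comparable.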

18. The computer-implemented method of claim 16, where the facial and speech emotions include one or more of ‘happy’, ‘sad’, ‘angry’, ‘fearful’, ‘surprise’, ‘disgust’ and ‘neutral’ emotions.

19. The computer-implemented method of claim 16, where: passing the facial features through the first network includes passing the facial features through one or more of: a two-dimensional convolution layer, a max-pooling layer, a fully-connected layer, and a normalization layer; and where passing the speech features through the second network includes passing the speech features through one or more of: a two-dimensional convolution layer, a max-pooling layer, a fully-connected layer, and a normalization layer.

20. The apparatus of claim 1, where the first, second, third, and fourth networks are trained using real and fake videos to maximize the distance between a facial and a speech embedding of real and fake videos and minimize the distance between a facial and a speech embedding of real videos.

Patent Metadata

Filing Date: Unknown

Publication Date: June 3, 2025

Inventors: Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha
