Two-branch Recurrent Network for Isolating Deepfakes in Videos

August 28, 2020


Visual disinformation has expanded dramatically on social media and the Internet. Today's fake videos are often highly convincing, giving the impression that the swapped subject is the person actually appearing in the video. Thanks to recent advances in data synthesis with Generative Adversarial Networks (GANs), Deep Convolutional Neural Networks (DCNNs), and AutoEncoders (AEs), hyper-realistic face swapping in videos has become effective and efficient, within reach of non-experts through a few clicks in customized desktop or even mobile applications.

Deepfakes began as a way to entertain people, but they were quickly repurposed to spread political disinformation, revenge porn, and defamation. For these reasons, the widespread distribution of deepfakes on the Internet has become a threat to society, fostering the belief that seeing is no longer believing. We propose a deep learning architecture that detects hyper-realistic face manipulations in order to curb the spread of manipulated videos.

Unlike face recognition, the community has long lacked large-scale face forensics datasets for both training and evaluation. Moreover, earlier face manipulation detection methods were assessed primarily on still images rather than videos. While image forensics has been studied extensively for a long time, deepfakes are a relatively recent technique, and numerous orthogonal works have been proposed lately to detect face manipulations. One closely related line of recent work detects fully GAN-generated facial images with the goal of modeling GAN fingerprints.


We demonstrate a method for video-based deepfake detection that uses a recurrent model to analyze aligned face sequences, built on a two-branch backbone with a loss function designed to isolate manipulated face sequences.

Unlike current methods that extract spatial frequencies as a preprocessing step, our two-branch structure is based on densely connected layers: one branch propagates the original information in the color domain, while the other suppresses the face content and amplifies multi-band frequencies using a Laplacian of Gaussian (LoG) as a bottleneck layer.
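To make the frequency branch concrete, the sketch below builds a discrete Laplacian of Gaussian kernel and applies it to an image with a naive convolution. Because the kernel is forced to sum to zero, flat (low-frequency) face content produces no response while edges and mid/high-band detail are amplified, which is the suppression/amplification behavior described above. All names, the kernel size, and sigma are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def log_kernel(size=9, sigma=1.4):
    """Discrete Laplacian-of-Gaussian kernel (illustrative size/sigma).
    Subtracting the mean enforces a zero DC response, so smooth
    content is suppressed and band frequencies pass through."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    k = (r2 - 2 * sigma ** 2) / sigma ** 4 * np.exp(-r2 / (2 * sigma ** 2))
    return k - k.mean()

def conv2d(img, kernel):
    """Naive 'same' 2D convolution with reflect padding (fine for a sketch)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="reflect")
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out
```

Filtering a constant image with this kernel yields a response that is numerically zero everywhere, while a step edge produces a strong signed response along the boundary; in the actual network this filtering sits between densely connected layers as a bottleneck rather than as preprocessing.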

In contrast to prior methods that use binary cross-entropy for identifying face manipulations, we design a novel cost function that, unlike regular classification, favors compactness of natural-face representations while pushing manipulated faces away, yielding better, wider separation boundaries. This also improves the generalization of our technique across datasets, reaching a fair balance between bias and variance.
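The compactness/push-away idea can be sketched as a simple distance-based hinge loss in feature space: natural faces are pulled toward a reference point, and manipulated faces are penalized only when they fall inside a margin around it. The function name, the margin value, and the use of a single fixed center are assumptions for illustration; they are not the exact formulation of the paper's loss.

```python
import numpy as np

def isolation_loss(feats, labels, center, margin=5.0):
    """Sketch of a compactness + margin loss (hypothetical names/values).
    feats:  (N, D) embeddings from the backbone
    labels: (N,) 0 = natural face, 1 = manipulated face
    center: (D,) reference point for natural faces
    Natural samples pay their distance to the center (compactness);
    manipulated samples pay a hinge penalty if closer than `margin`."""
    d = np.linalg.norm(feats - center, axis=1)
    natural = labels == 0
    loss_nat = np.where(natural, d, 0.0)                         # pull in
    loss_man = np.where(~natural, np.maximum(0.0, margin - d), 0.0)  # push out
    return (loss_nat + loss_man).mean()
```

Unlike binary cross-entropy, which only asks for a separating surface, this objective shapes the geometry of the embedding: the natural class occupies a tight region and everything manipulated is driven outside a fixed radius, which is what widens the separation boundary.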

In addition, we apply sequential modeling for video-based detection. Our method processes sequences of aligned faces from a video, extracts discriminative features using the backbone, and performs recurrent modeling using bi-directional long short-term memory (LSTM) supervised by our new loss. 
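As a rough illustration of the recurrent stage, the numpy sketch below runs a single-layer LSTM over a sequence of per-frame backbone features in both temporal directions and concatenates the two hidden states at every frame. Weight shapes, gate ordering, and the absence of multiple layers are simplifying assumptions; the real model is trained end-to-end with the backbone under the loss described above.

```python
import numpy as np

def lstm_forward(xs, W, U, b):
    """One-direction LSTM pass over a (T, D) feature sequence (sketch).
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,).
    Gates are stacked in the order input, forget, output, candidate."""
    h_dim = U.shape[1]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    outs = []
    for x in xs:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        c = sig(f) * c + sig(i) * np.tanh(g)   # update cell state
        h = sig(o) * np.tanh(c)                # emit hidden state
        outs.append(h)
    return np.stack(outs)

def bilstm(xs, params_fwd, params_bwd):
    """Bi-directional pass: run the sequence forward and backward in time,
    then concatenate the hidden states frame by frame -> (T, 2H)."""
    h_fwd = lstm_forward(xs, *params_fwd)
    h_bwd = lstm_forward(xs[::-1], *params_bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)
```

Given T aligned-face feature vectors of dimension D, the output is a (T, 2H) sequence that sees both past and future frames, so temporal inconsistencies introduced by a manipulation can be picked up from either direction.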

The entire network is trained end-to-end so that the recurrent model back-propagates into the feature extractor. Our system's predictions on face videos downloaded from the web, when trained only on FaceForensics++, are shown in the graph below.

Experimental Results

In general, our approach achieves the highest accuracy across manipulations at all compression levels when trained on the four manipulation types (Deepfakes, FaceSwap, Face2Face, and NeuralTextures) along with natural faces. It also strikes a good balance between bias and variance, and shows superior cross-dataset generalization.

At the video level, our method remains strongly competitive with other state-of-the-art methods, as shown in the figure below.

However, there is still considerable room for improvement in real, web-scale systems operating at low false-alarm rates. In the near future, we intend to measure the impact of data augmentation and of additional external natural faces. In the long run, we want to add an explainability mechanism to our model that does not require pixel-level supervision of face manipulations.