Customizable Camera Verification for Media Forensics

September 02, 2021

Introduction

In media forensics, there has been extensive research on detecting fake news through the imagery (photos or videos) embedded in it. Not only can manipulated images make fake news; unaltered images can also be repurposed to make a story look real. Thus, when no evidence of manipulation can be found with the available scientific methods, one may still consider whether the materials were really taken at the event described in the news.

There has been extensive research on manipulation detection and repurposing detection aimed at verifying the digital integrity of the images, videos, and text in a news story. Beyond these approaches, researchers have also proposed comparing the PRNU [1] of the reporter's camera with the photograph in question. When successful, this approach can effectively tell whether the photo was "borrowed" from an undisclosed source (e.g., an online image search engine) rather than taken by the reporter. PRNU stands for Photo Response Non-Uniformity noise. Because of limited manufacturing precision, a CCD or CMOS sensor has a slightly different response from one pixel to another. This non-uniform response produces a fixed-pattern noise image when the camera captures a flatfield image (see Fig. 1). Identifying the PRNU pattern from a document image is comparatively easy. However, scene text images, which appear more often in news media, tend to have more complex backgrounds than document images, which makes isolating the PRNU pattern from the image a non-trivial problem (see the example in Fig. 2).
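
For context, the standard PRNU formulation (following [1]; the notation below is our paraphrase of the common model, not a derivation specific to our system) writes the sensor output as

I = I^{(0)} + I^{(0)} K + \Theta,

where I^{(0)} is the noise-free image, K is the multiplicative PRNU factor (the fixed pattern visible in Fig. 1), and \Theta collects the remaining noise sources. Given N images I_1, ..., I_N from the same camera and their denoising residuals W_i = I_i - F(I_i), with F a denoising filter, the camera fingerprint is commonly estimated as

\hat{K} = \frac{\sum_{i=1}^{N} W_i I_i}{\sum_{i=1}^{N} I_i^{2}}.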

Fig. 1. The PRNU patterns from four cameras, normalized using histogram equalization. Three flatfield images are shown for each camera. Original images are flatfield images selected from the NIST MFC20 dataset.

Fig. 2. A sample scene text image selected from the NIST MFC20 dataset.

Major challenges

1) Challenge in computation and memory requirement

One major challenge in camera verification is the large input dimension. The PRNU pattern can be thought of as the high-frequency noise that remains after a homomorphic filter is applied, so ideally the image should be fed into the neural network at its original resolution. This is very demanding on GPU memory given the image dimensions of modern digital cameras.
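
As a rough sketch of this extraction step (not our exact pipeline): the snippet below computes a high-frequency residual with a simple Gaussian denoiser standing in for the homomorphic filter, and crops the 224×224 center patch used later in this post. The function names and the use of scipy are assumptions for illustration.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def noise_residual(gray_image: np.ndarray, sigma: float = 1.0) -> np.ndarray:
        # Approximate the PRNU-bearing residual as the image minus a smoothed copy.
        # A Gaussian low-pass stands in for the homomorphic/denoising filter here.
        img = gray_image.astype(np.float32)
        return img - gaussian_filter(img, sigma=sigma)

    def center_patch(image: np.ndarray, size: int = 224) -> np.ndarray:
        # Crop a size x size patch from the image center (the region our classifier uses).
        h, w = image.shape[:2]
        top, left = (h - size) // 2, (w - size) // 2
        return image[top:top + size, left:left + size]

Running noise_residual on a full-resolution photo illustrates the memory pressure: the residual is as large as the original image, which is why feeding whole images to a network is so demanding.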

On the other hand, we experimented with the NIST MFC training data, trained a network with a CNN/dense-layer architecture, and obtained competitive performance. In fact, the classifier we built takes only a 224×224 patch from the center of the image and makes its decision from that patch alone (as Fig. 3 shows, the selected patch is a very small fraction of the entire image). Surprisingly, we obtained very competitive results: 85.8% AUC on the NIST MFC19 Eval set (the best team in the open challenge obtained 79.7%) and 77.5% on the MFC20 Eval set (lower than only one team, which reached an 87.2% AUC). Our deep learning classifier is also stronger than the Siamese network [7], since the latter can only perform image-to-image matching. Inspired by this result, we expect significantly higher performance when more data are fed into the network.

Fig. 3. A sample image selected from the NIST MFC20 dataset with a 224×224 pixel region at the center of the image highlighted.

2) Ad-hoc creation of training data

Because of the way camera verification is used in media forensics, the analyst does not receive an off-the-shelf model that can simply be deployed to predict image-to-camera similarity. Instead, a real-world application requires the analyst to collect training data, feed it to the provided training pipeline, and obtain a custom model. The risk of a custom model is that its performance is unknown at algorithm-design time. Since it is very difficult to give the analyst a perfect guideline for preparing the training set, we can only suggest basic tips, such as shooting as many scenes as possible, avoiding multiple photos of the same scene, and including a significant number of natural-scene images rather than flatfield images so as to emulate the actual distribution of test data. A good example is shown in Fig. 4, where the photos from a camera in NIST MFC20 show reasonable representation and coverage of real-world data samples.

Fig. 4. Some of the sample images from camera PAR-1579 of the NIST MFC20 dataset, showing a variety of scenes.

In addition, since each camera has its own model weights, even if the model uses a softmax layer to predict a normalized confidence score between 0 and 1, the same score can mean different levels of confidence when produced by different models. A principled way to handle this is to calibrate the score against its real confidence using validation data. It is also useful for the analyst to plot an ROC curve for each individual camera. Given a confidence score, one can then look up the raw ROC data to obtain the corresponding TPR/FPR values and provide a quantitative interpretation of the camera verification score.
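
A minimal sketch of that lookup, assuming scikit-learn and a per-camera validation set (the function name and interface are illustrative, not our production code):

    import numpy as np
    from sklearn.metrics import roc_curve

    def roc_lookup(val_labels, val_scores, query_score):
        # Map a raw confidence score to (TPR, FPR) using one camera's validation ROC.
        # val_labels: 1 = mismatched pair ("positive"), 0 = matched pair.
        fpr, tpr, thresholds = roc_curve(val_labels, val_scores)
        # Pick the operating point whose threshold is closest to the query score.
        idx = int(np.argmin(np.abs(thresholds - query_score)))
        return tpr[idx], fpr[idx]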

Related Work

The PRNU pattern of a natural scene image can be extracted with the noise residual operator [2], a homomorphic filter [3], image descriptors [4], or a deep convolutional feature extractor [5][6]. With any of these methods, the extracted PRNU is still heavily contaminated by noise, so a sophisticated metric learning model [7] is needed to compute the similarity between the image and the flatfield PRNU. Although camera verification is not a hot topic, established research has focused on how to represent the pattern and how to match it. Notably, NIST has collected MFC training data of substantial size and made it publicly available through its open challenges in recent years (2018-2020) [8]. Several systems showed promising results, with AUC in the range of 70% to 90%. These numbers are encouraging: with a further performance gain and some refinement of the application steps, the approach is well positioned to meet the requirements of real-world use.

Approach

Camera Verification Model

We adopted the VGG16 [9] architecture for feature extraction. The input layer takes a 224×224×3 RGB image. Following the last VGG16 layer, we appended a multi-layer classifier to separate positive (mismatched image and camera) from negative (matched image and camera) samples. The architecture of the classifier is shown in Fig. 5.

Fig. 5. Binary classifier architecture following the feature extraction stage.
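
A minimal Keras sketch of this design is shown below; the dense-layer widths in the head are placeholders for the exact configuration in Fig. 5, and only the output size is task-specific.

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    def build_verifier(num_outputs: int = 2) -> models.Model:
        # VGG16 backbone on a 224x224x3 patch, followed by a small dense head.
        # num_outputs = 2 for the binary (matched vs. mismatched) classifier;
        # set it to the number of training cameras for the closed-set variant
        # used in the open-set experiment later in this post.
        backbone = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))
        x = layers.Flatten()(backbone.output)
        x = layers.Dense(256, activation="relu")(x)  # head widths are illustrative
        x = layers.Dense(64, activation="relu")(x)
        out = layers.Dense(num_outputs, activation="softmax")(x)
        return models.Model(backbone.input, out)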

Fig. 6. Illustration of sampling multiple patches from an image. All of these patches can be used to predict the camera verification result.

Fig. 7. Examples of concatenated patches.

We could concatenate these patches for each photo to create a larger input image and feed it to the neural network. However, there are at least three advantages to letting the input layer take one patch at a time:

  1. Most camera verification applications cannot afford to collect many training images, so drastically increasing the input dimensions would lead to overfitting.
  2. A much larger input may also exceed the available GPU memory.
  3. When patches are not concatenated, it is more flexible to handle situations such as cropped images.

The only potential concern is the case where patches from different locations match each other very well. The chance of this happening is extremely low, because the PRNU pattern is a form of high-frequency noise and it is very unlikely that two 224×224 regions of a camera sensor have well-matched PRNU.

Here is how we handle multiple patches per photo in our system:

Training: Each patch is treated as an independent training sample and no concatenation is performed. Each patch is also rotated by 90, 180, and 270 degrees to create three more training samples. This rotation-based augmentation handles all possible rotations of the camera.

Inference: The trained model is run on all patches from a photo, and the maximum confidence score over the patches is taken as the confidence score of the photo, as in the sketch below.
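
A compact sketch of both steps; the grid-sampling helper and the prediction call are assumptions written for illustration, not our exact training code.

    import numpy as np

    def sample_patches(image: np.ndarray, size: int = 224, grid: int = 3):
        # Sample a grid x grid set of patches (grid=3 gives a 9-patch setting,
        # grid=5 a 25-patch setting); positions are spread uniformly over the image.
        h, w = image.shape[:2]
        ys = np.linspace(0, h - size, grid).astype(int)
        xs = np.linspace(0, w - size, grid).astype(int)
        return [image[y:y + size, x:x + size] for y in ys for x in xs]

    def augment_rotations(patch: np.ndarray):
        # Training-time augmentation: the patch plus its 90/180/270-degree rotations.
        return [np.rot90(patch, k) for k in range(4)]

    def photo_score(model, image: np.ndarray) -> float:
        # Inference: score every patch independently and keep the maximum confidence.
        patches = np.stack(sample_patches(image))
        scores = model.predict(patches)[:, 1]  # assumed: column 1 = "mismatched" class
        return float(scores.max())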

Experimental Results

Dataset Overview

The NIST MFC20 [8] dataset for camera verification training has 106 cameras and 35,695 natural scene images (the previously released MFC19 and MFC18 sets can be considered subsets of MFC20). The MFC20 Eval set has 11,288 image-camera pairs to verify; the MFC19 Eval set has 8,804 pairs. Some of the training images have already been shown in Fig. 4.

Single-patch Open Set Verification Results

As mentioned above, to run the NIST evaluation protocol we trained a model to predict a similarity score for each image-camera pair. All the scores were then collected to plot a global DET curve, from which the AUC was computed.

In this experiment, the classifier in Fig. 5 was modified slightly so that the output layer has as many units as the number of cameras in the training set rather than 2. The training process was therefore a closed-set training on known cameras, whereas the evaluation was performed as an open-set problem [10]: a negative sample is not required to belong to any of the cameras in the training set. The system took the output score corresponding to the claimed camera as the similarity between the photo and the camera. Only the central 224×224 patch of each photo was used in training and evaluation. The AUC numbers are shown in Table 1. These competitive results from a single 224×224 patch inspired us to expand our system to a multi-patch customizable modeling approach.
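
In other words, the open-set score is simply the softmax output at the index of the claimed camera; a minimal sketch (function and argument names are our own, assumed for illustration):

    import numpy as np

    def open_set_score(model, patch: np.ndarray, camera_index: int) -> float:
        # Similarity between a patch and a claimed camera: the softmax probability
        # the closed-set model assigns to that camera's output unit.
        probs = model.predict(patch[np.newaxis, ...])[0]
        return float(probs[camera_index])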

Table 1. Camera verification results on the NIST MFC19 and MFC20 Eval sets.

Set      DET-AUC (best team in NIST open challenge)   DET-AUC (ours)
MFC19    79.7%                                         85.8%
MFC20    87.2%                                         77.5%

Table 2. A random selection of cameras from the MFC20 training set.

Camera Name         Number of photos (including rotated)
285540_Primary      132
50050172_Primary    156
50052808_Primary    60
PAR-1216_Primary    1548
PAR-1226_Primary    72
PAR-1579_Primary    160
PAR-1580_Primary    36
PAR-1581_Primary    180
PAR-1583_Primary    148
PAR-1589_Primary    176
PAR-2629_Primary    2260
PAR-2631_Primary    1216
PAR-3645_Primary    388
PAR-4335_Primary    648
Negative            136

Multi-patch Customizable Modeling Results

We randomly selected the cameras shown in Table 2 from the MFC20 training set. A 64%:16%:20% ratio was adopted to create the training:validation:test partition. We used the training set to train the model with the binary classification architecture, selected the best model within 50 epochs by monitoring classification accuracy on the validation set, and applied that model to the test set.

Note that the objective of a real-world application is to find repurposed images, so it is important to define mismatched image-camera pairs as "positive". Detection performance can then be measured by the TPR at a given FPR, and both values can be read off an ROC curve.
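
A hedged sketch of how these two numbers can be computed from the scores with scikit-learn (our evaluation scripts may differ in detail):

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def auc_and_tpr_at_fpr(labels, scores, target_fpr: float = 0.002):
        # labels: 1 = mismatched (repurposed) pair, 0 = matched pair.
        auc = roc_auc_score(labels, scores)
        fpr, tpr, _ = roc_curve(labels, scores)
        # Largest TPR achievable without exceeding the target FPR (0.2% by default).
        feasible = fpr <= target_fpr
        tpr_at = float(tpr[feasible].max()) if np.any(feasible) else 0.0
        return auc, tpr_at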

After plotting the ROC curve for each camera, we obtained an AUC and a TPR@0.2%FPR from each curve. These per-camera values are shown in Table 3, and their averages over all cameras are shown in Table 4. The results show that upgrading from a 1-patch-per-image model to a 9-patch-per-image model significantly improved repurposing detection performance in both average AUC and average TPR (FPR=0.2%). Notably, the improvement in TPR makes the application much more powerful at detection while keeping false alarms under control.

We also plotted the ROC curves for camera PAR-1589, the camera with the lowest performance when using one patch per image. The gain from using multiple patches is clearly visible in these curves. More importantly, every TPR and FPR value on the curves is linked to its corresponding confidence score, so an analyst can retrieve the estimated TPR and FPR for each photo being examined and provide quantitative evidence for repurposing detection.

Table 3. Repurposing detection performance using camera verification (performance of each camera).

Camera ID           Metric         Single-patch   9-patch   25-patch
285540_Primary      ROC-AUC        100.0%         100.0%    100.0%
                    TPR@0.2%FPR    100.0%         100.0%    100.0%
50050172_Primary    ROC-AUC        89.8%          100.0%    100.0%
                    TPR@0.2%FPR    28.6%          100.0%    100.0%
50052808_Primary    ROC-AUC        83.9%          85.7%     71.4%
                    TPR@0.2%FPR    46.4%          57.1%     14.3%
PAR-1216_Primary    ROC-AUC        99.9%          92.0%     100.0%
                    TPR@0.2%FPR    89.3%          53.6%     100.0%
PAR-1226_Primary    ROC-AUC        100.0%         100.0%    99.1%
                    TPR@0.2%FPR    100.0%         100.0%    96.4%
PAR-1579_Primary    ROC-AUC        92.9%          100.0%    100.0%
                    TPR@0.2%FPR    42.9%          100.0%    100.0%
PAR-1580_Primary    ROC-AUC        100.0%         100.0%    100.0%
                    TPR@0.2%FPR    100.0%         100.0%    100.0%
PAR-1581_Primary    ROC-AUC        100.0%         99.2%     98.4%
                    TPR@0.2%FPR    100.0%         85.7%     85.7%
PAR-1583_Primary    ROC-AUC        100.0%         100.0%    100.0%
                    TPR@0.2%FPR    100.0%         100.0%    100.0%
PAR-1589_Primary    ROC-AUC        69.4%          99.1%     98.5%
                    TPR@0.2%FPR    0.0%           89.3%     78.6%
PAR-2629_Primary    ROC-AUC        98.4%          99.0%     97.4%
                    TPR@0.2%FPR    0.0%           57.1%     14.3%
PAR-2631_Primary    ROC-AUC        96.7%          99.9%     99.8%
                    TPR@0.2%FPR    0.0%           82.1%     71.4%
PAR-3645_Primary    ROC-AUC        100.0%         100.0%    98.8%
                    TPR@0.2%FPR    100.0%         100.0%    85.7%
PAR-4335_Primary    ROC-AUC        100.0%         100.0%    100.0%
                    TPR@0.2%FPR    100.0%         100.0%    100.0%

ROC curves for camera PAR-1589: (a) linear scale; (b) logarithmic scale.

Table 4. Repurposing detection performance using camera verification (Average of all cameras).

Metric              Single-patch   9-patch      25-patch
Avg ROC-AUC         95.1±8.9%      98.2±4.2%    97.4±7.5%
Avg TPR@0.2%FPR     64.8±43.0%     87.5±18.1%   81.9±30.1%