MEG: Multi-Evidence GNN for Multimodal Semantic Forensics

November 23, 2020

Introduction

Fake news has become a familiar term in recent years, and its infamy is exceeded only by the harm it causes in society: it can sway election outcomes or upend a stock's value. But what is fake news, and what gives it this undue influence? In simple terms, it refers to manipulated multimedia (e.g. text, images, videos) containing false information. A toxic mixture of truth and falsehood usually makes it believable and persuasive, so mitigating its impact deserves serious effort. This blog discusses methods to tackle a subset of the problem called image repurposing, where a pristine image is associated with false metadata to convey misinformation. Figure 1 shows an example of image repurposing.

Figure 1. An example of image repurposing in a multimedia package. An image of the replica of the Statue of Liberty taken in Japan has been falsified to present misleading information.

A combination of an image and its metadata is referred to as a multimedia package. Image repurposing detection is the task of identifying manipulated multimedia packages. The problem is particularly difficult because manipulated packages tend to be internally consistent, so external information is required to corroborate and validate the information in a package.

Previous works in the field have attempted to tackle the problem by verifying a questionable multimedia package against a single piece of evidence retrieved from a dataset of packages. Relying on a single piece of evidence, however, inherently limits efficacy. In this work we propose a graph neural network model that can utilize a variable number of evidence packages for verification.

Approach

The proposed graph neural network model (illustrated in Figure 2) has multiple components and stages, which are described in order:

Figure 2. An illustration of the proposed graph neural network for analyzing multiple pieces of evidence. Each branch corroborates an individual modality of the evidence.

Package Retrieval: Corroborating a query package requires additional information from a reliable reference dataset. The top-k reference packages are selected by scoring them; following previous work, each modality in a package is scored individually and the per-modality scores are averaged.
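The retrieval step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of cosine similarity as the per-modality score, and the dict-of-vectors package layout are all assumptions made here for clarity.

```python
import numpy as np

def retrieve_top_k(query, reference, k=3):
    """Score each reference package against the query by averaging
    per-modality similarities, then return the indices and scores of
    the top-k packages.

    query:     dict mapping modality name -> 1-D feature vector
    reference: list of dicts with the same modalities
    (illustrative sketch; cosine similarity is an assumed scoring choice)
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = []
    for pkg in reference:
        per_modality = [cos(query[m], pkg[m]) for m in query]
        scores.append(float(np.mean(per_modality)))  # combined average over modalities
    order = np.argsort(scores)[::-1][:k]             # highest combined score first
    return order.tolist(), [scores[i] for i in order]
```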

Feature Extraction: Learned feature extraction is an important component of deep learning models. Following previous literature, a CNN is used for the image modality, word2vec for the text modality, and global positioning system (GPS) coordinates directly for the location modality.
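The text and location extractors can be sketched in a few lines; a pretrained CNN would supply the image features and is omitted here. The function names, the averaging of word vectors, and the coordinate normalization are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def text_features(tokens, embeddings, dim=300):
    """word2vec-style text features: average the embedding vectors of
    in-vocabulary tokens (a common simple pooling; assumed here).
    `embeddings` is a hypothetical token -> vector lookup."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def location_features(lat, lon):
    """Location features straight from GPS coordinates, normalized to
    [-1, 1] so they can sit alongside learned features."""
    return np.array([lat / 90.0, lon / 180.0])
```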

Attention-based Evidence Matching: Matching evidence across multimedia packages requires comparing named entities rather than memorizing specific features. The proposed method uses attention-based modules to match query and retrieved features. This step is performed as a one-to-one matchup between the suspicious query package and each retrieved package.
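One way to realize such a matching module is scaled dot-product attention, sketched below under the assumption that each modality contributes a query vector and a stack of retrieved evidence vectors; the exact attention form used in MEG may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, evidence):
    """Scaled dot-product attention: the query feature attends over the
    retrieved evidence features of one modality, returning an
    evidence-weighted summary vector and the attention weights.

    query:    (d,) feature vector of the suspicious package
    evidence: (n, d) features of the n retrieved packages
    """
    d = query.shape[-1]
    scores = evidence @ query / np.sqrt(d)  # similarity of each evidence item
    weights = softmax(scores)               # normalized attention over evidence
    return weights @ evidence, weights
```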

Modality Summary: The modality summary module summarizes the features from all evidence within a single modality, so there are three summaries: one each for the image, text and location modalities. Cross-modal connections can also be utilized to improve inference. Individual cells of the graph neural network (GNN) follow a standard implementation from the literature.
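A standard GNN cell of the kind referenced above can be sketched as one round of message passing over the evidence nodes of a modality. The mean-aggregation, ReLU update, and mean-pooled summary below are generic textbook choices, assumed here since the post does not specify MEG's exact cell.

```python
import numpy as np

def gnn_summary(node_feats, adj, W_self, W_nbr):
    """One message-passing round over the evidence graph of a modality:
    each node averages its neighbors' features, mixes them with its own
    state through learned weights, and applies a ReLU; the modality
    summary is the mean of the updated node states.

    node_feats: (n, d) evidence node features
    adj:        (n, n) 0/1 adjacency matrix
    W_self, W_nbr: (d, d) learned weight matrices (random here)
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    msgs = (adj @ node_feats) / deg                     # mean over neighbors
    updated = np.maximum(0, node_feats @ W_self + msgs @ W_nbr)
    return updated.mean(axis=0)                         # modality summary vector
```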

Manipulation Detection: The final stage of the model combines the features from all modalities to produce an inference decision.
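The final fusion stage could be as simple as concatenating the three modality summaries and scoring them with a logistic head; this is a hypothetical minimal head for illustration, with the 0.5 threshold assumed rather than taken from the paper.

```python
import numpy as np

def detect(summaries, w, b):
    """Fuse per-modality summary vectors by concatenation and score the
    package with a logistic classifier; scores above 0.5 are flagged as
    repurposed (threshold assumed for illustration)."""
    fused = np.concatenate(summaries)
    z = fused @ w + b
    prob = 1.0 / (1.0 + np.exp(-z))  # probability the package is manipulated
    return prob, bool(prob > 0.5)
```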

Experimental Results

Evaluation of the proposed method is performed on three datasets from the literature: the Multimodal Entity Image Repurposing (MEIR) dataset, the Google Landmarks dataset and the Painter by Numbers dataset. A sample of query and retrieved packages is shown in Figure 3.

Figure 3. For each suspicious query package (on the left), multiple packages are retrieved. The retrieved package may be related (in green) or unrelated (in red).

Results of the proposed model, Multi-Evidence GNN (MEG), on all three datasets are shown in Table 1. The metric reported is the AUC score. A comparison against two baselines from the literature, the semantic retrieval system (SRS) and deep multimodal matching (DMM), is also shown.

| Method | MEIR | Painter by Numbers | Google Landmarks |
|--------|------|--------------------|------------------|
| SRS    | 0.67 | 0.77               | 0.93             |
| DMM    | 0.88 | 0.74               | 0.93             |
| MEG    | 0.92 | 0.86               | 0.94             |

Table 1. The proposed method (MEG) outperforms both baselines from the literature across all datasets (AUC scores).