BioFors: A Large Biomedical Image Forensics Dataset

August 30, 2021

Introduction

Virtually all aspects of modern society have been shaped by developments in science and technology. Most of the modern world invests in research and development and in turn uses the outcomes to shape a better future. At the core of this symbiotic relationship is an inherent trust in the integrity of the scientific process, i.e., that the experiments and findings of the scientific community are authentic. But what happens when this trust is shaken?

A famous case is that of Dr. Hwang Woo Suk, who falsely claimed to have successfully cloned human cells. His bold assertion drew further scrutiny, which ended up exposing him. Sadly, this is not an isolated case, and other instances of research misconduct keep surfacing. Among the various scientific domains, the biomedical research community repeatedly encounters paper retractions due to research misconduct. A common manifestation of misconduct is the presence of duplicated and forged biomedical images, used to present non-existent experimental findings.

Dr. Elisabeth Bik has been spearheading awareness of this problem and shows examples of papers with manipulated images. To anyone who is even vaguely familiar with the scientific publication process, an obvious question comes to mind: why doesn't the review process filter out papers with fraudulent images? The answer lies in difficulty and scale. Spotting duplicated or forged regions in biomedical images is an arduous task. Add to that the ever-increasing volume of publications, and it becomes nearly impossible for a reviewer to pinpoint cases of fraud. Figure 1 shows a pair of images with duplicated regions that you can test yourself on. Given the significance and challenges of the problem, it is important to develop automated methods for the verification of images.

Figure 1. Can you identify the duplicated regions between the two images? 

Dataset

Curation: Building a useful dataset requires positive samples, i.e., manipulated images. The images in this dataset come from a corpus of 1,031 documents graciously provided by Dr. Bik. In the absence of robust biomedical image extraction software, all images were hand-cropped in a two-step process, as outlined in Figure 2. Computer-generated images such as flowcharts, tables, histograms, graphs and diagrams are excluded from the dataset, leaving only images that are the outcome of biomedical experiments. The process resulted in 30,536 training and 17,269 test images.

Figure 2. Stages of manual cropping of images from scientific figures. Computer-generated images are left out.

Categorization: While all images come from the biomedical literature, they originate from a wide range of experiments and are therefore visually distinct. Visual similarity matters when developing deep learning or computer vision based models, so the images are grouped into four categories: Microscopy, Blot/Gel, Fluorescence-Activated Cell Sorting (FACS) and Macroscopy. Microscopy images are captured under a microscope and tend to highlight cellular features that are otherwise invisible to the naked eye. Blot/Gel images are usually grayscale and originate from protein analysis experiments. FACS images look similar to scatter plots but are the result of actual physical experiments. Finally, macroscopy images show diverse experimental artifacts. Figure 3 shows examples of the four image categories.

Figure 3. Sample images of the four image categories.

Tasks: A one-to-one mapping between biomedical image manipulations and the classical definitions of image forgery in the computer vision literature is not feasible. To this end, three new tasks are defined so that they are comprehensive yet relatable to the classical definitions: external duplication detection (EDD), internal duplication detection (IDD) and cut/sharp-transition detection (CSTD). The EDD task involves finding duplicated regions across images, while the IDD task involves finding duplicated regions within a single image. Finally, the CSTD task involves finding evidence of tampering within an image, usually in the form of discontinuities and sharp transitions. The dataset includes 1,547, 102 and 181 manipulated images for the EDD, IDD and CSTD tasks, respectively. Samples are shown in Figure 4.

Figure 4. Images showing EDD, IDD and CSTD Manipulations.
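To make the EDD task concrete, below is a minimal sketch of one classical approach to finding duplicated regions across a pair of images: matching local keypoint descriptors and looking for dense clusters of consistent matches. This is an illustrative baseline, not the paper's method; the file names and the distance threshold are placeholder assumptions.

```python
# A minimal sketch of external duplication detection (EDD) via keypoint
# matching. File names are placeholders; the threshold is illustrative.
import cv2

img_a = cv2.imread("panel_a.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("panel_b.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)

# Cross-check matching keeps only mutually nearest descriptor pairs.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)

# A dense, geometrically consistent cluster of strong matches suggests
# a duplicated region shared by the two images.
strong = [m for m in matches if m.distance < 40]
print(f"{len(strong)} strong matches out of {len(matches)}")
```

In practice, a geometric-consistency check over the matched keypoints (e.g., fitting an affine transform with RANSAC) would be needed to localize the duplicated region rather than merely detect it.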

Biomedical vs Natural Image Forensics

The paper further describes four aspects of biomedical image forensics that make it unique and more challenging than traditional natural image forensics: annotation artifacts, figure semantics, image texture and hard negatives. The presence of scientific annotations such as text, arrows and lines can increase the chance of false positives, since identical markers legitimately recur across figures. Figure semantics refers to legitimate variants of the same sample, e.g., produced by different chemical staining, that look similar without being manipulations. Biomedical images in general, and blot/gel images in particular, have a plain texture, which makes it difficult to extract discriminative feature descriptors. Finally, hard negative samples refer to visually similar but distinct images.
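As a toy illustration of the texture challenge (not an experiment from the paper), the snippet below runs a standard keypoint detector on a synthetic noise-rich image and on a smooth, blot-like one; the detector finds far fewer interest points on the latter, which starves descriptor-based duplicate detectors of evidence.

```python
# Toy demo: keypoint detectors struggle on smooth, low-texture images
# such as blots/gels. Both images here are synthetic stand-ins.
import cv2
import numpy as np

rng = np.random.default_rng(0)
textured = rng.integers(0, 256, (256, 256), dtype=np.uint8)  # noise-rich
plain = np.full((256, 256), 180, dtype=np.uint8)             # flat background
cv2.circle(plain, (128, 128), 40, 90, -1)                    # one faint band
plain = cv2.GaussianBlur(plain, (31, 31), 10)

orb = cv2.ORB_create(nfeatures=500)
for name, img in [("textured", textured), ("plain", plain)]:
    kps = orb.detect(img, None)
    print(f"{name}: {len(kps)} keypoints")
```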

Experimental Results

The dataset is benchmarked with multiple baselines, including both deep learning and classical methods. Evaluation is done at both the image and the pixel level, and Matthews correlation coefficient (MCC) scores are reported for each experiment. Table 1 shows the state-of-the-art (SotA) results for each task; detailed experimental results can be found in the paper.

Analysis | EDD   | IDD   | CSTD
Image    | 0.278 | 0.569 | 0.170
Pixel    | 0.324 | 0.364 | 0.080

Table 1. SotA results on each of the individual tasks, reported with the MCC metric.
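For reference, MCC is computed from the confusion-matrix entries as (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) and remains informative under heavy class imbalance, which is common here since manipulated pixels are a small minority of an image. Below is a minimal sketch of pixel-level scoring using scikit-learn, with illustrative stand-in masks rather than real predictions.

```python
# A minimal sketch of pixel-level MCC scoring, assuming binary
# ground-truth and predicted manipulation masks of the same shape.
import numpy as np
from sklearn.metrics import matthews_corrcoef

gt_mask = np.zeros((64, 64), dtype=np.uint8)
gt_mask[10:30, 10:30] = 1      # ground-truth manipulated region (stand-in)
pred_mask = np.zeros((64, 64), dtype=np.uint8)
pred_mask[12:32, 12:32] = 1    # predicted region (stand-in)

# Flatten the masks so each pixel is one binary classification decision.
score = matthews_corrcoef(gt_mask.ravel(), pred_mask.ravel())
print(f"pixel-level MCC: {score:.3f}")
```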

The SotA results in Table 1 for the EDD and IDD tasks come from classical duplicate- and forgery-detection models rather than deep networks. One reason is that existing deep models rely on coarse image features, which obscure the fine details present in biomedical images. These results highlight the importance of developing deep learning models dedicated to biomedical image forensics.