Multispectral Biometrics System Framework: Application to Presentation Attack Detection

April 20, 2021

Introduction

Biometric sensors have become ubiquitous in recent years, with an ever increasing number of industries introducing some form of biometric authentication to enhance security or simplify user interaction. They can be found on everyday items such as smartphones and laptops as well as in facilities requiring high levels of security such as banks, airports, or border control. Even though the widespread use of biometric sensors is intended to enhance security, it also comes with a risk of increased spoofing attempts. At the same time, the broad availability of commercial sensors gives easy access to the underlying technology for testing approaches that aim at concealing one's identity or impersonating someone else, which is the definition of a Presentation Attack (PA). Moreover, advances in materials technology have already enabled the development of Presentation Attack Instruments (PAIs), i.e., physical means for generating a PA, capable of successfully spoofing existing biometric systems.

Presentation Attack Detection (PAD) has attracted a lot of interest, with a long list of publications focusing on devising algorithms that use data from existing biometric sensors. In this work, we approach the PAD problem from a sensory perspective and attempt to design a system that relies primarily on the captured data, which should ideally exhibit a distinctive response for PAIs. We focus on capturing spectral data, i.e., acquiring images in various bands of the electromagnetic spectrum to extract additional information about an object beyond the visible spectrum. The higher dimensionality of multispectral data enables the detection of materials other than skin based on their spectral characteristics. A comprehensive analysis of the spectral emission of skin and different fake materials shows that at wavelengths beyond the visible range, the remission properties of skin converge across different skin types (i.e., different race or ethnicity), in contrast to a diverse set of lifeless substances. Additionally, multispectral data offer a series of advantages over conventional visible-light imaging, including visibility through occlusions and insensitivity to ambient illumination conditions.

Approach

In this work, we developed a general framework for building a biometrics system capable of capturing multispectral data from a series of sensors synchronized with active illumination sources. The framework unifies the system design for different biometric modalities and its realization on face, finger and iris data is described in detail. To the best of our knowledge, the presented design is the first to employ such a diverse set of electromagnetic spectrum bands, ranging from visible to long-wave-infrared wavelengths, and is capable of acquiring large volumes of data in seconds, which enabled us to successfully conduct a series of data collection events. We also present a comprehensive analysis on the captured data using a deep-learning classifier for presentation attack detection. Our analysis follows a data-centric approach attempting to highlight the strengths and weaknesses of each spectral band at distinguishing live from fake samples.

The main components of the proposed multispectral biometrics system framework are presented in the figure below. A biometric sample is observed by a sensor suite composed of various multispectral data capture devices. A set of multispectral illumination sources is synchronized with the sensors through an electronic controller board. A computer provides the synchronization sequence through a JSON file and sends capture commands that bring the controller and sensors into a capture loop, yielding a sequence of synchronized multispectral data from all devices. All captured data are then packaged into an HDF5 file and sent to a database for storage and further processing.
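To make the output format concrete, the following is a minimal sketch of how a capture could be packaged into an HDF5 file with h5py; the dataset names and layout here are hypothetical, since the actual file structure is dictated by the JSON configuration.

    # Minimal sketch: package synchronized frame sequences into one HDF5 file.
    # Dataset names and layout are illustrative, not the system's actual schema.
    import h5py
    import numpy as np

    def package_capture(output_path, captures):
        """captures: dict mapping a dataset name to a (frames, timestamps) pair."""
        with h5py.File(output_path, "w") as f:
            for name, (frames, timestamps) in captures.items():
                grp = f.create_group(name)
                grp.create_dataset("frames", data=np.asarray(frames), compression="gzip")
                grp.create_dataset("timestamps", data=np.asarray(timestamps))

    # Example with two hypothetical devices producing short frame sequences.
    package_capture("sample.h5", {
        "swir_camera": (np.zeros((4, 512, 640), dtype=np.uint16), [0.0, 0.033, 0.066, 0.099]),
        "thermal_camera": (np.zeros((4, 240, 320), dtype=np.uint16), [0.0, 0.040, 0.080, 0.120]),
    })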

Our system design (both in terms of hardware and software) is governed by four key principles:

  1. Flexibility: Illumination sources and capture devices can easily be replaced with alternative ones with minimal effort, both in terms of hardware and software development.
  2. Modularity: Whole components of the system can be disabled or removed without affecting the overall system’s functionality by simply modifying the JSON configuration file.
  3. Legacy compatibility: The system must provide at least some type of data that can be used for biometric identification through matching with data from older sensors and biometric templates available in existing databases.
  4. Complementarity: The variety of capture devices and illumination sources used aims at providing complementary information about the biometric sample that aids the task at hand.

Hardware

The hardware design follows all of the principles described above, providing a versatile system which can easily be customized for different application needs.

  • Illumination modules: We have designed a Light-Emitting Diode (LED) based illumination module which can be used as a building block for creating larger arrays of LEDs in various spatial configurations. It is specifically made to support Surface-Mount Device (SMD) LEDs for compactness. The module, shown on the left side of the figure, contains 16 slots for mounting LEDs and uses an LED driver chip with a Serial Peripheral Interface (SPI) which allows independent control of the current and Pulse-Width Modulation (PWM) for each slot. LEDs can be turned on/off, or their intensity can be modified, using a sequence of bits. Since current is controlled independently for each position, LEDs with different operating limits can be combined on the same module.

  • Controller Board: The controller board also follows a custom design and uses an Arduino-based microcontroller (Teensy 3.6), shown on the right side of the figure, which communicates with a computer through a USB2 serial port. The microcontroller offers numerous digital pins for SPI communication as well as two Digital-to-Analog Converters (DACs) for generating analog signals. The board offers up to 4 slots for RJ45 connectors which can be used to send SPI commands to the illumination modules through ethernet cables. Additionally, it offers up to 6 slots for externally triggering capture devices through digital pulses, whose peak voltage is regulated by appropriate resistors. The Teensy 3.6 has a limited amount of storage memory (1MB), on which a program capable of interpreting the commands of the provided configuration file is pre-loaded. At the same time, it provides an accurate internal timer for sending signals at millisecond intervals.

Software

The software design aligns with the principles of flexibility and modularity described above. We have adopted a microservice architecture which uses REST APIs such that a process can send HTTP requests for capturing data from each available capture device.

  • Device Servers: Each capture device must follow a device server interface and need only implement a class providing methods for its initialization, setting device parameters, and capturing a data sample (a minimal sketch of this idea is given after this list). This framework simplifies the process of adding new capture devices, which only need to implement the aforementioned methods and remain agnostic to the rest of the system design. At the same time, for camera sensors (which are the ones used in our realization of the framework), it additionally provides a general camera capture device interface for reducing any supplementary software implementation needs.
  • Configuration File: The whole system's operation is determined by a JSON configuration file. It defines which capture devices and illumination sources will be used, as well as the timestamps at which they receive activation or deactivation signals (an illustrative example follows this list). Further, it specifies initialization or runtime parameters for each capture device, allowing adjustments to their operational characteristics without any software changes. As such, it can be used to fully determine a synchronized capture sequence between all available illumination sources and capture devices. Optionally, it can define a different preview sequence used for presenting data to the user through the Graphical User Interface (GUI). Finally, it also determines the dataset names that will be used in the output HDF5 file to store the data from different capture devices.
  • Graphical User Interface: The GUI provides data preview and capture capabilities. In preview mode, it puts the system into a continuous loop of signals to all available capture devices and illumination sources and repeatedly sends HTTP requests to all underlying device servers while data are previewed on the computer screen. In capture mode, it first sends a capture request to each capture device for a predefined number of frames dictated by the JSON configuration file and then puts the controller into a capture loop for sending the appropriate signals. Captured data are packaged into an HDF5 file and sent to a database for storage.
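As a rough illustration of the device server interface mentioned above, the sketch below shows the kind of minimal class a new capture device might implement, wrapped by a thin REST layer; the class, method, and endpoint names are hypothetical, and Flask is used only as one possible choice of web framework.

    # Hypothetical device server sketch: a new capture device only needs to expose
    # initialization, parameter setting, and sample capture; a thin web layer turns
    # those methods into REST endpoints that the GUI can call over HTTP.
    from flask import Flask, jsonify, request

    class CaptureDevice:
        """Minimal interface a capture device class might provide."""
        def initialize(self):
            pass  # e.g., open the connection to the physical device

        def set_parameters(self, params):
            pass  # e.g., exposure time, gain, number of frames

        def capture(self):
            return None  # return one data sample (e.g., a frame)

    app = Flask(__name__)
    device = CaptureDevice()
    device.initialize()

    @app.route("/parameters", methods=["POST"])
    def set_parameters():
        device.set_parameters(request.get_json())
        return jsonify(status="ok")

    @app.route("/capture", methods=["POST"])
    def capture():
        sample = device.capture()
        return jsonify(status="ok", captured=sample is not None)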
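Similarly, the configuration file could resemble the abbreviated example below, here assembled and written out from Python; every field name is illustrative, since the actual schema is not spelled out in this text.

    # Hypothetical, abbreviated configuration; field names are purely illustrative.
    import json

    config = {
        "capture_devices": [
            {"name": "swir_camera", "dataset": "swir_camera", "num_frames": 4,
             "parameters": {"exposure_ms": 5.0}},
        ],
        "illumination": [
            # Times (ms) at which the controller toggles a given LED group.
            {"group": "nir_940", "on_ms": 0, "off_ms": 30},
            {"group": "all_off", "on_ms": 30, "off_ms": 60},  # ambient reference
        ],
        "preview": {"devices": ["swir_camera"]},
    }

    with open("capture_config.json", "w") as f:
        json.dump(config, f, indent=2)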

Developed Biometric Sensor Suites

The realization of the described framework on face, finger, and iris biometric modalities is presented below. For our presented systems, all capture devices are cameras, and all output data are frame sequences appropriately synchronized with the activation of particular light sources.

We use a variety of cameras, each one sensitive to different portions (Visible — VIS, Near InfraRed — NIR, Short Wave InfraRed — SWIR and Long Wave InfraRed — LWIR or Thermal) of the electromagnetic spectrum (see paper for details). The cameras differ in their resolution, frame rate, and dynamic range (bit depth). For some cameras, the sensitivity is restricted by using external band-pass filters in front of their lenses. The cameras were selected, among many options on the market, with the goal of balancing performance, data quality, user-friendliness, and cost (clearly, different sensors could be selected based on the application needs). All cameras supporting hardware triggering operate in blocking mode, i.e., they wait for trigger signals from the controller before a frame is captured. In this way, synchronized frames are obtained. A few cameras do not support hardware triggering and are instead synchronized using software countdown timers during the capture process. Even though this triggering mechanism is not millisecond-accurate, all cameras also store the timestamps of each frame, so that one can determine in software the frames closest in time to those originating from the hardware-triggered cameras.
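As a simple illustration of that timestamp matching step, the sketch below pairs each hardware-triggered frame with the nearest software-triggered frame in time; the exact matching logic used by the system may differ.

    # Sketch: for each hardware-triggered frame time, find the index of the closest
    # frame from a free-running (software-triggered) camera.
    import numpy as np

    def nearest_frames(hw_timestamps, sw_timestamps):
        hw = np.asarray(hw_timestamps)[:, None]
        sw = np.asarray(sw_timestamps)[None, :]
        return np.abs(hw - sw).argmin(axis=1)

    # Example: three hardware-triggered frames matched against five free-running ones.
    print(nearest_frames([0.00, 0.10, 0.20], [0.01, 0.05, 0.11, 0.19, 0.24]))  # [0 2 3]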

For the illumination modules, we chose a variety of LEDs emitting light at different wavelengths covering a wide range of the spectrum. Here, without loss of generality, we will refer to any separate LED type as representing a wavelength, even though some of them might consist of multiple wavelengths (e.g., white light). The choice of LEDs was based on previous studies on multispectral biometric data as well as cost and market availability of SMD LEDs from vendors. For each biometric sensor suite, we tried to maximize the available wavelengths considering each LED’s specifications and the system as a whole. Illumination modules are mounted in different arrangements on simple illumination boards containing an RJ45 connector for SPI communication with the main controller board through an ethernet cable. To achieve light uniformity, we created 6 main types of illumination modules which attempt to preserve LED symmetry. Wavelength selection and module arrangement for each sensor suite is presented in the figure below. In summary:

  • Face sensor suite: Employs 10 wavelengths mounted on 2 types of illumination modules and arranged in 4 separate groups. 24 illumination modules with 240 LEDs are used in total.
  • Finger sensor suite: Employs 11 wavelengths mounted on 3 types of illumination modules and arranged in 2 separate groups. 16 illumination modules with 184 LEDs are used in total.
  • Iris sensor suite: Employs 4 wavelengths mounted on a single illumination module type and arranged circularly. 8 illumination modules with 120 LEDs are used in total.

All system components are mounted using mechanical parts or custom-made 3D-printed parts and enclosed in metallic casings for protection and user interaction. Additionally, all lenses used have a fixed focal length, and each system has an optimal operating distance range based on the Field-of-View (FOV) and Depth-of-Field (DoF) of each camera-lens configuration used. It is important to note that our systems are prototypes and every effort was made to maximize efficiency and the variety of captured data. However, the systems could be miniaturized using smaller cameras, fewer or alternate illumination sources, or additional components, such as mirrors, for a more compact arrangement and reduced total form factor while maintaining the total length of the optical path. Such modifications would not interfere with the concepts of the proposed framework, which would essentially remain the same.

Face Sensor Suite

The face sensor suite uses 6 cameras capturing RGB, NIR (x2), SWIR, Thermal, and Depth data. An overview of the system is presented below. In addition to the LED modules, we use two large bright white lights on both sides of our system (not shown in the figure) to provide uniform lighting conditions for the RGB cameras. The subject sits in front of the system, and the distance to the cameras is monitored through the depth reading of the RealSense camera. We use a distance of about 62cm from the RealSense camera, which allows for good focus and the best FOV coverage from most cameras. For the cameras affected by the LED illumination, we also capture frames when all LEDs are turned off, which can be used as ambient illumination reference frames. In this configuration, the system is capable of capturing ~1.3GB of compressed data in 2.16 seconds.

Finger Sensor Suite

The finger sensor suite uses 2 cameras, sensitive to the VIS/NIR and SWIR parts of the spectrum. An overview of the system is depicted below. The subject places a finger on the finger slit of size 15mm x 45mm, facing downwards, which is imaged by the 2 available cameras from a distance of ~35cm. The finger sensor suite uses two additional distinct types of data compared to the remaining sensor suites, namely, Back-Illumination (BI) and Laser Speckle Contrast Imaging (LSCI).

  • Back-Illumination: Looking at the developed system, one can observe that the illumination modules are separated into two groups. The first lies on the side of the cameras, lighting the front side of the finger (front-illumination), while the second shines light atop the finger slit, which we refer to as BI. This allows capturing images of the light propagating through the finger and can be useful for PAD, either by observing light blockage by the non-transparent materials used in common PAIs or by revealing the presence of veins in the finger of a bona-fide sample. The selected NIR wavelength of 940nm enhances penetration through the skin as well as absorption of light by the hemoglobin in the blood vessels, making them appear dark. Due to the varying thickness of fingers among subjects, for BI images we use auto-exposure and capture multiple frames so that the intensity can be adjusted and the captured image is neither over-saturated nor under-exposed.
  • Laser Speckle Contrast Imaging: Apart from the incoherent LED illumination sources, the finger sensor suite also uses a coherent illumination source, specifically a laser at 1310nm, which directs a beam at the forward part of the system's finger slit. The laser is powered directly by the Teensy 3.6, and its intensity can be controlled through an analog voltage using the DAC output of the controller board. Illuminating a rough surface with a coherent illumination source produces an interference pattern, known as a speckle pattern. For static objects, the speckle pattern does not change over time. However, when there is motion (such as the motion of blood cells through finger veins), the pattern changes at a rate dictated by the velocity of the moving particles, and imaging this effect can be used for LSCI. The selected wavelength of 1310nm enables penetration of light through the skin, and the speckle pattern is altered over time as a result of the underlying blood flow for bona-fide samples. This time-dependent phenomenon can prove useful as an indicator of liveness and, in order to observe it, we capture a sequence of frames while the laser is turned on (a speckle-contrast sketch follows this list).
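For reference, spatial speckle contrast is commonly quantified as the local standard deviation of the intensity divided by its local mean; the sketch below computes this quantity per pixel over a small window, with the window size being an illustrative choice rather than the system's actual processing.

    # Spatial speckle contrast K = local std / local mean of the intensity.
    # Motion (e.g., blood flow) blurs the speckle pattern and lowers K over time,
    # while a static material keeps a high, stable contrast across frames.
    import numpy as np
    from scipy.ndimage import uniform_filter

    def speckle_contrast(frame, window=7):
        frame = frame.astype(np.float64)
        mean = uniform_filter(frame, size=window)
        mean_sq = uniform_filter(frame ** 2, size=window)
        variance = np.clip(mean_sq - mean ** 2, 0.0, None)
        return np.sqrt(variance) / np.maximum(mean, 1e-6)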

For each type of data captured under the same lighting conditions and the same camera parameters (i.e., exposure time), we also capture frames when all LEDs are turned off, which serve as ambient illumination reference frames. In this configuration, the system is capable of capturing ~33MB of compressed data in 4.80 seconds.
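One straightforward use of these reference frames, sketched below under the assumption of matched exposure settings, is to subtract the ambient-only frame from the LED-illuminated one to suppress ambient light.

    # Simple sketch: subtract the ambient-only reference frame from an LED-lit frame
    # captured with the same camera parameters, clipping negative values to zero.
    import numpy as np

    def remove_ambient(lit_frame, ambient_frame):
        diff = lit_frame.astype(np.int32) - ambient_frame.astype(np.int32)
        return np.clip(diff, 0, None).astype(lit_frame.dtype)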

Iris Sensor Suite

The iris sensor suite uses 3 cameras capturing NIR and Thermal data. An overview of the system is depicted below. The subject stands in front of the system at a distance of ~35cm, guided by the 3D-printed distance guide on the right side of the metallic enclosure.

One of the drawbacks of the current iris sensor suite is its sensitivity to the subject's motion and distance, due to the rather narrow DoF of the utilized cameras/lenses as well as the long exposure time needed for acquiring bright images. As a result, it requires careful operator feedback to the subject for appropriate positioning in front of the system. Higher-intensity illumination or narrow-angle LEDs could combat this problem by allowing the camera apertures to be closed further so that the DoF is increased. However, further research is required for this purpose, taking into consideration possible eye-safety concerns, which are not present in the current design since it employs very low-power LEDs.

Data Overview

An overview of the data captured by the proposed sensor suites for the face (left), finger (top-right), and iris (bottom-right) biometric modalities is presented below. For cameras affected by LED illumination or capturing different data types, the middle frame of the capture sequence is shown. For the remaining cameras, equally spaced frames of the whole captured sequence are presented. Images are resized for a visually pleasing arrangement, and the relative sizes of the images are not preserved.

Datasets and Evaluation

We have held 7 data collections with the proposed systems. However, our systems have undergone multiple improvements throughout this period, and some data are not fully compatible with the current version of our system. The datasets used in our analysis contain only data across data collections that are compatible with the current design (i.e., the same cameras, lenses, and illumination sources were used). They involve 5 separate data collections of varying size, demographics, and PAI distributions that were performed using 2 distinct replicas of our systems in 5 separate locations (leading to possibly different ambient illumination conditions and slight modifications in the positions of each system's components). Participants presented their biometric samples at least twice to our sensors, and a few participants engaged in more than one data collection. Parts of the data are already publicly available through separate publications, and the remainder could be distributed later by the National Institute of Standards and Technology (NIST).

In this work, we separate all data from the aforementioned data collections into two groups (data from the first 4 data collections and data from the last data collection). The main statistics for the two groups, which will be referred to as Dataset I and Dataset II, respectively, as well as their union (Combined), are summarized in the table on the side.

For each biometric modality, we define a set of PAI categories, which will be helpful for our analysis. We tried to form compact categories which encapsulate different PAI characteristics, as well as consider cases of PAI categories that are unknown between the two datasets. Finally, it is important to note that the age and race distributions of the participants in the two datasets are drastically different. Dataset I is dominated by young people of Asian origin, while Dataset II includes a larger population of Caucasians and African Americans with an age distribution skewed toward older ages, especially for face data.

In order to support the complementarity principle of our design, we devise a set of PAD experiments for each biometric modality. Two-class classification, with labels {0,1} assigned to bona-fide and PA samples, respectively, is performed using a convolutional neural network (CNN) based model, presented below. Due to the limited amount of training data, inherent in biometrics, we follow a patch-based approach where each patch in the input image is first classified with a PAD score in [0, 1] and the individual scores are then fused to deduce the final PAD score for each sample. Unlike traditional patch-based approaches, where data are first extracted for patches of a given size and stride and then passed through the network, we use an extension of the fully-convolutional-network (FCN) architecture.
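To make the patch-based idea concrete, here is a minimal PyTorch sketch of a fully-convolutional scorer whose output map assigns a score to each receptive-field "patch" before the scores are fused by averaging; the layer sizes and fusion rule are illustrative, not the architecture actually used in this work.

    # Illustrative patch-wise PAD scorer: a small fully-convolutional network emits
    # one score per spatial location (patch), and the patch scores are averaged into
    # a single PAD score per sample. Layer sizes are arbitrary choices for the sketch.
    import torch
    import torch.nn as nn

    class PatchPADNet(nn.Module):
        def __init__(self, in_channels=3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, 1, kernel_size=1),  # one logit per patch location
            )

        def forward(self, x):
            patch_scores = torch.sigmoid(self.features(x))  # (N, 1, H', W') in [0, 1]
            return patch_scores.mean(dim=(2, 3))            # fused PAD score per sample

    model = PatchPADNet(in_channels=3)
    print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1])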

The data for each biometric modality are preprocessed before being passed to the developed deep network. Examples of pre-processed multispectral data for bona-fide samples and select PAI categories for each biometric modality are presented in the following figure. In some cases, images have been min-max normalized within each spectral regime for better visualization.

Results

The goal of our analysis is to understand the contribution of each spectral channel or regime to the PAD problem, as well as the strengths and weaknesses of each type of data, by following a data-centric approach. Therefore, we use a model that remains the same across all compared experiments, per modality. In this way, we try to understand how performance is affected solely by the data rather than by the number of trainable model parameters, the specific model architecture, or other training hyperparameters.

We follow two different training protocols:

  • 3Fold: All data from the Combined dataset are divided into 3 folds. For each fold, the training, validation, and testing sets consist of 55%, 15%, and 30% of the data, respectively. The folds were created with the participants in mind, such that no participant appears in more than one set, which leads to slightly different percentages than the aforementioned ones (a subject-disjoint splitting sketch follows this list).
  • Cross-Dataset: Dataset I is used for training and validation (85% and 15% of the data, respectively) while Dataset II is used for testing. In this scenario, a few participants do appear in both datasets for the finger and iris cases, but their data were collected at a different point in time, at a different location, and using a different replica of our biometric sensor suites.
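A minimal sketch of such a subject-disjoint split, assuming per-sample participant IDs and using scikit-learn's group-aware splitter (the actual fold construction may differ), is shown below.

    # Sketch: split samples so that no participant appears in more than one of the
    # train/validation/test sets, using group-aware shuffling on participant IDs.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    def subject_disjoint_split(subject_ids, test_size=0.30, val_size=0.15, seed=0):
        subject_ids = np.asarray(subject_ids)
        idx = np.arange(len(subject_ids))
        outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
        trainval, test = next(outer.split(idx, groups=subject_ids))
        inner = GroupShuffleSplit(n_splits=1, test_size=val_size / (1 - test_size),
                                  random_state=seed)
        train, val = next(inner.split(trainval, groups=subject_ids[trainval]))
        # Because splits keep whole participants together, the realized percentages
        # only approximate the requested ones.
        return trainval[train], trainval[val], test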

We conducted a series of comprehensive experiments to analyze the PAD performance capabilities of the captured data. First, for all three biometric modalities, we perform experiments in which each spectral channel is used separately as input to the model. For face and finger data, due to the large number of channels, we further conduct experiments in which combinations of 3 input channels are used. This approach helps summarize the results in a compact form and also constitutes a logical extension: for face, 3 is the number of channels provided by the RGB camera, while for finger, there are 3 visible-light illumination sources, and LSCI data are inherently time-dependent, so sequential frames are necessary for observing this effect. Additionally, the second type of experiment can serve as a comparison to data currently available through commercial sensors, i.e., RGB-only data for face, visible-light channels for finger, and the IrisID camera's data for iris. We choose not to study larger channel combinations so that we accentuate the individual contribution of each type of available data to the PAD problem, but we always adhere to the rule of comparing experiments that use the same number of input channels and therefore the same number of trainable model parameters.
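As a trivial illustration of how such fixed-size channel combinations might be enumerated and stacked into model inputs (the channel names here are placeholders, not the suites' actual channel lists), consider:

    # Enumerate 3-channel combinations from a set of single-channel images and stack
    # each combination into one model input. Channel names are placeholders.
    from itertools import combinations
    import numpy as np

    def make_three_channel_inputs(channel_images):
        """channel_images: dict mapping channel name -> 2-D image of equal size."""
        inputs = {}
        for combo in combinations(sorted(channel_images), 3):
            inputs[combo] = np.stack([channel_images[c] for c in combo], axis=0)
        return inputs  # each value has shape (3, H, W)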

Each experiment uses the same model and training parameters. During training, each channel is standardized to zero mean and unit standard deviation based on the statistics of all images in the training set, and the same normalizing transformation is applied at test time. All experiments are performed under both the 3Fold and Cross-Dataset training protocols explained above. For each type of experiment, we also calculate the performance of the mean PAD score fusion of all individual experiments (denoted as Mean). As performance metrics, we report the Area Under the Curve (AUC), the True Positive Rate (TPR) at 0.2% False Positive Rate (FPR) (denoted as TPR0.2%), and the Bona-fide Presentation Classification Error Rate (BPCER) at a fixed Attack Presentation Classification Error Rate (APCER) of 5% (denoted as BPCER20 in the ISO standard).
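The sketch below computes these three metrics from a set of PAD scores, assuming the label convention above (0 for bona fide, 1 for PA) and treating PAs as the positive class; it is meant only to make the metric definitions concrete.

    # Compute AUC, TPR at 0.2% FPR, and BPCER20 (BPCER at APCER = 5%) from PAD
    # scores, with 0 = bona fide, 1 = presentation attack, higher score = attack.
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def pad_metrics(labels, scores):
        labels, scores = np.asarray(labels), np.asarray(scores)
        auc = roc_auc_score(labels, scores)

        fpr, tpr, _ = roc_curve(labels, scores)
        idx = np.searchsorted(fpr, 0.002, side="right") - 1
        tpr_at_02 = tpr[max(idx, 0)]  # TPR at the largest threshold with FPR <= 0.2%

        # BPCER20: bona-fide error rate at the threshold where 5% of attacks are missed.
        threshold = np.quantile(scores[labels == 1], 0.05)
        bpcer20 = np.mean(scores[labels == 0] >= threshold)
        return auc, tpr_at_02, bpcer20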

The results from all experiments are summarized below. The left part analyzes the single channel experiments by drawing error bars of the PAD score distributions for bona-fide samples and each PAI category defined earlier. The error bars depict the mean and standard deviation of each score distribution bounded by the PAD score limits [0,1]. Hence, full separation of error bars between bona-fides and PAIs does not imply perfect score separation. However, it can showcase in a clear way which channels are most effective at detecting specific PAI categories. The right part presents the calculated ROC curves and relevant performance metrics for the 3-channel experiments for face and finger and 1-channel experiments for iris.

Discussion

In general, the presented analysis suggests that, for each biometric modality, there are individual channels which can alone offer high PAD performance. However, the overall method followed in this work is, by design, not optimal for PAD purposes. In order to unveil the full power of multispectral data, one should use multiple input channels. Models employing multi-channel inputs can provide very high PAD accuracy. Additionally, an ideal model for multispectral data could further benefit from attention mechanisms (e.g., using transformers) or from channel selection among all available input channels and early or late fusion methods. Moreover, PAD methods could exploit the video nature of the captured data by observing liveness through natural human motion (e.g., eye blinking). The analysis presented herein does not aim to replace or be compared with optimal PAD methods. Instead, a single algorithmic approach was selected to facilitate a common analysis of different spectral bands on all studied biometric modalities and to showcase the strengths and weaknesses of distinct regimes in the captured data. Certainly, the shortcomings of PAD methods employing a limited number of input channels, such as the ones observed in the Cross-Dataset protocol analysis, could be alleviated by more sophisticated or robust classification methods (e.g., using pre-training, transfer learning, or fine-tuning techniques), but the purpose of our analysis was not to design the best possible PAD classification model. Instead, we analyze the capabilities of the data captured through the proposed sensor suites in two ways. First, we highlight the limitations originating from certain wavelength regimes and compare them to more effective spectral bands. Second, we stress the importance of the availability of a variety of spectral bands which offer complementary information that can be exploited for training robust classification models.

The presented framework and specific biometric sensor suite designs are the result of a multi-year effort during which multiple design decisions were changed to improve performance. Our earlier systems utilized fewer wavelengths, and their architecture was even more complex due to the use of lower-resolution cameras which required additional hardware components (e.g., fast-steering mirrors for finger) for a full observation of the biometric sample in certain regimes. On top of that, the capture was not synchronized: cameras were controlled only through software and had to be invoked sequentially to ensure capture under different illumination conditions. This resulted in very long capture durations and did not exploit the complementarity of spectra (e.g., an RGB camera cannot sense SWIR light). It soon became apparent that such an approach had various limitations. First, it led to the subject moving considerably throughout the capture, making data alignment challenging. Second, the system was inherently low-throughput due to the long capture durations (e.g., a full capture of a finger took around 2 minutes, versus 4.5 seconds in this work, while capturing less data), and employing such a system in a high-bandwidth environment was therefore very limiting. As a result, we adopted a hardware-based design that allows simultaneous and synchronized capture by all available sensors while exploiting each sensor's insensitivity to certain spectra, reducing the capture time to seconds. More importantly, the new design enabled the use of common hardware components across all biometric sensor suites, resulting in a unified framework that can easily be customized with different sensors and illumination sources to meet specific application needs. Additionally, the REST-API-based software design allows system components to operate in a distributed fashion (on different machines, if needed), which, in our experience, proved very valuable in cases of computer system failures.

Finally, we need to emphasize that the proposed framework is very general and can be adopted for different applications requiring high-bandwidth collection of multispectral data. Examples include, but are not limited to, multispectral material characterization and biomedical applications. Nevertheless, there is still room for improvement. Our design is an experimental prototype and could certainly be miniaturized by using custom-made sensors as well as a market-oriented compact design, if produced at scale. However, this would not interfere with the basic concepts of our framework, which capitalize on the re-usability of common components across different, application-specific, multispectral sensor suites.