Assessment of Facial Morphologic Features in Patients With Congenital Adrenal Hyperplasia Using Deep Learning

November 02, 2020


Congenital adrenal hyperplasia (CAH) is a genetic disorder that impairs the function of the adrenal glands. CAH is the most common adrenal insufficiency disorder in children and has serious lifelong health implications, including high blood pressure, diabetes, altered cognition, and obesity. Other adverse neuropsychological outcomes have also been identified over the lifespan of patients with CAH, including a heightened risk of psychiatric disorders, substance abuse, and suicide, and brain structural abnormalities have been identified in youths and adults with CAH.

Although CAH is often discovered during pregnancy, it is not curable, and patients must remain on hormone therapy for their entire lives. To date, there are no robust phenotypic biomarkers (i.e., observable characteristics or traits) with which physicians can adjust medication without asking patients to undergo other types of tests, such as blood work, which can be expensive, time consuming, and inefficient.

This lack of robust phenotypic biomarkers leads us to consider the human face, which contains a wealth of information, including health status and differences by sex. It has been shown that sex hormones (i.e., testosterone and estrogen) influence the development of sexually dimorphic facial features, with differential morphologic features in adults associated with umbilical cord blood testosterone levels. Our hope is to develop a phenotypic biomarker that physicians can easily use to adjust medications for patients with CAH and improve their quality of life.

Earlier facial analysis methods have relied on sets of manually engineered features, such as the facial width-to-height ratio, a masculinity index, or Euclidean distances between facial landmarks. However, these techniques have mostly been applied to syndromic genetic conditions with easily recognizable effects on facial morphologic features, rather than to the more subtle facial features of patients with CAH.


As shown in the figure below, we automatically detect the face in images acquired using an iPad at Children’s Hospital Los Angeles (CHLA). We then detect 68 facial landmarks and use them to estimate the 3-dimensional (3D) pose of the face (i.e., yaw, pitch, and roll rotation angles). Faces with extreme poses were discarded, while faces with a yaw angle < 30° were retained for further analysis. The detected landmarks were then used to perform 3D geometric alignment (i.e., frontalization) of the detected face to eliminate the effects of head pose in subsequent analyses, a step that has been shown to improve various facial analysis tasks in the literature.
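The pose-filtering step can be sketched as follows. The head-pose estimation itself (e.g., solving a perspective-n-point problem from the 68 landmarks, as with OpenCV's solvePnP) is assumed to be provided elsewhere; only the filtering logic and the 30° yaw cutoff described above are shown.

```python
# Sketch of the pose-filtering step. Head-pose angles (yaw, pitch, roll)
# are assumed to come from a landmark-based estimator applied to the 68
# detected landmarks; only the filtering logic is shown here.

YAW_THRESHOLD_DEG = 30.0  # faces rotated more than this are discarded

def keep_for_analysis(pose_deg):
    """pose_deg: (yaw, pitch, roll) in degrees; keep near-frontal faces."""
    yaw, _pitch, _roll = pose_deg
    return abs(yaw) < YAW_THRESHOLD_DEG

# Example: a frontal face, a strongly turned face, and a borderline face.
poses = [(5.0, 2.0, 1.0), (45.0, 0.0, 0.0), (-29.9, 10.0, 3.0)]
kept = [p for p in poses if keep_for_analysis(p)]  # first and third survive
```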

As a baseline, and to compare against the state of the art, we extracted 27 handcrafted features by calculating the 2D Euclidean distances between the 68 landmarks detected on the face. These features have been used to study sex differences of the face and the association of prenatal androgens with facial morphologic features. Because the landmark on top of the forehead is not a standard landmark detected by off-the-shelf landmark detection methods, we manually annotated the entire data set with this landmark. We used these 27 handcrafted features to perform a statistical analysis of the discriminability of features between patients with CAH and controls.
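A minimal sketch of how such distance features can be computed from detected landmarks. The landmark-index pairs below are illustrative stand-ins, not the study's actual 27 feature definitions.

```python
import math

# Hypothetical subset of the 27 handcrafted features: each entry names a
# pair of indices into a 68-point landmark array whose 2D Euclidean
# distance becomes one feature. The pairs are illustrative only.
FEATURE_PAIRS = [(36, 45), (48, 54), (27, 8)]  # e.g., eye span, mouth width, nose-to-chin

def handcrafted_features(landmarks):
    """landmarks: sequence of (x, y) tuples; returns one distance per pair."""
    return [math.dist(landmarks[i], landmarks[j]) for i, j in FEATURE_PAIRS]

# Dummy landmarks laid out on a line make the distances easy to verify.
dummy = [(float(i), 0.0) for i in range(68)]
feats = handcrafted_features(dummy)  # [9.0, 6.0, 19.0]
```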

Meanwhile, the aligned face images are fed into a convolutional neural network so that the network learns representations to predict the severity of CAH. We used the VGG16 model, pretrained to perform face recognition on the VGGFace dataset. The classification layers of VGG16 were replaced with a small network comprising 3 fully connected layers and a 2-output sigmoid layer indicating the CAH probability. In VGG16, the learned representations are 4096-dimensional, which is higher-dimensional than the 27D handcrafted feature vector and encodes more information for CAH score prediction.

Because our CAH data set is smaller than the data set used to train VGG16 (which is typical of medical applications), we froze the weights of the feature extraction part of the network and trained only the final layers of the modified network, exploiting the similarities between the face recognition domain and CAH facial analysis. This training scheme prevents the network from overfitting the training data set. Optimization used stochastic gradient descent with an initial learning rate of 0.05 and a cross-entropy loss; we trained the network for 20 epochs.
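The freeze-the-backbone scheme can be illustrated with a toy NumPy sketch: a fixed random projection stands in for VGG16's frozen convolutional layers, and only a sigmoid head is trained by SGD on a cross-entropy loss. Dimensions, data, and iteration count are illustrative, not the study's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "feature extractor" (stand-in for VGG16's convolutional layers):
# its weights are never updated during training.
W_frozen = rng.normal(size=(32, 8))

def features(x):
    return np.tanh(x @ W_frozen)

# Toy data: 64 samples, binary labels derived from the first input dimension.
X = rng.normal(size=(64, 32))
y = (X[:, 0] > 0).astype(float)

# Trainable sigmoid head, optimized with SGD on the cross-entropy loss.
w, b, lr = np.zeros(8), 0.0, 0.05  # 0.05 matches the initial rate above
for _ in range(500):
    f = features(X)
    p = 1.0 / (1.0 + np.exp(-(f @ w + b)))  # predicted probability
    grad = p - y                            # d(cross-entropy)/d(logit)
    w -= lr * f.T @ grad / len(y)
    b -= lr * grad.mean()

train_acc = float(((p > 0.5) == y).mean())  # only the head was updated
```

Freezing the backbone means only 9 parameters (w and b) are fit here, which is the same overfitting-prevention logic the study applies at much larger scale.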

Data and Evaluation

The study included 102 individuals with CAH (62 [60.8%] female; mean [SD] age, 11.6 [7.1] years [range, 3 weeks to 29 years]), of whom 81 were youths (aged 0-18 years) and 21 were young adults (aged 19-29 years); 81 had salt-wasting CAH, and 21 had simple-virilizing CAH. A total of 59 controls (30 [50.8%] female; mean [SD] age, 9.0 [5.2] years [range, 3 weeks to 26 years]) were recruited from the CAH clinic at CHLA. We acquired 993 CAH sample images and 446 control sample images. Among patients with CAH, 60 of 102 (59%) were Hispanic, and among controls, 34 of 59 (57.6%) were Hispanic. We studied 85 additional controls (48 [60%] female) younger than 29 years (1078 sample images) selected from public data sets. The Table summarizes the study population characteristics.

Given the limited size of our data set, and to avoid overfitting and bias, we adopted a 6-fold cross-validation strategy in which we divided the data into 6 folds of roughly equal size; the images of each subject appeared in only 1 fold to ensure statistical independence of the folds. For each experiment, 1 fold was used for testing, 90% of the remaining 5 folds were used for training, and 10% were used for validation. The distribution of CAH and control sample images was approximately the same across the 6 folds.
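A pure-Python sketch of the subject-exclusive partitioning, assuming each image is tagged with a subject ID. Assigning subjects (rather than images) to folds guarantees that no subject's images span folds; balancing folds by class, as described above, is omitted for brevity.

```python
from collections import defaultdict

def subject_folds(image_subjects, n_folds=6):
    """Assign image indices to folds by subject, so each subject's images
    land in exactly one fold (statistical independence of folds)."""
    by_subject = defaultdict(list)
    for idx, subj in enumerate(image_subjects):
        by_subject[subj].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Round-robin over subjects keeps fold sizes roughly equal.
    for k, subj in enumerate(sorted(by_subject)):
        folds[k % n_folds].extend(by_subject[subj])
    return folds

# 12 dummy subjects with 2 images each -> 6 folds of 4 images.
subjects = [i // 2 for i in range(24)]
folds = subject_folds(subjects)
```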

Given an input image, our models predicted a CAH score taking values in [0,1] and representing the probability that the test image depicts a patient with CAH. A predicted CAH score closer to 1 indicated a higher probability of having CAH. These predicted scores were binarized using thresholds varied within [0,1]. The false-positive rate and true-positive rate were calculated from the binarized decisions and then used to measure the performance of the different CAH prediction techniques in terms of area under the receiver operating characteristic curve (AUC), computed with a 95% confidence interval.
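The AUC can be computed directly as the probability that a randomly chosen positive sample outscores a randomly chosen negative one, which is equivalent to the area under the ROC curve traced by sweeping the binarization threshold over [0,1]. A minimal sketch with dummy scores:

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count half) -- equivalent to the area under the ROC
    curve obtained by sweeping the binarization threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Dummy CAH scores: labels 1 = CAH, 0 = control.
auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```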

Experimental Results

To evaluate group differences in the handcrafted features, we performed 2-tailed t tests between the CAH and control groups, considering a 2-sided P < .05 to be statistically significant.
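The per-feature comparison can be sketched with SciPy's independent-samples t test on dummy values (not study data):

```python
from scipy import stats

# Illustrative two-tailed t test for one handcrafted distance feature,
# comparing dummy CAH and control measurements (not the study's data).
cah_vals = [4.1, 4.3, 4.0, 4.4, 4.2, 4.5]
ctrl_vals = [3.1, 3.0, 3.3, 2.9, 3.2, 3.1]

t_stat, p_value = stats.ttest_ind(cah_vals, ctrl_vals)  # 2-tailed by default
significant = p_value < 0.05  # the study's significance threshold
```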

Comparing the 27 handcrafted facial features (used for the study of sex differences of the face and the association of prenatal androgens with facial morphologic features) between patients with CAH and controls, we found that 11 of 27 facial features were significantly different between the groups.

The receiver operating characteristic curves for the 6-fold partitioning for CAH classification using the 27 handcrafted features with linear discriminant analysis and random forest classifiers are shown in Figures A and B below. Averaging over the 6 folds, we obtained a mean (SD) AUC of 86% (5%) using linear discriminant analysis and 83% (3%) using random forest classifiers, indicating the ability to differentiate between the features of patients with CAH and controls. Extracting features using VGG16 provided higher prediction accuracy, with a mean (SD) AUC of 92% (3%) over the 6 folds (Figure C), demonstrating the presence of recognizable facial features that differ between patients with CAH and controls.

Among patients with CAH, the mean (SD) CAH score was similar between Hispanic (0.82 [0.28]) and non-Hispanic (0.81 [0.30]) patients (P = .80). The mean (SD) CAH score was also similar between patients with a Tanner stage of I to II (n = 52; 0.83 [0.28]) and those with a Tanner stage of III to V (n = 50; 0.81 [0.30]) (P = .96). There were no significant differences between the youngest patients (0-6 years; n = 31; mean [SD] score, 0.88 [0.24]) and those aged 7 to 12 years (n = 26; mean [SD] score, 0.76 [0.32]; P = .11), 13 to 18 years (n = 29; mean [SD] score, 0.82 [0.31]; P = .64), and 19 to 29 years (n = 16; mean [SD] score, 0.85 [0.28]; P = .94).

Further, an average face (amalgam) was computer-generated by first detecting facial landmarks for all faces in the data set and then using these landmarks to align the faces on top of one another by scaling and rotating the images. The aligned faces were then averaged separately for females and males within the CAH and control groups. These four landmark templates illustrate the differences in facial landmarks between the faces of individuals with CAH and those of controls.
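The template averaging amounts to a point-wise mean over the aligned faces in each group; a minimal NumPy sketch with dummy aligned landmark sets:

```python
import numpy as np

# After alignment, each face is a (68, 2) array of landmark coordinates;
# a group's template is simply the point-wise mean. Dummy data below:
# two "aligned" faces whose landmarks differ by a constant offset.
face_a = np.zeros((68, 2))
face_b = np.ones((68, 2)) * 2.0

group = np.stack([face_a, face_b])  # shape (n_faces, 68, 2)
template = group.mean(axis=0)       # shape (68, 2); every point at (1, 1)
```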