Towards Learning Structure via Consensus for Face Segmentation and Parsing

March 28, 2020

Introduction

Face segmentation and parsing are highly useful technologies because their output masks enable next-generation face analysis, enhanced face swapping, more complex face editing, and face completion. Although face parsing is closely related to generic semantic segmentation and uses the same methodology, it differs from scene object segmentation in that faces are roughly size- and translation-invariant.

While publicly available state-of-the-art models perform face segmentation, they place greater emphasis on building complicated architectures or complex face augmenters to simulate occlusions, or they rely on adversarial training instead. As shown in the figure above, they do so by modeling only two classes (face, non-face) and produce highly sparse, non-continuous predictions. Furthermore, training a network for structured prediction with pixel-wise softmax and cross-entropy makes the strong, overly simplistic assumption that pixel predictions are independent and identically distributed.
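To make the i.i.d. assumption concrete, here is a minimal NumPy sketch of standard per-pixel softmax cross-entropy: every pixel contributes an independent term, and nothing in the objective couples a pixel to its neighbors. This is an illustrative baseline, not the paper's code.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Standard per-pixel softmax + cross-entropy.

    logits: (H, W, C) raw class scores; labels: (H, W) integer class ids.
    Each pixel contributes an independent loss term, so the objective
    implicitly treats pixel predictions as i.i.d. and encodes no
    spatial structure.
    """
    # Numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    # Pick the log-probability of the true class at every pixel
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()
```

With uniform (all-zero) logits over C classes, the loss is log C at every pixel, exactly as C independent classification problems would give.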

Approach


We propose a novel face segmentation approach based on the concept of learning structure through consensus; the central idea of our strategy is depicted in the diagram above. Regular training works pixel by pixel, densely forcing each pixel to fit its label without considering the object's smoothness, which results in sparse predictions for unseen objects. We instead still employ pixel-wise labels for an image, but force the expected prediction within a blob toward the label while ensuring that no pixel deviates far from the average.
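The blob-level idea can be sketched as follows: rather than matching each pixel to the label independently, the mean prediction over a region is pulled toward the label while a variance term discourages pixels from deviating from that mean. This is a toy illustration of the consensus idea under stated assumptions; the function name, the scalar binary form, and the weighting are illustrative, not the paper's actual loss.

```python
import numpy as np

def consensus_loss(probs, region_mask, target, var_weight=1.0):
    """Toy consensus objective for one region (blob).

    probs: (H, W) predicted probability of a class at each pixel.
    region_mask: (H, W) boolean mask of one connected component.
    target: scalar label (0 or 1) the whole blob should agree on.

    Instead of matching every pixel to the label independently, the mean
    prediction over the blob is pulled toward the label, and a variance
    term keeps individual pixels from straying far from that mean.
    """
    vals = probs[region_mask]
    mean = vals.mean()
    consensus = (mean - target) ** 2      # blob-level agreement with label
    spread = ((vals - mean) ** 2).mean()  # deviation from the blob average
    return consensus + var_weight * spread
```

A blob that uniformly predicts the correct label incurs zero loss, while a half-right, half-wrong blob is penalized both for its biased mean and for its internal disagreement.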

Our method builds on an essential insight about the assumption of pixel-wise prediction independence: by relaxing it, we can isolate occlusions from the background. The network is better regularized for segmentation, producing less sparse predictions, and arrives at considerably more stable and robust predictions that are difficult to achieve with a pixel-wise loss.

Inspired by the Gestalt laws of proximity, closure, and continuation, we factor out occlusions as the difference between the complete face shape, obtained via 3D projection, and the output of a pre-existing face segmentation network, as shown in the figure below. We then take the connected components of these factorized occlusions and use them to formulate a new loss function that still performs dense classification while enforcing structure in the network through consensus learning.
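A minimal sketch of the connected-component step, assuming the occlusion mask has already been obtained as the residual between the projected face shape and the prior segmentation (the `occlusions` expression in the trailing comment uses illustrative names): a simple BFS flood fill splits the mask into the blobs that the consensus objective then operates on.

```python
from collections import deque

import numpy as np

def connected_components(mask):
    """Label 4-connected components of a boolean mask via BFS flood fill.

    Returns an (H, W) int array: 0 for background, 1..K for components.
    """
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    current = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                current += 1
                labels[i, j] = current
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels

# The occlusion mask would be the region covered by the projected 3D face
# shape but missing from the prior segmentation (illustrative names):
#   occlusions = face_shape_mask & ~segmentation_mask
```

In practice a library routine such as `scipy.ndimage.label` does the same job; the explicit flood fill is shown only to keep the sketch self-contained.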

Our objective is to robustly learn a nonlinear function, parameterized by the weights of a convolutional neural network, that maps pixel image intensities to a mask representing per-pixel semantic label probabilities of the face. A standard pixel-wise loss, by contrast, overlooks the regular structure present in faces and simply optimizes a cost that does not explicitly back-propagate any smoothness into the network parameters.

Experimental Results

Table: Part Labels set. Comparison of pixel and superpixel accuracies (acc_p, acc_sp). The input size and the use of smoothness via CRF are indicated. The best result is in bold; the second-best is underlined.

Despite using a lightweight model, as shown in the table above, our system reports results on par with or better than state-of-the-art methods on the COFW (Caltech Occluded Faces in the Wild) set and shows comparable (second best) results on the Part Labels set, noting that in our case we perform direct inference and we are not forcing any smoothness via CRF (conditional random field) at test-time. Notably, when compared to active research on adversarial training, our approach yields similar results.

Compared to the pixel-wise baseline, our method learns a more regular and smooth structure, which results in a more regular mask, as demonstrated in the figure below. Our hair segmentation has fewer holes and fractured segments than the baseline, while still producing an excellent face segmentation.