SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation

August 27, 2021


Semantic Segmentation

Human beings can easily understand the scenes we see. For example, in a bedroom, we know where the bed and the TV are (Fig 1). The process by which a computer understands a scene and locates the objects in it is called semantic segmentation.

Semantic segmentation has many applications in our life. For example, in self-driving cars, the system needs to understand the road, traffic signs, pedestrians, and so on in order to decide when and where to go. In medicine, researchers can use semantic segmentation to analyze medical images, such as magnetic resonance imaging, to find potential tumors.

In Computer Vision, people usually regard semantic segmentation as a special classification problem. In the traditional classification problem, the system only needs to give one label to a given image. In semantic segmentation, by contrast, the system classifies every pixel, producing a dense annotation.

Out-of-distribution Samples and Zero-shot Learning

A machine learning model can perform well on in-distribution samples (i.e., classes/objects the model is trained on). In real-life applications, however, it is hard to collect a perfect training dataset that includes every object the model may come across in practice. Objects that appear only during testing are called out-of-distribution (OOD) samples. When training a model, we would like to endow it with the ability to recognize unseen/OOD objects so that the system, e.g. a self-driving car, won't crash or make disastrous decisions when coming up against an OOD sample.

Formally, the label space of images is separated into two parts: seen categories C_S and unseen categories C_U. The training set contains only in-distribution samples D_S = {(x, y) | ∀i, y_i ∈ C_S}, where x is an image, y is the corresponding ground-truth label map, and y_i denotes the label of the i-th pixel. Images containing unseen categories are denoted by D_U = {(x, y) | ∃i, y_i ∈ C_U}. In the zero-shot semantic segmentation problem, the model is trained on a subset D_train ⊆ D_S and tested on a subset D_test ⊆ D_S ∪ D_U.
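As an illustration of this split, here is a toy membership check; the class names and flattened label maps are hypothetical, for illustration only:

```python
# Hypothetical seen/unseen class split, for illustration only.
SEEN = {"person", "car", "dog"}    # C_S
UNSEEN = {"cat", "sheep"}          # C_U

def in_distribution(label_map):
    """D_S membership: every pixel label is a seen class."""
    return all(label in SEEN for label in label_map)

def contains_unseen(label_map):
    """D_U membership: at least one pixel label is an unseen class."""
    return any(label in UNSEEN for label in label_map)

# Label maps are flattened to lists of per-pixel labels for simplicity.
print(in_distribution(["person", "car", "dog"]))   # True
print(contains_unseen(["person", "cat"]))          # True
```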

Zero-shot Semantic Segmentation

The mainstream solutions of zero-shot semantic segmentation share ideas with zero-shot classification. The key problem to solve is generalizing the model to unseen classes. A common approach in zero-shot learning is to mine the relationships between classes (e.g. embedding similarity) from language tasks (e.g. a pre-trained word2vec[2] language model) and let the model learn the same class relations, implicitly or explicitly, during training. In practice, we use word2vec and fastText[8] for word embeddings.
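To make the idea of mining class relations concrete, here is a toy cosine-similarity check; the 3-d vectors are made-up stand-ins for real 300-d word2vec/fastText embeddings:

```python
import math

# Toy 3-d vectors standing in for word2vec/fastText embeddings.
# Real embeddings are 300-d; these values are made up for illustration.
EMB = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "plane": [0.0, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically close classes should have higher embedding similarity;
# this is the class relation the segmentation model is trained to mimic.
assert cosine(EMB["cat"], EMB["dog"]) > cosine(EMB["cat"], EMB["plane"])
```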

Bucher et al.[3], Gu et al.[4], and Li et al.[5] generate synthetic features for unseen classes to train the model. These features are synthesized from word2vec embeddings and random noise. Our SIGN model is also a generative model that produces synthetic features for unseen classes. There are also non-generative methods; for example, Xian et al.[6] encourage the model to produce features similar to word2vec embeddings by directly using the embeddings as the model's last-layer classifier.


SIGN Model

As mentioned above, the SIGN model is a GAN-based model. It has five learnable networks and one non-learnable network.

Fig 2. (a-c) Three training steps of SIGN model. (d) The architecture of SIM module.

The functionality of each network:

    • E: A CNN feature encoder using the Deeplab-v2 architecture[7].
    • G: The generator that synthesizes features for unseen classes.
    • D: The discriminator in GAN training, which distinguishes real features (from SIM) from synthetic features (from G).
    • C: The last layer of the model, which functions as a classifier.
    • SIM: The spatial information module, which produces a spatial latent code for the generator.
    • M: The non-learnable mapping network, which converts annotation indexes (e.g. 0, 1, …) into word embeddings. It can be seen as a dictionary with integer keys and word-embedding values.
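A minimal sketch of M as a plain dictionary lookup; the indexes, class names, and 3-d vectors below are placeholders, not real embeddings:

```python
# Stand-in for the non-learnable mapping network M: a dictionary from
# annotation index to a (toy) word-embedding vector. Real embeddings
# would come from word2vec/fastText and be 300-d.
WORD_EMBEDDINGS = {
    0: [0.1, 0.2, 0.3],   # e.g. "background"
    1: [0.5, 0.4, 0.1],   # e.g. "person"
    2: [0.2, 0.7, 0.6],   # e.g. "car"
}

def M(annotation_index):
    """Map an integer class index to its word embedding."""
    return WORD_EMBEDDINGS[annotation_index]

print(M(1))   # [0.5, 0.4, 0.1]
```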

The high-level idea of training is as follows. First (Fig 2.a), we train a standard semantic segmentation model that performs well on seen classes; in the meantime, we train the SIM module so that it produces a Gaussian-like spatial latent code. Next (Fig 2.b), we train the generator adversarially so that the synthetic features incorporate spatial information. Finally (Fig 2.c), we use both real and synthetic features to train the classifier to generalize to unseen classes. (Note that real features are not shown in Fig 2.c.)

The SIM module (Fig 2.d) takes as input the image features and a 2D Relative Positional Encoding (RPE). Positional encoding[9] is a common NLP technique for indicating the location of a word in a sentence. We borrow this idea and adapt it to the image domain with two improvements (Fig 3): i) 2D encoding: a pixel's encoding consists of two vectors, indicating its horizontal and vertical locations; ii) relative encoding: a pixel is encoded by its relative location in the feature map, in order to handle varying image sizes during testing. The SIM module outputs features for the classifier C and a Gaussian-distributed spatial latent code. The Gaussian property of the spatial latent code is achieved by minimizing its KL-divergence to Gaussian noise.
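The 2D relative encoding can be sketched as follows. The sinusoidal form and dimensions here are illustrative assumptions, not the paper's exact formulation; the key property is that a pixel's code depends only on its relative position, so the same corner of a small and a large feature map gets the same code:

```python
import math

def rpe_2d(height, width, dim=4):
    """Toy 2D relative positional encoding (a sketch, not the paper's
    exact formulation): each pixel gets a vertical and a horizontal
    sinusoidal vector computed from its relative coordinate in [0, 1],
    so the encoding is independent of the absolute feature-map size."""
    def encode(rel_pos):
        vec = []
        for k in range(dim // 2):
            freq = 2 * math.pi * (2 ** k)
            vec.append(math.sin(freq * rel_pos))
            vec.append(math.cos(freq * rel_pos))
        return vec

    enc = [[None] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            rel_y = i / (height - 1) if height > 1 else 0.0
            rel_x = j / (width - 1) if width > 1 else 0.0
            enc[i][j] = encode(rel_y) + encode(rel_x)  # vertical + horizontal
    return enc

# The same relative corner gets the same code regardless of map size.
small = rpe_2d(8, 8)
large = rpe_2d(32, 32)
assert small[0][0] == large[0][0]      # top-left corner
assert small[7][7] == large[31][31]    # bottom-right corner
```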

Fig 3. Left: PE in NLP. Middle: 2D PE with the pixel's absolute location. Right: 2D RPE with the pixel's relative location in the feature map, which handles varying image sizes during testing.

Once the encoder E and the SIM module are trained, we freeze their weights and train the generator G (Fig 2.b). The generator is optimized with a GAN loss, so that each (pixel-level) synthetic feature is indistinguishable from real features, and a Maximum Mean Discrepancy (MMD) loss, so that the overall (class-level) distribution of synthetic features is close to that of real features. Compared to previous GAN-based methods, the generator in the SIGN model uses the concatenation of a semantic latent code (the word embedding) and a spatial latent code to generate synthetic features.
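For the distribution-matching term, here is a minimal RBF-kernel MMD sketch; the kernel bandwidth and feature shapes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def mmd_rbf(real, fake, sigma=4.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel: a sketch of
    the class-level distribution-matching loss. `real` and `fake` are
    (num_samples, feature_dim) arrays; sigma is illustrative."""
    def kernel(a, b):
        # Pairwise squared distances, then a Gaussian kernel.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return (kernel(real, real).mean() + kernel(fake, fake).mean()
            - 2 * kernel(real, fake).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(64, 8))    # stand-in real features
close = rng.normal(0.0, 1.0, size=(64, 8))   # matching distribution
far = rng.normal(3.0, 1.0, size=(64, 8))     # shifted distribution

# Matching distributions yield a smaller MMD than mismatched ones,
# which is what drives the generator toward the real feature statistics.
assert mmd_rbf(real, close) < mmd_rbf(real, far)
```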

Fig 4. a) Gaussian noise prior in [3]. b) Contextual prior in [4]. c) Our spatial prior.

Annealed Self-Training (AST)

In the zero-shot learning problem, unlabeled samples containing unseen classes are sometimes available. A trained model can annotate these unlabeled samples to obtain additional training data, and the model can then be fine-tuned on the pseudo-annotated data. Such a training strategy is called self-training.

We propose a knowledge-distillation-inspired self-training strategy, namely Annealed Self-Training (AST), to generate better pseudo-annotations for self-training.

AST assigns a weight to each pseudo-label. Higher-weighted labels have more impact on model optimization. The weights are computed by an annealed softmax function (Fig 5), so predictions with higher probability are assigned higher weights.
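A toy version of the weighting, assuming (as a simplification of the paper's scheme) that the weight is the annealed-softmax score of the top class; the temperature value is illustrative:

```python
import math

def pseudo_label_weight(probs, temperature):
    """Weight for a pseudo-label: the annealed-softmax score of the top
    class. A low temperature sharpens the distribution, so confident
    predictions dominate optimization; the temperature schedule
    ("annealing") is an assumption of this sketch."""
    scaled = [math.exp(p / temperature) for p in probs]
    return max(scaled) / sum(scaled)

confident = [0.90, 0.05, 0.05]   # peaked class probabilities
uncertain = [0.40, 0.35, 0.25]   # flat class probabilities
t = 0.1

# Confident predictions receive higher pseudo-label weights.
assert pseudo_label_weight(confident, t) > pseudo_label_weight(uncertain, t)
# A higher temperature flattens the weights toward uniform.
assert pseudo_label_weight(confident, 10.0) < pseudo_label_weight(confident, t)
```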

Fig 5. Pseudo-label weights assigned by Annealed Self-Training. 

Experimental Results

We evaluate the performance on three datasets and compare with three baselines; the detailed evaluation protocols can be found in the paper. We use mean intersection-over-union (mIoU) as the evaluation metric and report mIoU on seen classes, mIoU on unseen classes, and the harmonic mean of the two (Table 1 & Fig 5).
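The harmonic mean penalizes models that score well on seen classes only. A minimal helper (the numbers below are made up, not from Table 1):

```python
def harmonic_miou(seen, unseen):
    """Harmonic mean of seen and unseen mIoU: low if either score is low,
    so a model cannot look good by ignoring unseen classes."""
    if seen + unseen == 0:
        return 0.0
    return 2 * seen * unseen / (seen + unseen)

# A model that ignores unseen classes scores 0 despite high seen mIoU.
print(harmonic_miou(80.0, 0.0))    # 0.0
print(harmonic_miou(80.0, 20.0))   # 32.0
```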

Table 1. Zero-shot semantic segmentation performance. ST: self-training, AST: annealed self-training. SPNet[6], ZS3[3], CaGNet[4]

Fig 5. Qualitative comparison on Pascal VOC and COCO Stuff

We also show the performance improvement from relative positional encoding (Table 2) and annealed self-training (Fig 6).

Table 2. mIoU on Pascal VOC. APE: absolute positional encoding. APE w/Inter: absolute positional encoding with interpolation when the image size changes. RPE: relative positional encoding.


Fig 6. Qualitative comparison of predictions without self-training, with self-training, and with our AST.