PAL: Partner-Assisted Learning for Few-Shot Image Classification

October 15, 2021

Introduction

Despite the recent success of deep learning in many vision tasks, such as image classification, object recognition, and image segmentation, a well-known weakness of traditional methods is that they require a large amount of labeled data for training. However, data annotation can be expensive, and in many real-world scenarios only a few samples are available, such as images of endangered species.

Few-shot learning (FSL) has been proposed to address this issue by mimicking the human visual system, which can efficiently learn the appearance of a novel object from a few instances. In the context of FSL, there are two types of classes: base classes and novel classes. A large amount of labeled data is available for the base classes, whereas only a few samples (e.g., 1 or 5 shots) are available for each of the novel classes. Meta-learning has been proposed to simulate few-shot tasks during training, either by designing an optimal adaptation algorithm [1] or by learning a shared feature space for prototype-based classification [2]. Prototype classification, as shown in Figure 1(a), estimates a prototype for each class from its few labeled samples, so that a new sample can be classified by computing its similarity to every prototype and performing a nearest-neighbor search. As shown in Figure 1(b), prototype classification benefits when the feature distribution is discriminative between clusters and compact within each cluster.
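A minimal sketch of the prototype classification described above, in plain Python. It assumes that each prototype is the mean of a class's support features, as in prototypical networks [2], and that similarity is measured by cosine similarity; the text fixes neither choice. The toy feature vectors and the helper names (`build_prototypes`, `classify`) are hypothetical.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def build_prototypes(support_features, support_labels):
    """Group the labeled support features by class and average each group."""
    by_class = defaultdict(list)
    for feature, label in zip(support_features, support_labels):
        by_class[label].append(feature)
    return {label: mean_vector(feats) for label, feats in by_class.items()}


def classify(query_feature, prototypes):
    """Nearest-neighbor search: return the class of the most similar prototype."""
    return max(prototypes,
               key=lambda label: cosine_similarity(query_feature, prototypes[label]))


# Toy 2-way 2-shot task with made-up 2-D features.
support = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = ["cat", "cat", "dog", "dog"]
prototypes = build_prototypes(support, labels)
prediction = classify([0.8, 0.3], prototypes)
print(prediction)
```

In practice the feature vectors would come from a trained encoder rather than being written by hand; the classification step itself stays this simple.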

Recent work [3] has shown that pretraining a model with full supervision on the base classes can serve as a strong baseline for novel few-shot tasks when combined with prototype classification. However, conventional supervised pretraining might overfit the feature extractor to the base classes, suppressing detailed information that is discriminative for novel classes but not for base classes. Knowledge distillation has been adapted to this problem by formulating a teacher-student setting [4], where the teacher model provides soft labels for the student model so that more details can be preserved. Despite its success, the performance improvement is still limited, because the teacher model was itself rigidly optimized against the hard labels (ground truth).


Approach

In this paper, we introduce Partner-Assisted Learning (PAL), a framework for representation learning in the few-shot classification setting. As shown in Figure 1(c), we propose to extract features that can dynamically represent classes, and to use those features as soft anchors that regularize a feature extractor trained from scratch with hard anchors. As shown in Figure 2, PAL consists of a Partner Encoder and a Main Encoder. The two encoders are trained in sequence, so that the well-trained Partner Encoder can regularize the training of the Main Encoder.

The Partner Encoder is trained on the base classes using supervised contrastive learning [5] (SupCT), which clusters the features by performing pairwise comparisons among all feature instances. Features of the same class are pulled together, while features from different classes are pushed apart. Then, the Main Encoder is trained on the base classes using the cross-entropy loss for classification, together with alignment constraints provided by the fixed Partner Encoder. The constraints are provided at two levels: logit-level alignment and feature-level alignment. At the logit level, a shared classifier is applied to extract the logits for the Partner Encoder and the Main Encoder respectively, where the inputs to both encoders are of the same class. The alignment is achieved by minimizing the cross-entropy between the two predictions. At the feature level, similar to the idea in supervised contrastive learning, the alignment is achieved by performing pairwise comparisons between the features generated by the two encoders. During few-shot evaluation, we directly use the pre-trained Main Encoder for prototype classification.
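The two alignment constraints can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the exact losses used in PAL: it assumes the Partner's softmax output is the target in the logit-level cross-entropy, and that the feature-level term is a SupCon-style loss [5] in which each Main feature is an anchor compared against all Partner features. The temperature `tau` and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]


def logit_alignment(partner_logits, main_logits):
    """Cross-entropy between the Partner prediction (assumed target) and the Main prediction."""
    target = softmax(partner_logits)
    pred = softmax(main_logits)
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))


def feature_alignment(main_feats, main_labels, partner_feats, partner_labels, tau=0.1):
    """SupCon-style loss: pull each Main feature toward same-class Partner features."""
    main_feats = [normalize(f) for f in main_feats]
    partner_feats = [normalize(f) for f in partner_feats]
    losses = []
    for anchor, label in zip(main_feats, main_labels):
        # Temperature-scaled similarity of this anchor to every Partner feature.
        sims = [sum(a * p for a, p in zip(anchor, pf)) / tau for pf in partner_feats]
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        positives = [s for s, pl in zip(sims, partner_labels) if pl == label]
        if positives:
            # Average negative log-probability of the same-class (positive) pairs.
            losses.append(-sum(s - log_denom for s in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy check: Main features that agree with the Partner give a lower loss.
partner = [[1.0, 0.0], [0.0, 1.0]]
classes = [0, 1]
aligned = feature_alignment([[1.0, 0.0], [0.0, 1.0]], classes, partner, classes)
swapped = feature_alignment([[0.0, 1.0], [1.0, 0.0]], classes, partner, classes)
print(aligned < swapped)
```

In a full training loop these two terms would be added to the classification cross-entropy of the Main Encoder, with the Partner Encoder kept fixed; how the terms are weighted is not specified here.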

Experimental Results

We evaluated PAL on four benchmark datasets: miniImageNet, tieredImageNet, CIFAR-FS, and FC100. Unlike regular classification tasks, FSL models are commonly evaluated on N-way K-shot tasks, where each task contains N novel classes and each class contains K labeled samples. As shown in Table 1 and Table 2, PAL consistently outperforms the state-of-the-art methods on all four benchmarks, which demonstrates its effectiveness and robustness.
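To make the N-way K-shot protocol concrete, here is a small sketch of how one evaluation task might be sampled. The function name `sample_episode`, the toy dataset, and the number of query samples per class (`n_query=15`) are assumptions for illustration; the text only defines N and K.

```python
import random


def sample_episode(items_by_class, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one N-way K-shot task: K labeled support items per class plus held-out queries."""
    rng = random.Random(seed)
    # Pick N novel classes, then K support and n_query query items from each.
    classes = rng.sample(sorted(items_by_class), n_way)
    support, query = [], []
    for label in classes:
        items = rng.sample(items_by_class[label], k_shot + n_query)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return support, query


# Toy dataset: 10 classes with 20 placeholder image names each.
dataset = {f"class_{i}": [f"img_{i}_{j}" for j in range(20)] for i in range(10)}
support, query = sample_episode(dataset, n_way=5, k_shot=1, n_query=15, seed=0)
print(len(support), len(query))
```

Accuracy is then typically averaged over many such tasks, with the support set used to build prototypes and the query set used to score predictions.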

[1]: Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, 2017.

[2]: Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. NeurIPS, 2017.

[3]: Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.

[4]: Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? ECCV, 2020.

[5]: Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. NeurIPS, 2020.