challenge, few-shot learning (FSL) has been introduced for image classification, enabling models to learn and generalize from limited data.
The main paradigm of FSL is to train a model on the base classes and require it to accurately classify the novel classes given only a limited number of examples, a setting that remains constrained by data scarcity. Various initial lines of work study the problem of few-shot learning for image classification [22,18,9,6] and establish strong baselines to improve upon. Meta-learning was for a time the predominant approach to FSL. More recently, however, several works have adopted the standard supervised setting [20] together with various self-supervised approaches [15,10,19] to improve results. It should be noted that identifying visual categories using only class labels (numerical IDs) severely limits the contextual information available about each category, since only a limited number of examples are provided. Addressing this gap, a recent line of works [24,17,12,2] uses semantic features as prior knowledge or as an auxiliary training mechanism to enhance FSL performance. RS-FSL [2] is the most recent of these, leveraging categorical descriptions to perform few-shot image classification. In contrast, our method utilizes contrastive multimodal alignment for FSL, which, to the best of our knowledge, has not previously been used in the literature. Further, our approach aligns visual and semantic attributes at the feature level, whereas RS-FSL predicts the descriptions from a hybrid prototype. The goal of our work is to capture detailed semantic features and feed them to the visual feature extractor, which can then easily adapt to novel categories with very few examples.
In this work we study the effectiveness of contrastive learning, which has been shown to perform well [5,4] in standard self-supervised learning and has also been adapted to the multimodal setting [14,13,1]. We utilize a simple contrastive learning objective as an auxiliary training mechanism, in addition to the standard FSL baseline, to provide contextual knowledge to the model via a semantic prototype generated by a designated semantic feature extractor. We align the semantic and visual prototypes of each class during a training episode and employ the contrastive learning objective so that corresponding prototypes, regardless of modality, are embedded close to each other in the multimodal embedding space. This provides the visual feature extractor with prior knowledge of the semantic attributes of each visual category, which is crucial in few-shot image classification.
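To make the alignment concrete, the following is a minimal sketch of the auxiliary objective described above, assuming PyTorch and a symmetric InfoNCE-style loss over an episode's class prototypes; the function and variable names (e.g., contrastive_alignment_loss, temperature, lambda_align) are illustrative assumptions rather than our released implementation.

```python
# Illustrative sketch of episode-level visual-semantic contrastive alignment.
# All names and hyperparameter values here are assumptions for exposition.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(visual_protos, semantic_protos, temperature=0.1):
    """Pull together the visual and semantic prototypes of the same class
    and push apart prototypes of different classes.

    visual_protos:   (N, D) class prototypes from the visual encoder
                     (e.g., mean of support-set embeddings per class).
    semantic_protos: (N, D) class prototypes from the semantic encoder
                     (e.g., encoded class names or descriptions).
    """
    v = F.normalize(visual_protos, dim=-1)
    s = F.normalize(semantic_protos, dim=-1)

    logits = v @ s.t() / temperature            # (N, N) cross-modal similarities
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric InfoNCE: visual-to-semantic and semantic-to-visual directions.
    loss_v2s = F.cross_entropy(logits, targets)
    loss_s2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2s + loss_s2v)


if __name__ == "__main__":
    # Example: a 5-way episode with 512-dimensional prototype embeddings.
    vis = torch.randn(5, 512)   # visual prototypes for the episode's classes
    sem = torch.randn(5, 512)   # semantic prototypes for the same classes
    aux_loss = contrastive_alignment_loss(vis, sem)
    # In training, this term would be weighted and added to the FSL loss,
    # e.g. total_loss = fsl_loss + lambda_align * aux_loss (lambda_align is
    # a hypothetical weighting hyperparameter).
    print(aux_loss.item())
```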
The major contributions of this approach can be summarized as follows:
– We show that a simple contrastive alignment of visual and semantic feature vectors in the embedding space yields a generalizable visual understanding for few-shot image classification.
– We introduce an auxiliary contrastive learning objective on top of an existing FSL approach; our method is therefore generic and can be plugged into any FSL baseline.
– Our experimental results on two standard FSL benchmarks show that multimodal contrastive alignment improves the performance of standard FSL baselines.