certain degree of task awareness. Such approaches exploit
the information available in the query set during the training
and testing phases [9, 54, 57] to alleviate the model's sample
bias. As a result, the model learns to generate task-specific
embeddings by better aligning the features of the support and
query samples for optimal metric-based label assignment.
Some other supervised approaches do not rely purely on
convolutional feature extractors. Instead, they use graph
neural networks (GNNs) to model instance-level and class-level
relationships [26, 37, 55, 58], owing to the fact that GNNs are
capable of exploiting the manifold structure of the novel
classes [52]. However, looking at the recent literature, one can
barely see any GNN-based architectures being used in the
unsupervised setting.
Recent unsupervised methods use a successful form of
contrastive learning [6] in their self-supervised pre-training
phase. Contrastive learning methods typically treat each
image in a batch as its own class; the only other images
that share this class are the augmentations of the image in
question. Such methods enforce similarity of representations
between pairs of an image and its augmentations (positive
pairs), while enforcing dissimilarity between all other pairs
of images (negative pairs) through a contrastive loss. Although
these methods work well, they overlook the possibility that
within a randomly sampled batch of images there could be
several images (apart from their augmentations) that in reality
belong to the same class. By applying the contrastive loss, the
network may inadvertently learn different representations for
such same-class images.
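For concreteness, the following is a minimal sketch of an NT-Xent-style contrastive loss of the kind popularized by SimCLR [6], where each image and its augmentation form the only positive pair in the batch. The temperature value and the layout of the two augmented views are illustrative assumptions, and the snippet is a simplification rather than the exact loss used by any of the cited methods.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent (SimCLR-style) contrastive loss.

    z1, z2: [N, D] embeddings of two augmented views of the same N images.
    Each image is treated as its own class: (z1[i], z2[i]) form the positive
    pair, and all remaining 2N - 2 embeddings act as negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, D]
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.shape[0]
    # Mask out self-similarities so an embedding is never its own negative.
    sim.fill_diagonal_(-1e9)
    # The positive of sample i is its other augmented view: index i + n (or i - n).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```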
To address this problem, recent methods such as SimCLR [6]
introduce larger batch sizes in the pre-training phase to
maximize the number of negative samples. However, this
approach faces two shortcomings: (i) increasingly larger
batch sizes mandate more costly training infrastructure, and
(ii) it still does not ingrain intra-class dependencies into
the network. Point (ii) applies even to more recent
approaches, such as ProtoCLR [32]. A simple yet effective
remedy for this problem is proposed in C3LR [39], where an
intermediate clustering and re-ranking step is introduced,
and the contrastive loss is adjusted accordingly to ingest
a semblance of class-cognizance. However, the problem
could be approached from a different perspective, where the
network explores the structure of the data samples within each batch.
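The snippet below illustrates the general cluster-then-contrast idea behind C3LR [39] in simplified form: batch embeddings are clustered, and the resulting pseudo-labels define additional positives for a contrastive loss. The choice of k-means, the number of clusters, and the loss form are assumptions for illustration; C3LR's re-ranking step is omitted.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_contrastive_loss(z, num_clusters=16, temperature=0.5):
    """Illustrative cluster-then-contrast step (in the spirit of C3LR [39]).

    z: [B, D] batch embeddings. Embeddings in the same cluster are treated as
    positives, so the loss no longer pushes apart batch samples that likely
    share a latent class.
    """
    z = F.normalize(z, dim=1)                              # [B, D]
    # Pseudo-labels from k-means on the current batch embeddings.
    labels = torch.as_tensor(
        KMeans(n_clusters=num_clusters, n_init=10)
        .fit_predict(z.detach().cpu().numpy())
    ).to(z.device)
    sim = z @ z.t() / temperature                          # [B, B]
    sim.fill_diagonal_(-1e9)                               # exclude self-pairs
    pos_mask = (labels[:, None] == labels[None, :]).float().fill_diagonal_(0)
    # Supervised-contrastive-style loss over cluster pseudo-labels.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos_mask * log_prob).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()
```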
We propose a novel U-FSL approach (coined SAMPTransfer)
that marries the potential of GNNs in learning the global
structure of data in the pre-training stage with the efficiency
of optimal transport (OT) for inducing task-awareness in the
subsequent fine-tuning phase. More concretely, with
SAMPTransfer we introduce a novel self-attention message
passing contrastive learning (SAMP-CLR) scheme that uses a
form of graph attention, allowing the network to learn refined
representations by looking beyond single-image instances per
batch. Furthermore, the proposed OT-based fine-tuning strategy
(which we call OpT-Tune) aligns the distributions of the support
and query samples to improve the downstream adaptability of the
pre-trained encoder, without requiring any additional parameters;
a simplified sketch of this alignment idea follows the
contribution list below. Our contributions can be summarized
as follows:
1. We propose SAMPTransfer, a novel U-FSL approach that
introduces a self-attention message passing contrastive
learning (SAMP-CLR) paradigm for unsupervised few-shot
pre-training.
2. We propose applying an optimal transport (OT) based
fine-tuning (OpT-Tune) strategy to efficiently induce
task-awareness in both the fine-tuning and inference stages.
3. We present a theoretical foundation for SAMPTransfer, as
well as extensive experimental results corroborating its
efficacy and setting a new state-of-the-art (to the best of
our knowledge) on both the miniImageNet and tieredImageNet
benchmarks; we also report competitive performance on the
challenging CDFSL benchmark [20].
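To give a rough sense of the kind of OT-based support-query alignment that OpT-Tune performs, the sketch below computes an entropic-regularized transport plan with Sinkhorn iterations and uses a barycentric projection to move support embeddings toward the query distribution. The squared-Euclidean cost, regularization strength, and barycentric mapping are illustrative assumptions and not necessarily the exact formulation adopted in this paper.

```python
import torch

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    """Entropic-regularized OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                              # Gibbs kernel
    a = torch.full((n,), 1.0 / n, device=cost.device)       # uniform support marginal
    b = torch.full((m,), 1.0 / m, device=cost.device)       # uniform query marginal
    v = torch.ones(m, device=cost.device)
    for _ in range(n_iters):
        u = a / (K @ v).clamp(min=1e-9)
        v = b / (K.t() @ u).clamp(min=1e-9)
    return u[:, None] * K * v[None, :]                      # transport plan (sums to 1)

def align_support_to_queries(support, queries, eps=0.1):
    """Move support embeddings toward the query distribution (barycentric map).

    support: [S, D] support-set embeddings; queries: [Q, D] query embeddings.
    Returns transported support embeddings with the same shape as `support`.
    """
    cost = torch.cdist(support, queries) ** 2               # pairwise squared L2 cost
    plan = sinkhorn_plan(cost, eps=eps)                      # [S, Q]
    # Barycentric projection: each support point becomes a weighted mean of queries.
    return (plan / plan.sum(dim=1, keepdim=True).clamp(min=1e-9)) @ queries
```

Note that such an alignment step introduces no learnable parameters, which is consistent with the parameter-free property claimed for OpT-Tune above.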
2. Related Work
Self-supervised learning. Self-supervised learning
(SSL) is a term used for a collection of unsupervised methods
that obtain supervisory signals from within the data itself,
typically by leveraging the underlying structure in the data.
The general technique of self-supervised learning is to
predict any unobserved part (or property) of the input from
any observed part. Several recent advances in the SSL space
have made waves by eclipsing their fully supervised
counterparts [18]. Some examples of seminal works include
SimCLR [6], BYOL [19], SwAV [5], MoCo [21], and SimSiam [7].
Our pre-training method SAMP-CLR is inspired by SimCLR [6],
ProtoTransfer [32], and C3LR [39].
Metric learning. Metric learning aims to learn a representation
function that maps the data to an embedding space. The distance
between objects in the embedding space must preserve their
similarity (or dissimilarity): similar objects are closer, while
dissimilar objects are farther apart. For example, unsupervised
methods based on some form of contrastive loss, such as
SimCLR [6] or NNCLR [15], guide objects belonging to the same
potential class to be mapped to the same point and those from
different classes to be mapped to different points. Note that in
an unsupervised setting, each image in a batch is its own class.
This process generally involves taking two crops of the same
image and encouraging the network to emit an identical
representation for the two, while ensuring that the
representations remain different from those of all other images
in a given batch. Metric learning methods have been shown to
work quite well for few-shot learning. AAL-ProtoNets [1],
ProtoTransfer [32], UMTRA [25], and certain GNN methods [37]
are excellent examples that use metric learning for few-shot
learning.
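As a concrete example of metric-based label assignment in the few-shot setting, the sketch below performs ProtoNet-style nearest-prototype classification in the embedding space; the Euclidean metric and the prototype-as-class-mean construction are standard choices assumed here for illustration, not specifics of the cited methods.

```python
import torch

def prototype_classify(support, support_labels, queries, n_way):
    """Nearest-prototype label assignment in an embedding space.

    support: [S, D] support embeddings, support_labels: [S] class ids in [0, n_way),
    queries: [Q, D] query embeddings. Returns predicted class ids of shape [Q].
    """
    # Class prototype = mean of that class's support embeddings.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_way)]
    )                                                # [n_way, D]
    # Assign each query to the class whose prototype is nearest (Euclidean metric).
    dists = torch.cdist(queries, prototypes)         # [Q, n_way]
    return dists.argmin(dim=1)
```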