certain degree of task awareness. Such approaches exploit
the information available in the query set during the training
and testing phases [9, 54, 57] to alleviate the model's sample
bias. As a result, the model learns to generate task-specific
embeddings by better aligning the features of the support and
query samples for optimal metric-based label assignment.
Some other supervised approaches do not rely purely on
convolutional feature extractors. Instead, they use graph
neural networks (GNNs) to model instance-level and class-level
relationships [26, 37, 55, 58], owing to the fact that GNNs are
capable of exploiting the manifold structure of the novel
classes [52]. However, looking at the recent literature, one can
barely see any GNN-based architectures being used in the
unsupervised setting.
Recent unsupervised methods use a successful form of
contrastive learning [6] in their self-supervised pre-training
phase. Contrastive learning methods typically treat each
image in a batch as its own class; the only other images
that share this class are the augmentations of the image in
question. Such methods enforce similarity of representations
between pairs of an image and its augmentations (positive
pairs), while enforcing dissimilarity between all other pairs
of images (negative pairs) through a contrastive loss. Although
these methods work well, they overlook the possibility that
within a randomly sampled batch of images there could be
several images (apart from their augmentations) that in reality
belong to the same class. By applying the contrastive loss, the
network may inadvertently learn different representations for
such same-class images.
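For concreteness, the following is a minimal sketch of an NT-Xent-style contrastive loss of the kind popularized by SimCLR [6], where each image and its augmentation form the only positive pair in the batch. The temperature value and the layout of the two augmented views are illustrative assumptions, and the snippet is a simplification rather than the exact loss used by any of the cited methods.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent (SimCLR-style) contrastive loss.

    z1, z2: [N, D] embeddings of two augmented views of the same N images.
    Each image is treated as its own class: (z1[i], z2[i]) form the positive
    pair, and all remaining 2N - 2 embeddings act as negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, D]
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.shape[0]
    # Mask out self-similarities so an embedding is never its own negative.
    sim.fill_diagonal_(-1e9)
    # The positive of sample i is its other augmented view: index i + n (or i - n).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```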
To address this problem, recent methods such as SimCLR [6]
introduce larger batch sizes in the pre-training phase to
maximize the number of negative samples. However, this
approach faces two shortcomings: (i) increasingly larger
batch sizes mandate more costly training infrastructure, and
(ii) it still does not ingrain intra-class dependencies into
the network. Point (ii) applies even to more recent
approaches, such as ProtoCLR [32]. A simple yet effective
remedy for this problem is proposed in C3LR [39], where an
intermediate clustering and re-ranking step is introduced,
and the contrastive loss is adjusted accordingly to ingest
a semblance of class-cognizance. However, the problem
could be approached from a different perspective, where the
network explores the structure of the data samples within each batch.
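The snippet below illustrates the general cluster-then-contrast idea behind C3LR [39] in simplified form: batch embeddings are clustered, and the resulting pseudo-labels define additional positives for a contrastive loss. The choice of k-means, the number of clusters, and the loss form are assumptions for illustration; C3LR's re-ranking step is omitted.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_contrastive_loss(z, num_clusters=16, temperature=0.5):
    """Illustrative cluster-then-contrast step (in the spirit of C3LR [39]).

    z: [B, D] batch embeddings. Embeddings in the same cluster are treated as
    positives, so the loss no longer pushes apart batch samples that likely
    share a latent class.
    """
    z = F.normalize(z, dim=1)                              # [B, D]
    # Pseudo-labels from k-means on the current batch embeddings.
    labels = torch.as_tensor(
        KMeans(n_clusters=num_clusters, n_init=10)
        .fit_predict(z.detach().cpu().numpy())
    ).to(z.device)
    sim = z @ z.t() / temperature                          # [B, B]
    sim.fill_diagonal_(-1e9)                               # exclude self-pairs
    pos_mask = (labels[:, None] == labels[None, :]).float().fill_diagonal_(0)
    # Supervised-contrastive-style loss over cluster pseudo-labels.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos_mask * log_prob).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()
```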
We propose a novel U-FSL approach (coined SAMPTransfer)
that marries the potential of GNNs in learning the global
structure of data in the pre-training stage with the efficiency
of optimal transport (OT) for inducing task-awareness in the
subsequent fine-tuning phase. More concretely, with
SAMPTransfer we introduce a novel self-attention message
passing contrastive learning (SAMP-CLR) scheme that uses a
form of graph attention, allowing the network to learn refined
representations by looking beyond single-image instances per
batch. Furthermore, the proposed OT-based fine-tuning strategy
(which we call OpT-Tune) aligns the distributions of the support
and query samples to improve the downstream adaptability of the
pre-trained encoder, without requiring any additional parameters;
a simplified sketch of this alignment idea follows the
contribution list below. Our contributions can be summarized
as follows:
1. We propose SAMPTransfer, a novel U-FSL approach that
introduces a self-attention message passing contrastive
learning (SAMP-CLR) paradigm for unsupervised few-shot
pre-training.
2. We propose applying an optimal transport (OT) based
fine-tuning (OpT-Tune) strategy to efficiently induce
task-awareness in both the fine-tuning and inference stages.
3. We present a theoretical foundation for SAMPTransfer, as
well as extensive experimental results corroborating its
efficacy and setting a new state-of-the-art (to the best of
our knowledge) on both the miniImageNet and tieredImageNet
benchmarks; we also report competitive performance on the
challenging CDFSL benchmark [20].
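To give a rough sense of the kind of OT-based support-query alignment that OpT-Tune performs, the sketch below computes an entropic-regularized transport plan with Sinkhorn iterations and uses a barycentric projection to move support embeddings toward the query distribution. The squared-Euclidean cost, regularization strength, and barycentric mapping are illustrative assumptions and not necessarily the exact formulation adopted in this paper.

```python
import torch

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    """Entropic-regularized OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                              # Gibbs kernel
    a = torch.full((n,), 1.0 / n, device=cost.device)       # uniform support marginal
    b = torch.full((m,), 1.0 / m, device=cost.device)       # uniform query marginal
    v = torch.ones(m, device=cost.device)
    for _ in range(n_iters):
        u = a / (K @ v).clamp(min=1e-9)
        v = b / (K.t() @ u).clamp(min=1e-9)
    return u[:, None] * K * v[None, :]                      # transport plan (sums to 1)

def align_support_to_queries(support, queries, eps=0.1):
    """Move support embeddings toward the query distribution (barycentric map).

    support: [S, D] support-set embeddings; queries: [Q, D] query embeddings.
    Returns transported support embeddings with the same shape as `support`.
    """
    cost = torch.cdist(support, queries) ** 2               # pairwise squared L2 cost
    plan = sinkhorn_plan(cost, eps=eps)                      # [S, Q]
    # Barycentric projection: each support point becomes a weighted mean of queries.
    return (plan / plan.sum(dim=1, keepdim=True).clamp(min=1e-9)) @ queries
```

Note that such an alignment step introduces no learnable parameters, which is consistent with the parameter-free property claimed for OpT-Tune above.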
2. Related Work
Self-supervised learning. Self-supervised learning
(SSL) is a term used for a collection of unsupervised methods
that obtain supervisory signals from within the data itself,
typically by leveraging the underlying structure in the data.
The general technique of self-supervised learning is to
predict any unobserved part (or property) of the input from
any observed part. Several recent advances in the SSL space
have made waves by eclipsing their fully supervised
counterparts [18]. Some examples of seminal works include
SimCLR [6], BYOL [19], SwAV [5], MoCo [21], and SimSiam [7].
Our pre-training method SAMP-CLR is inspired by SimCLR [6],
ProtoTransfer [32], and C3LR [39].
Metric learning. Metric learning aims to learn a representation
function that maps the data to an embedding space. The distance
between objects in the embedding space must preserve their
similarity (or dissimilarity): similar objects are closer, while
dissimilar objects are farther apart. For example, unsupervised
methods based on some form of contrastive loss, such as
SimCLR [6] or NNCLR [15], guide objects belonging to the same
potential class to be mapped to the same point and those from
different classes to be mapped to different points. Note that in
an unsupervised setting, each image in a batch is its own class.
This process generally involves taking two crops of the same
image and encouraging the network to emit an identical
representation for the two, while ensuring that the
representations remain different from those of all other images
in a given batch. Metric learning methods have been shown to
work quite well for few-shot learning. AAL-ProtoNets [1],
ProtoTransfer [32], UMTRA [25], and certain GNN methods [37]
are excellent examples that use metric learning for few-shot
learning.
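As a concrete example of metric-based label assignment in the few-shot setting, the sketch below performs ProtoNet-style nearest-prototype classification in the embedding space; the Euclidean metric and the prototype-as-class-mean construction are standard choices assumed here for illustration, not specifics of the cited methods.

```python
import torch

def prototype_classify(support, support_labels, queries, n_way):
    """Nearest-prototype label assignment in an embedding space.

    support: [S, D] support embeddings, support_labels: [S] class ids in [0, n_way),
    queries: [Q, D] query embeddings. Returns predicted class ids of shape [Q].
    """
    # Class prototype = mean of that class's support embeddings.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_way)]
    )                                                # [n_way, D]
    # Assign each query to the class whose prototype is nearest (Euclidean metric).
    dists = torch.cdist(queries, prototypes)         # [Q, n_way]
    return dists.argmin(dim=1)
```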