Unsupervised Few-shot Learning via Deep Laplacian
Eigenmaps
Kuilin Chen
University of Toronto
kuilin.chen@mail.utoronto.ca
Chi-Guhn Lee
University of Toronto
cglee@mie.utoronto.ca
Abstract
Learning a new task from a handful of examples remains an open challenge in machine learning. Despite
the recent progress in few-shot learning, most methods rely on supervised pretraining or meta-learning
on labeled meta-training data and cannot be applied to the case where the pretraining data is unlabeled.
In this study, we present an unsupervised few-shot learning method via deep Laplacian eigenmaps. Our
method learns representation from unlabeled data by grouping similar samples together and can be intuitively
interpreted by random walks on augmented training data. We analytically show how deep Laplacian
eigenmaps avoid collapsed representation in unsupervised learning without explicit comparison between
positive and negative samples. The proposed method significantly closes the performance gap between
supervised and unsupervised few-shot learning. Our method also achieves comparable performance to
current state-of-the-art self-supervised learning methods under the linear evaluation protocol.
1 Introduction
Few-shot learning (Fei-Fei et al., 2006) aims to learn a new classification or regression model on a novel task that is not seen during training, given only a few examples in the novel task. Existing few-shot learning methods rely on either episodic meta-learning (Finn et al., 2017; Snell et al., 2017) or standard pretraining (Chen et al., 2019; Tian et al., 2020b) in a supervised manner to extract transferable knowledge for a new few-shot task. Unfortunately, these methods require many labeled meta-training samples, and acquiring a lot of labeled data is costly or even impossible in practice. Recently, several unsupervised meta-learning approaches have attempted to address this problem by constructing synthetic tasks on unlabeled meta-training data (Hsu et al., 2019; Khodadadeh et al., 2019, 2021) or by meta-training on self-supervised pretrained features (Lee et al., 2021a). However, the performance of unsupervised meta-learning approaches still falls far short of their supervised counterparts. Empirical studies in supervised pretraining show that representation learning via grouping similar samples together (Chen et al., 2019; Dhillon et al., 2020; Laenen and Bertinetto, 2021; Tian et al., 2020b) outperforms a wide range of episodic meta-learning methods, where the definition of similar samples is given by class labels. The motivation of this study is to develop an unsupervised representation learning method that groups unlabeled meta-training data without episodic training and closes the performance gap between supervised and unsupervised few-shot learning.
Contrastive self-supervised learning has shown remarkable success in learning representation from unlabeled data and is competitive with supervised learning on multiple visual tasks (Hénaff et al., 2020; Tian et al., 2020a). The common underlying theme behind contrastive learning is to pull together the representations of augmented views of the same training sample (positive samples) and disperse the representations of augmented views from different training samples (negative samples) (Wang and Isola, 2020; Wu et al., 2018). Typically, contrastive learning methods require a large number of negative samples to learn high-quality representation from unlabeled data (Chen et al., 2020; He et al., 2020), which inevitably requires a large batch size and demands significant computing resources. Non-contrastive methods try to overcome this issue by accomplishing self-supervised learning with only positive pairs. However, non-contrastive methods suffer from trivial solutions in which the model maps all inputs to the same constant vector, known as collapsed representation. Various methods have been proposed to avoid collapsed representation on an ad hoc basis, such as asymmetric network architectures (Grill et al., 2020), stop gradient (Chen and He, 2021), and feature decorrelation (Ermolov et al., 2021; Hua et al., 2021; Zbontar et al., 2021). However, theoretical understanding of how non-contrastive methods avoid collapsed representation is limited, although some preliminary attempts have been made to analyze the training dynamics of non-contrastive methods (Tian et al., 2021). Moreover, most self-supervised learning methods focus on the linear evaluation task, where the training and test data come from the same classes. They do not account for the domain gap between training and test classes, which is the case in few-shot learning and cross-domain few-shot learning.

Figure 1: A graph on augmented views of unlabeled training data. The thickness of an edge indicates the transition probability between vertices, which is proportional to their similarity. We group similar vertices together by minimizing the total transition probability between different groups.
We develop a novel unsupervised representation learning method in which a weighted graph captures unlabeled samples as nodes and the similarity among samples as edge weights. Two samples are deemed similar if they are augmentations of the same sample, and clustering of samples is accomplished by partitioning the graph. We provide an intuitive understanding of the graph partition problem from the perspective of random walks on the graph, where the transition probability between two vertices is proportional to their similarity. The optimal partition can be found by minimizing the total transition probability between clusters, which links the problem to the well-known Laplacian eigenmaps in spectral analysis (Belkin and Niyogi, 2003; Meila and Shi, 2000; Shi and Malik, 2000). We replace the locality-preserving projection in Laplacian eigenmaps with deep neural networks for better scalability and flexibility in learning from high-dimensional data such as images.
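For reference, classical Laplacian eigenmaps (Belkin and Niyogi, 2003) compute the embedding from the generalized eigenproblem $Lf = \lambda Df$. The sketch below shows this shallow spectral baseline, which our deep network replaces; it is illustrative only and does not implement the deep variant proposed in this paper. The dense similarity matrix `S` is an assumed input for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(S, k):
    """Classical (shallow) Laplacian eigenmaps on a non-negative similarity
    matrix S of shape (N, N): solve L f = lambda D f with L = D - S and embed
    each vertex with the eigenvectors of the k smallest non-trivial eigenvalues.
    Illustrative baseline only; the paper replaces this spectral projection
    with a deep neural network."""
    d = S.sum(axis=1)            # vertex degrees
    D = np.diag(d)               # degree matrix
    L = D - S                    # unnormalized graph Laplacian
    # Generalized eigenproblem; eigenvalues are returned in ascending order.
    eigvals, eigvecs = eigh(L, D)
    return eigvecs[:, 1:k + 1]   # skip the trivial constant eigenvector
```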
An additional technique is integrated into deep Laplacian eigenmaps to handle the domain gap between the meta-training and meta-testing sets. Previous studies on word embeddings (e.g., king − man + woman ≈ queen) (Mikolov et al., 2013) and disentangled generative models (Karras et al., 2019) show that interpolation between latent embeddings may correspond to the representation of a realistic sample that is not seen in the training data, whereas interpolation in the input space does not result in realistic samples. In parallel, interpolation between the distributions of the nearest two meta-training classes in the embedding space can approximate the distribution of a meta-testing class after the feature extractor is trained (Yang et al., 2021). To enhance performance on downstream few-shot learning tasks, we interpolate unlabeled meta-training samples on the data manifold to mimic unseen meta-test samples and integrate them into the unsupervised training of the feature extractor.
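As a rough illustration of this idea, the sketch below performs mixup-style interpolation between the embeddings of two unlabeled samples. The Beta-distributed mixing coefficient and the function name are illustrative assumptions, not the exact interpolation scheme used in the paper.

```python
import torch

def interpolate_embeddings(z_a, z_b, alpha=0.5):
    """Interpolate two embedding vectors (or batches of embeddings) to mimic
    samples that are unseen in the meta-training data.
    The Beta(alpha, alpha) mixing coefficient is an illustrative choice."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * z_a + (1.0 - lam) * z_b
```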
Our contributions are summarized as follows:

• A new unsupervised few-shot learning method is developed based on deep Laplacian eigenmaps, with an intuitive explanation based on random walks.

• Our loss function is analyzed to show how collapsed representation is avoided without explicit comparison to negative samples, shedding light on existing feature-decorrelation-based self-supervised learning methods.

• The proposed method significantly closes the performance gap between unsupervised and supervised few-shot learning methods.

• Our method achieves comparable performance to current state-of-the-art (SOTA) self-supervised learning methods under the linear evaluation protocol.
2 Methodology
2.1 Graph from augmented data
First, we construct a graph using augmented views of unlabeled data. Let $\bar{x} \in \mathbb{R}^d$ be a raw sample without augmentation. For image data, augmented views are created by the commonly used augmentations defined in SimCLR (Chen et al., 2020), including horizontal flip, Gaussian blur, color jittering, and random cropping. $\mathcal{X}$ denotes the set of all augmented data and $N = |\mathcal{X}|$. We represent the augmented data in the form of a weighted graph $G = (\mathcal{X}, S)$, where each $x \in \mathcal{X}$ is a vertex of the graph and $S$ denotes the edge weights. The edge between two vertices $x_i, x_j \in \mathcal{X}$ is weighted by the non-negative similarity $s_{ij}$ between them. For unrelated $x_i$ and $x_j$, the similarity $s_{ij}$ should be small. On the other hand, the similarity $s_{ij}$ should be large when $x_i$ and $x_j$ are augmentations of the same image or augmentations of two images within the same latent class. An illustrative diagram is shown in Fig. 1.
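For concreteness, the following is a minimal sketch of such a SimCLR-style augmentation pipeline written with torchvision; the crop size, jitter strengths, and blur kernel are illustrative assumptions rather than the exact settings used in the paper.

```python
from torchvision import transforms

# SimCLR-style augmentations named above: random crop, horizontal flip,
# color jitter, Gaussian blur. Parameter values are illustrative assumptions.
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(84),                           # random cropping
    transforms.RandomHorizontalFlip(p=0.5),                     # horizontal flip
    transforms.RandomApply(
        [transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # color jittering
    transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),   # Gaussian blur
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Return two augmented views of one raw sample x-bar; such a pair is
    connected by a high-similarity edge s_ij in the graph G = (X, S)."""
    return simclr_augment(pil_image), simclr_augment(pil_image)
```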
Let $d_i = \sum_{x_j \in \mathcal{X}} s_{ij}$ denote the total weight associated with $x_i$, which is the degree of the vertex $x_i$ in a weighted graph. The degree matrix $D$ is defined as the diagonal matrix with the degrees $d_1, \ldots, d_N$ on the diagonal. The volume of $\mathcal{X}$ is $\mathrm{Vol}(\mathcal{X}) = \sum_{x_i \in \mathcal{X}} d_i$. Similarly, the volume of a subset $C \subset \mathcal{X}$ is defined as $\mathrm{Vol}(C) = \sum_{x_i \in C} d_i$. Let $P = D^{-1} S$ be the random walk Laplacian of $G$, where $p_{ij} = (P)_{ij}$ represents the transition probability from $x_i$ to $x_j$, and each row of $P$ sums to 1. $L = D - S$ is the unnormalized Laplacian of $G$.
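As a quick reference, these quantities can be computed directly from a dense non-negative similarity matrix $S$, as in the numpy sketch below; materializing a full $N \times N$ matrix is an assumption made for illustration.

```python
import numpy as np

def graph_quantities(S):
    """Compute the degree matrix D, the row-stochastic matrix P = D^{-1} S
    (called the random walk Laplacian above), the unnormalized Laplacian
    L = D - S, and the stationary distribution pi from a dense non-negative
    similarity matrix S of shape (N, N)."""
    d = S.sum(axis=1)          # vertex degrees d_i = sum_j s_ij
    D = np.diag(d)             # degree matrix
    P = S / d[:, None]         # each row of P sums to 1
    L = D - S                  # unnormalized graph Laplacian
    vol = d.sum()              # Vol(X) = sum_i d_i
    pi = d / vol               # stationary distribution pi_i = d_i / Vol(X)
    return D, P, L, pi
```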
2.2 Random walks and graph partition
$\mathcal{X}$ can be grouped into $K$ clusters, where similar vertices should be grouped into the same cluster and dissimilar vertices into different clusters. This resembles supervised pretraining, which embeds samples from the same class close together. We will show later that $K$ is also the dimension of the embedding. Since we do not know the number of classes in the unlabeled training data, we set the embedding dimension to $K = 2048$, which works well on a wide range of datasets. The graph can be partitioned into clusters by minimizing the total similarity $\sum_{x_i \in C, x_j \in C'} s_{ij}$ between two clusters $C, C' \subset \mathcal{X}$ with $C \cap C' = \emptyset$. However, minimizing the total inter-cluster similarity is undesirable because it can simply separate one individual vertex from the rest of the graph. Instead, we run random walks on the vertices $\mathcal{X}$.
The random walk Laplacian $P$ defines a Markov chain on the vertices $\mathcal{X}$. The stationary distribution $\pi$ of this chain has an explicit form, $\pi_i = d_i / \mathrm{Vol}(\mathcal{X})$ for $x_i \in \mathcal{X}$ (Meila and Shi, 2000). The transition probability $P(C' \mid C) = P(X_1 \in C' \mid X_0 \in C)$ is given by

$$P(X_1 \in C' \mid X_0 \in C) = \left( \frac{1}{\mathrm{Vol}(\mathcal{X})} \sum_{x_i \in C,\, x_j \in C'} s_{ij} \right) \left( \frac{\mathrm{Vol}(C)}{\mathrm{Vol}(\mathcal{X})} \right)^{-1} = \frac{\sum_{x_i \in C,\, x_j \in C'} s_{ij}}{\mathrm{Vol}(C)} \tag{1}$$
The inter-cluster transition probability is a proper criterion because it has a small value only if $\sum_{x_i \in C, x_j \in C'} s_{ij}$ is small (low similarity for vertices in different clusters) and all clusters have sufficiently large volumes. As a result, minimization of the inter-cluster transition probability prevents trivial solutions that simply separate one individual vertex from the rest of the graph.
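The criterion in Eq. (1) can be evaluated directly from the similarity matrix. The sketch below is a literal numpy translation, assuming the two clusters are given as arrays of vertex indices (hypothetical inputs for illustration).

```python
import numpy as np

def inter_cluster_transition(S, C, C_prime):
    """Inter-cluster transition probability of Eq. (1):
    P(X1 in C' | X0 in C) = (sum_{i in C, j in C'} s_ij) / Vol(C),
    where Vol(C) = sum_{i in C} d_i. C and C_prime are index arrays."""
    d = S.sum(axis=1)                      # vertex degrees
    vol_C = d[C].sum()                     # Vol(C)
    cross = S[np.ix_(C, C_prime)].sum()    # total similarity between clusters
    return cross / vol_C
```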
Minimizing the inter-cluster transition probability is a constrained optimization problem. For the case of finding $K$ clusters, we define a matrix $Z \in \mathbb{R}^{N \times K}$, where $z_{ik}$ represents the cluster assignment of the vertex $x_i$:

$$z_{ik} = \begin{cases} 1/\sqrt{\mathrm{Vol}(C_k)} & \text{if } x_i \in C_k \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
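A literal construction of $Z$ from a hard cluster assignment might look as follows; the `clusters` vector mapping each vertex to a cluster id is a hypothetical input used only to illustrate Eq. (2).

```python
import numpy as np

def assignment_matrix(S, clusters):
    """Build Z in R^{N x K} of Eq. (2) from a hard clustering.
    `clusters[i]` is the cluster id of vertex x_i in {0, ..., K-1};
    z_ik = 1/sqrt(Vol(C_k)) if x_i is in C_k, and 0 otherwise."""
    clusters = np.asarray(clusters)
    d = S.sum(axis=1)                       # vertex degrees
    N, K = len(clusters), int(clusters.max()) + 1
    Z = np.zeros((N, K))
    for k in range(K):
        members = np.flatnonzero(clusters == k)
        Z[members, k] = 1.0 / np.sqrt(d[members].sum())   # 1 / sqrt(Vol(C_k))
    return Z
```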