Adaptive Distribution Calibration for Few-Shot
Learning with Hierarchical Optimal Transport
Dandan Guo1, Long Tian2, He Zhao3, Mingyuan Zhou4, Hongyuan Zha1,5
1The Chinese University of Hong Kong, Shenzhen 2Xidian University
3CSIRO’s Data61 4The University of Texas at Austin
5School of Data Science, Shenzhen Institute of Artificial Intelligence and Robotics for Society
guodandan@cuhk.edu.cn tianlong@xidian.edu.cn he.zhao@ieee.org
mingyuan.zhou@mccombs.utexas.edu zhahy@cuhk.edu.cn
Abstract
Few-shot classification aims to learn a classifier to recognize unseen classes during
training, where the learned model can easily become over-fitted based on the
biased distribution formed by only a few training examples. A recent solution
to this problem is calibrating the distribution of these few sample classes by
transferring statistics from the base classes with sufficient examples, where how to
decide the transfer weights from base classes to novel classes is the key. However,
principled approaches for learning the transfer weights have not been carefully
studied. To this end, we propose a novel distribution calibration method by learning
the adaptive weight matrix between novel samples and base classes, which is
built upon a hierarchical Optimal Transport (H-OT) framework. By minimizing
the high-level OT distance between novel samples and base classes, we can view
the learned transport plan as the adaptive weight information for transferring the
statistics of base classes. The learning of the cost function between a base class
and novel class in the high-level OT leads to the introduction of the low-level OT,
which considers the weights of all the data samples in the base class. Experimental
results on standard benchmarks demonstrate that our proposed plug-and-play model
outperforms competing approaches and exhibits the desired cross-domain generalization
ability, indicating the effectiveness of the learned adaptive weights.
1 Introduction
Deep learning models have become the regular ingredients for numerous computer vision tasks
such as image classification [1, 2] and achieve state-of-the-art performance. However, the strong
performance of deep neural networks typically relies on abundant labeled instances for training [3].
Considering the high cost of collecting and annotating a large amount of data, a major research
effort is being dedicated to fields such as transfer learning [4] and domain adaptation [5]. As a
trending research subject in the low-data regime, few-shot classification aims to learn a model on
the data from the base classes, so that the model can generalize well on the tasks sampled from
the novel classes. Several lines of work have been proposed, such as those based on meta-learning
paradigms [3, 6-11] and those directly predicting the weights of the classifiers for novel classes
[12, 13]. Recently, methods based on distribution calibration have gained increasing attention. As a
representative example, Yang et al. [14] calibrate the feature distribution of the few-sample classes by
transferring the statistics from the base classes and then utilize the sampled data to train a classifier
for novel classes. A unique advantage of distribution calibration methods over others is that they build
on top of off-the-shelf pretrained feature extractors and do not finetune/re-train the feature extractor.
The key of distribution calibration methods is to select the corresponding base classes and transfer
their statistics for the labeled samples in a novel task. Existing approaches in this line usually do so
with heuristic or less adaptive solutions. Specifically, Yang et al. [14] use the average features of the
samples as the representation of a base class and select the top-$k$ (e.g., $k = 2$) closest base classes
based on the Euclidean distance between the features of a novel sample and a base class. Despite the
effectiveness of Yang et al. [14], it is questionable whether the Euclidean distance is the proper metric
to measure the closeness between a base class and a novel sample since viewing a novel sample and
a base class as points in the same space may not be the best solution. Moreover, it is less sound to
characterize a base class only by the unweighted average over all its samples, when measuring its
closeness with the novel sample. Representing a base class in this way would completely ignore the
fact that each sample of a base class may contribute to the classification boundary differently. Finally,
it may also be less effective to treat each of the top-$k$ base classes equally, as their contributions can
also be different, not to mention the omission of the other base classes.
To this end, this work develops a more adaptive distribution calibration method leveraging optimal
transport (OT), which is a powerful tool for measuring the cost in transporting the mass in one
distribution to match another given a specific point-to-point cost function. First, we formulate a
distribution $P$ over the base classes and a distribution $Q$ over the labeled samples from the novel
classes. With such a formulation, how to transfer the statistics from the base classes to the novel
samples can be viewed as an OT problem between two distributions, denoted as the high-level OT.
By solving the high-level OT, the learned transport plan can be used as the similarity or closeness
between novel samples and base classes. Since the high-level OT requires specifying the cost function
between one base class and one novel sample, we further introduce a low-level OT problem to learn
this cost automatically, where we formulate a base class as a distribution over its samples. In this
way, the similarity between a novel sample and a base class is no longer computed by representing the
base class as the unweighted average over all its samples and then applying the Euclidean distance. In our method,
the weights of the samples are considered in a principled way. In summary, the statistics of base
classes can be better transferred to the novel samples for providing a more effective way to measure
the similarity between them. Notably, even in the challenging cross-domain few-shot learning, our
H-OT can still effectively transfer the statistics from the source domain to the target domain.
We refer to this adaptive distribution calibration method as a novel hierarchical OT method
(H-OT) for few-shot learning, which is applicable to a range of semi-supervised and supervised tasks,
such as few-shot classification [9] and domain adaptation [5]. Our contributions are summarized as
follows: (1) We develop a new distribution calibration method for few-shot learning, which can be
built on top of an arbitrary pre-trained feature extractor and implemented at the feature level,
without further costly fine-tuning. (2) We formulate the task of transferring statistics from base classes
to novel classes in distribution calibration as the H-OT problem and tackle the task with a principled
solution. (3) We apply our method to few-shot classification and also explore the cross-domain
generalization ability. Experiments on standardized benchmarks demonstrate that introducing the
H-OT into distribution calibration methods can learn an adaptive weight matrix, paving a new way to
transfer the statistics of base classes to novel samples.
2 Background
2.1 Optimal Transport Theory
Optimal Transport (OT) is a powerful tool for the comparison of probability distributions, which
has been widely used in various machine learning problems, such as generative models [15], text
analysis [16, 17], adversarial robustness [18], and imbalanced classification [19]. Here we limit
our discussion to OT for discrete distributions and refer the reader to Peyré and Cuturi [20] for
more details. Denote $p = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $q = \sum_{j=1}^{m} b_j \delta_{y_j}$ as two $n$- and $m$-dimensional discrete
probability distributions, respectively. In this case, $\boldsymbol{a} \in \Delta^{n}$ and $\boldsymbol{b} \in \Delta^{m}$, where $\Delta^{m}$ denotes the
probability simplex of $\mathbb{R}^{m}$. The OT distance between $p$ and $q$ is defined as
$$\mathrm{OT}(p, q) = \min_{\mathbf{T} \in \Pi(p, q)} \langle \mathbf{T}, \mathbf{C} \rangle, \qquad (1)$$
where $\langle \cdot, \cdot \rangle$ denotes the Frobenius dot-product; $\mathbf{C} \in \mathbb{R}_{\geq 0}^{n \times m}$ is the transport cost matrix with element
$C_{ij} = C(x_i, y_j)$; $\mathbf{T} \in \mathbb{R}_{>0}^{n \times m}$ denotes the doubly stochastic transport probability matrix such that
$\Pi(p, q) := \{\mathbf{T} \mid \sum_{i=1}^{n} T_{ij} = b_j, \sum_{j=1}^{m} T_{ij} = a_i\}$, meaning that $\mathbf{T}$ has to be one of the joint distributions
of $p$ and $q$. As directly optimising Equation (1) can be time-consuming for large-scale problems,
the entropic regularization $H(\mathbf{T}) = -\sum_{ij} T_{ij} \ln T_{ij}$ is introduced in Cuturi [21], resulting in the
widely-used Sinkhorn algorithm for discrete OT problems with reduced complexity.
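As a concrete reference for Equation (1) and its entropic relaxation, the following is a minimal NumPy sketch of the Sinkhorn iterations; the function and variable names (`sinkhorn`, `a`, `b`, `C`, `eps`) are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """Entropic-regularized OT between discrete distributions a (n,) and b (m,)
    with cost matrix C (n, m). Returns the transport plan T of shape (n, m)."""
    K = np.exp(-C / eps)                 # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # scale columns to match marginal b
        u = a / (K @ v)                  # scale rows to match marginal a
    return u[:, None] * K * v[None, :]   # plan with (approximate) marginals a and b

# toy usage: two small uniform distributions over random feature points
rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** 2
T = sinkhorn(np.full(4, 1 / 4), np.full(3, 1 / 3), C)
print(T.sum(axis=1), T.sum(axis=0), (T * C).sum())   # row/column marginals and transport cost
```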
2.2 Few-Shot Classification
Following a typical few-shot learning problem, we divide the whole dataset with labeled examples
into a base dataset $D_{\mathrm{base}}$ with $B$ base classes and a novel dataset $D_{\mathrm{novel}}$ with $N_{\mathrm{all}}$ novel classes, each
with a disjoint set of classes. To build a commonly-used $N$-way-$K$-shot task [8, 14], we randomly
sample $N$ classes from the $N_{\mathrm{all}}$ novel classes, and in each class, we only pick $K$ (e.g., 1 or 5) samples
for the support set $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N \times K}$ to train or fine-tune the model and sample $q$ instances for the
query set $Q = \{(\mathbf{x}_i, y_i)\}_{i = N \times K + 1}^{N \times K + N \times q}$ to evaluate the model. By averaging the accuracy on the query
set of multiple tasks from the novel dataset, we can evaluate the performance of a model.
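For readers less familiar with the episodic protocol, here is a small illustrative sketch of how an $N$-way-$K$-shot task could be assembled from pre-extracted novel-class features; the data layout (`features_by_class`, a dict mapping class ids to arrays of feature vectors) and the function name are assumptions made only for this example.

```python
import numpy as np

def sample_episode(features_by_class, N=5, K=1, q=15, seed=None):
    """Draw one N-way-K-shot episode: K support and q query features per class."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(list(features_by_class), size=N, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):
        feats = features_by_class[c]
        idx = rng.choice(len(feats), size=K + q, replace=False)
        support += [(feats[i], label) for i in idx[:K]]   # K shots per class
        query += [(feats[i], label) for i in idx[K:]]     # q query instances per class
    return support, query
```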
2.3 Distribution Calibration for Few-Shot Classification
Distribution calibration [14] uses the statistics of base classes to estimate the statistics of novel
samples in the support set and generate more samples. Specifically, for the $b$-th base class, its samples
are assumed to be generated from a Gaussian distribution, whose mean and covariance matrix are:
$$\boldsymbol{\mu}_b = \frac{1}{J_b}\sum_{j=1}^{J_b} \mathbf{x}_j, \qquad \boldsymbol{\Sigma}_b = \frac{1}{J_b - 1}\sum_{j=1}^{J_b} (\mathbf{x}_j - \boldsymbol{\mu}_b)(\mathbf{x}_j - \boldsymbol{\mu}_b)^{T}, \qquad (2)$$
where $b \in [1, B]$, $\mathbf{x}_j \in \mathbb{R}^{V}$ is the $V$-dimensional feature of sample $j$ extracted from the pre-trained
feature encoder, $J_b$ the number of samples in class $b$, and $\{\boldsymbol{\mu}_b, \boldsymbol{\Sigma}_b\}$ are the statistics of base class $b$.
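As a quick illustration of Equation (2), the per-class statistics can be computed directly from the pre-extracted features of one base class; this is a sketch assuming `feats` is a $J_b \times V$ array with $J_b \geq 2$.

```python
import numpy as np

def base_class_statistics(feats):
    """Mean and unbiased covariance of one base class, as in Eq. (2)."""
    mu_b = feats.mean(axis=0)                       # class mean, shape (V,)
    sigma_b = np.cov(feats, rowvar=False, ddof=1)   # covariance with 1/(J_b - 1), shape (V, V)
    return mu_b, sigma_b
```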
The samples of a novel class are also assumed to be generated from a Gaussian distribution with
mean $\boldsymbol{\mu}'$ and covariance $\boldsymbol{\Sigma}'$. As the novel class only has one or a few labeled samples, it is hard
to accurately estimate $\boldsymbol{\mu}'$ and $\boldsymbol{\Sigma}'$. Thus, the key idea is to transfer the statistics of the base classes
to calibrate the novel class’s distribution. Once the distribution of the novel class is calibrated, we
can generate more samples from it, which are useful for training a good classifier. As a result, how
to effectively transfer the statistics from the base classes is critical to the success of distribution
calibration-based methods for few-shot learning. Accordingly, Free-Lunch [14] designs a heuristic
approach that calibrates the Gaussian parameters of a novel distribution with one data sample $\mathbf{x}$:
$$\boldsymbol{\mu}' = \frac{\sum_{i \in \mathrm{topk}(S_d)} \boldsymbol{\mu}_i + \tilde{\mathbf{x}}}{k + 1}, \qquad \boldsymbol{\Sigma}' = \frac{\sum_{i \in \mathrm{topk}(S_d)} \boldsymbol{\Sigma}_i}{k} + \alpha, \qquad (3)$$
where: 1) $\tilde{\mathbf{x}}$ is the data transformed by Tukey's Ladder of Powers transformation (TLPT) [22], i.e.,
$\tilde{\mathbf{x}} = \mathbf{x}^{\lambda}$ if $\lambda \neq 0$ and $\tilde{\mathbf{x}} = \log(\mathbf{x})$ if $\lambda = 0$, which reduces the skewness of distributions and makes
them more Gaussian-like; 2) $S_d = \{ -\lVert \boldsymbol{\mu}_b - \tilde{\mathbf{x}} \rVert^2 \mid b \in [1, B] \}$ is a distance set defined by the
Euclidean distance between the transformed feature $\tilde{\mathbf{x}}$ of a novel sample in the support set and the mean
$\boldsymbol{\mu}_b$ of the base class $b$; 3) $\mathrm{topk}(S_d)$ is the operation that selects the top-$k$ closest base classes from the
set $S_d$; 4) $\alpha$ determines the degree of dispersion of features sampled from the calibrated distribution.
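Putting Equations (2) and (3) together, the hedged sketch below reproduces the Free-Lunch calibration for a single support feature: Tukey-transform the feature, select the top-$k$ closest base classes via the (negative squared) Euclidean distances in $S_d$, and average their statistics. The helper names and the defaults for `lam`, `k`, and `alpha` are illustrative choices, and adding `alpha` to every entry of the averaged covariance is our reading of Equation (3), not a statement of the authors' exact implementation.

```python
import numpy as np

def tukey_transform(x, lam=0.5):
    # TLPT; assumes non-negative features (e.g., post-ReLU embeddings)
    return np.power(x, lam) if lam != 0 else np.log(x)

def calibrate(x, base_mus, base_sigmas, k=2, alpha=0.21, lam=0.5):
    """base_mus: (B, V) class means; base_sigmas: (B, V, V) class covariances from Eq. (2)."""
    x_t = tukey_transform(x, lam)
    d = -np.linalg.norm(base_mus - x_t, axis=1) ** 2        # the set S_d of negative distances
    topk = np.argsort(d)[-k:]                               # k closest base classes
    mu_new = (base_mus[topk].sum(axis=0) + x_t) / (k + 1)   # calibrated mean, Eq. (3)
    sigma_new = base_sigmas[topk].sum(axis=0) / k + alpha   # calibrated covariance, Eq. (3)
    return mu_new, sigma_new, x_t
```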
Although effective, Free-Lunch represents a base class by the unweighted average over all its samples
when computing its closeness with a novel sample, which ignores the fact that each sample of a base
class may contribute to the classification boundary differently. In addition, the Euclidean distance
in the feature space may not well capture the relations between a base class and a novel sample.
Moreover, each selected top-$k$ base class has an equal weight (i.e., $1/k$), which may not reflect the
different contributions of the base classes and omit useful information in unselected base classes.
3 Our Proposed Model
3.1 Overall Method
In this work, we propose a novel adaptive distribution calibration framework, a holistic method for
few-shot classification. Compared to the novel classes, which only have a limited number of labeled
samples, the base classes typically have a sufficient amount of data, allowing their statistics to be
estimated more accurately. Due to the correlation between novel and base classes, it is reasonable
to use the statistics of base classes to revise the distribution of the novel sample. Therefore, the key
is how to transfer the statistics from the base classes to a novel class to achieve the best calibration
results, which is the focus of this paper. Here, we develop the H-OT to learn the transport plan
matrix between base classes and novel samples, where each element of the transport plan measures
the importance of each base class for each novel sample and more relevant classes usually have
a larger transport probability. Computing the high-level OT requires the specification of the cost
function between one base class and one novel class, which leads to the introduction of a low-level
OT problem. By viewing the learned transport plan as the adaptive weight matrix, we provide an
elegant and principled way to transfer the statistics from the base classes to novel classes.
3.2 Hierarchical OT for Few-Shot Learning
Moving beyond the Free-Lunch method [14], which uses the Euclidean distance between a novel
sample and a base class to decide their similarity and endows the chosen base classes with equal
importance, we aim to capture the correlations between the base class and novel samples at multiple
levels and transfer the related statistics from base classes to novel samples. We learn the similarity
by minimizing a high-level OT distance between base and novel classes and build the cost function
used in high-level OT by further introducing a low-level OT distance. To formulate the task as the
high-level OT problem, we model
P
as a following discrete uniform distribution over
B
base classes:
$$P = \sum_{b=1}^{B} \frac{1}{B}\,\delta_{R_b}, \qquad (4)$$
where $R_b$ represents the base class $b$ in the $V$-dimensional feature space, which will be introduced
later. Taking the $N$-way-$1$-shot task as the example, where each novel class has one labeled sample
$\mathbf{x}$, we represent $Q$ as a discrete uniform distribution over the $N$ novel classes from the support set:
$$Q = \sum_{n=1}^{N} \frac{1}{N}\,\delta_{\tilde{\mathbf{x}}_n}, \qquad \tilde{\mathbf{x}}_n \in \mathbb{R}^{V \times 1}, \qquad (5)$$
where $\tilde{\mathbf{x}}_n$ is the transformed feature from $\mathbf{x}$ following Yang et al. [14] and detailed below Equation (3).
The OT distance between $P$ and $Q$ is thus defined as $\mathrm{OT}(P, Q) = \min_{\mathbf{T} \in \Pi(P, Q)} \langle \mathbf{T}, \mathbf{C} \rangle$. We adopt a
regularised OT distance with an entropic constraint [21] and express the optimisation problem as:
$$\mathrm{OT}(P, Q) \overset{\mathrm{def.}}{=} \sum_{b,n}^{B,N} C_{bn} T_{bn} - \epsilon \sum_{b,n}^{B,N} T_{bn} \ln T_{bn}, \qquad (6)$$
where $\epsilon > 0$, $\mathbf{C} \in \mathbb{R}_{\geq 0}^{B \times N}$ is the transport cost matrix, and $C_{bn}$ indicates the cost between base
class $b$ and novel sample $n$. Importantly, the transport probability matrix $\mathbf{T} \in \mathbb{R}_{>0}^{B \times N}$ should satisfy
$\Pi(P, Q) := \{\mathbf{T} \mid \sum_{n=1}^{N} T_{bn} = \frac{1}{B}, \sum_{b=1}^{B} T_{bn} = \frac{1}{N}\}$, with element $T_{bn} = \mathrm{T}(R_b, \tilde{\mathbf{x}}_n)$, which denotes the
transport probability between the $b$-th base class and the $n$-th novel sample and is an upper-bounded
positive metric. Therefore, $T_{bn}$ provides a natural way to weight the importance of each base class,
which can be used as the class similarity matrix when calibrating the novel distribution. Hence, the
transport plan is the main thing that we want to learn from the data.
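To make the high-level problem concrete, the sketch below builds the uniform marginals of Equations (4) and (5) and reads the adaptive weights off the Sinkhorn plan of Equation (6), reusing the `sinkhorn` helper from the Section 2.1 sketch. For illustration only, the cost matrix is filled with the naive cosine cost discussed just below; in the full method $\mathbf{C}$ is instead learned via the low-level OT of Equations (7)-(8), so this should be read as a simplified stand-in rather than the paper's procedure.

```python
import numpy as np

def high_level_weights(base_mus, support_feats, eps=0.1):
    """base_mus: (B, V) base-class representations; support_feats: (N, V) transformed support features."""
    mu = base_mus / np.linalg.norm(base_mus, axis=1, keepdims=True)
    xs = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    C = 1.0 - mu @ xs.T                                     # naive cosine cost, shape (B, N)
    B, N = C.shape
    T = sinkhorn(np.full(B, 1.0 / B), np.full(N, 1.0 / N), C, eps=eps)
    # each column of T sums to 1/N; rescale so the base-class weights for one
    # novel sample sum to one (a convenience choice when calibrating)
    return T / T.sum(axis=0, keepdims=True)
```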
To compute the OT in Equation (6), we need to define the cost function $\mathbf{C}$, which is the main parameter
defining the transport distance between probability distributions and thus plays the paramount role in
learning the optimal transport plan. In terms of the transport cost matrix, a naive method is to specify
$\mathbf{C}$ with the Euclidean distance or cosine similarity between the feature of the novel sample and the
mean of the features from the base classes, such as $C_{bn} = 1 - \cos(\tilde{\mathbf{x}}_n, \boldsymbol{\mu}_b)$. However, these manually
chosen cost functions might have limited ability to measure the transport cost between a base
class and a novel sample. Besides, representing the base class only with the average of all features in
class $b$ might ignore the contributions of different samples of this class. Hence the optimal transport
plan for these cost functions might be inaccurate. To this end, we further introduce a low-level OT
optimization problem to automatically learn the transport cost function $\mathbf{C}$ in (6). Specifically, we
further treat each base class $b$ as an empirical distribution $R_b$ over the features within this class:
$$R_b = \sum_{j=1}^{J_b} p_j^{b}\, \delta_{\mathbf{x}_j^{b}}, \qquad \mathbf{x}_j^{b} \in \mathbb{R}^{V}, \qquad (7)$$
where $p_j^{b}$ is the weight of data $\mathbf{x}_j^{b}$ in base class $b$, which captures the importance of this sample and will
be described in short order. Specifically, we train a classifier parameterized by $\phi$ on the samples in
the base classes, which predicts which base class a sample belongs to. The predicted probability of sample
$j$ belonging to the base class $b$ is denoted by $s_j^{b}$, and then $[p_1^{b}, \ldots, p_{J_b}^{b}]$ is obtained by normalizing the
vector $[s_1^{b}, \ldots, s_{J_b}^{b}]$ with the Softmax function. We further define the low-level OT distance between
each distribution $R_b$ and the distribution $Q$ with an entropic constraint as
$$\mathrm{OT}(R_b, Q) \overset{\mathrm{def.}}{=} \sum_{n,j}^{N,J_b} D_{jn}^{b} M_{jn}^{b} - \epsilon \sum_{n,j}^{N,J_b} M_{jn}^{b} \ln M_{jn}^{b}, \qquad (8)$$
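The sketch below assembles the ingredients of Equations (7) and (8) for one base class $b$, again reusing the `sinkhorn` helper from Section 2.1. The sample weights $p^{b}$ come from softmax-normalized classifier scores $s^{b}$ as described above, while the ground cost $D^{b}$ is taken here as a cosine cost between the class-$b$ features and the transformed support features, which is an assumption since the excerpt does not specify it; the excerpt also ends before stating how the low-level plan $\mathbf{M}^{b}$ defines the high-level cost $C_{bn}$, so the per-sample transported cost returned below is only one plausible read-out, not the paper's exact rule.

```python
import numpy as np

def low_level_ot(class_feats, class_scores, support_feats, eps=0.1):
    """class_feats: (J_b, V); class_scores: (J_b,) scores s^b for class b; support_feats: (N, V)."""
    p_b = np.exp(class_scores - class_scores.max())
    p_b /= p_b.sum()                                         # softmax -> sample weights p^b, Eq. (7)
    f = class_feats / np.linalg.norm(class_feats, axis=1, keepdims=True)
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    D_b = 1.0 - f @ s.T                                      # assumed cosine ground cost, (J_b, N)
    N = support_feats.shape[0]
    M_b = sinkhorn(p_b, np.full(N, 1.0 / N), D_b, eps=eps)   # low-level plan, Eq. (8)
    per_sample_cost = (M_b * D_b).sum(axis=0)                # one plausible summary per novel sample
    return M_b, per_sample_cost
```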