with heuristic or less adaptive solutions. Specifically, Yang et al. [14] use the average features of the samples as the representation of a base class and select the top-$k$ (e.g., $k=2$) closest base classes based on the Euclidean distance between the features of a novel sample and a base class. Despite the effectiveness of Yang et al. [14], it is questionable whether the Euclidean distance is the proper metric to measure the closeness between a base class and a novel sample, since viewing a novel sample and a base class as points in the same space may not be the best solution. Moreover, it is less sound to characterize a base class only by the unweighted average over all its samples when measuring its closeness to the novel sample: representing a base class in this way completely ignores the fact that each sample of a base class may contribute to the classification boundary differently. Finally, it may also be less effective to treat each of the top-$k$ base classes equally, as their contributions can also differ, not to mention the omission of the other base classes.
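For concreteness, the selection rule described above can be sketched as follows. This is our minimal illustration, not code from [14]; the function and variable names are ours, and we assume features have already been extracted:

```python
import numpy as np

def topk_base_classes(novel_feat, base_feats, k=2):
    """Select the k base classes whose unweighted mean features are
    closest (in Euclidean distance) to a novel sample's feature."""
    means = np.stack([f.mean(axis=0) for f in base_feats])  # (N_base, d) class means
    dists = np.linalg.norm(means - novel_feat, axis=1)      # (N_base,) distances
    return np.argsort(dists)[:k]                            # indices of the k closest
```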
To this end, this work develops a more adaptive distribution calibration method leveraging optimal transport (OT), a powerful tool for measuring the cost of transporting the mass in one distribution to match another given a specific point-to-point cost function. First, we formulate a distribution $P$ over the base classes and a distribution $Q$ over the labeled samples from the novel classes. With such a formulation, transferring the statistics from the base classes to the novel samples can be viewed as an OT problem between the two distributions, denoted as the high-level OT. By solving the high-level OT, the learned transport plan can be used as the similarity or closeness between novel samples and base classes. Since the high-level OT requires specifying the cost function between one base class and one novel sample, we further introduce a low-level OT problem to learn this cost automatically, where we formulate a base class as a distribution over its samples. In this way, the similarity between a novel sample and a base class is no longer computed by representing the base class as the unweighted average over all its samples and then taking the Euclidean distance; instead, the weights of the samples are considered in a principled way. In summary, the statistics of base classes can be better transferred to the novel samples, providing a more effective way to measure the similarity between them. Notably, even in the challenging cross-domain few-shot learning setting, our method can still effectively transfer statistics from the source domain to the target domain.
We refer to this adaptive distribution calibration method as a novel hierarchical OT method (H-OT) for few-shot learning, which is applicable to a range of semi-supervised and supervised tasks, such as few-shot classification [9] and domain adaptation [5]. Our contributions are summarized as follows: (1) We develop a new distribution calibration method for few-shot learning, which can be built on top of an arbitrary pre-trained feature extractor and operates at the feature level, without further costly fine-tuning. (2) We formulate the task of transferring statistics from base classes to novel classes in distribution calibration as an H-OT problem and tackle it with a principled solution. (3) We apply our method to few-shot classification and also explore its cross-domain generalization ability. Experiments on standard benchmarks demonstrate that introducing H-OT into distribution calibration methods learns an adaptive weight matrix, paving a new way to transfer the statistics of base classes to novel samples.
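To make the two-level formulation concrete, below is a schematic sketch of one plausible instantiation under our own assumptions, not the paper's exact algorithm: we use the POT library's `ot.sinkhorn` and `ot.dist`, take uniform weights for $P$, $Q$, and each class's samples, and read the high-level cost for each (base class, novel sample) pair off a low-level plan between the class's samples and the novel samples:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def h_ot_plan(base_feats, novel_feats, reg=0.1):
    """Low level: for each base class, an OT plan between its samples and the
    novel samples re-weights each sample's contribution; the resulting per-pair
    expected costs form the high-level cost matrix. High level: OT between P
    (base classes) and Q (novel samples) under that learned cost."""
    n_novel = novel_feats.shape[0]
    C_high = np.zeros((len(base_feats), n_novel))
    for i, X in enumerate(base_feats):               # X: (n_i, d) features of class i
        M = ot.dist(X, novel_feats)                  # pairwise squared Euclidean costs
        a = np.full(X.shape[0], 1 / X.shape[0])      # uniform weights over samples
        b = np.full(n_novel, 1 / n_novel)
        T_low = ot.sinkhorn(a, b, M, reg * M.max()) # reg scaled to cost magnitude
        # expected cost of serving novel sample j from class i under the plan
        C_high[i] = (T_low * M).sum(axis=0) * n_novel
    p = np.full(len(base_feats), 1 / len(base_feats))
    q = np.full(n_novel, 1 / n_novel)
    return ot.sinkhorn(p, q, C_high, reg)            # high-level plan = similarity matrix
```

In this reading, the returned high-level plan plays the role of the adaptive weight matrix: each entry weighs how much a base class's statistics should inform a particular novel sample, instead of a hard top-$k$ cutoff.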
2 Background
2.1 Optimal Transport Theory
Optimal transport (OT) is a powerful tool for the comparison of probability distributions, which has been widely used in various machine learning problems, such as generative models [15], text analysis [16, 17], adversarial robustness [18], and imbalanced classification [19]. Here we limit our discussion to OT for discrete distributions and refer the reader to Peyré and Cuturi [20] for more details. Denote $p = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $q = \sum_{j=1}^{m} b_j \delta_{y_j}$ as two $n$- and $m$-dimensional discrete probability distributions, respectively. In this case, $a \in \Delta^n$ and $b \in \Delta^m$, where $\Delta^m$ denotes the probability simplex of $\mathbb{R}^m$. The OT distance between $p$ and $q$ is defined as
$$\mathrm{OT}(p, q) = \min_{T \in \Pi(p,q)} \langle T, C \rangle, \qquad (1)$$
where $\langle \cdot, \cdot \rangle$ denotes the Frobenius dot-product; $C \in \mathbb{R}_{\geq 0}^{n \times m}$ is the transport cost matrix with element $C_{ij} = C(x_i, y_j)$; and $T \in \mathbb{R}_{> 0}^{n \times m}$ denotes the transport probability matrix such that $\Pi(p, q) := \{ T \mid \sum_{i=1}^{n} T_{ij} = b_j, \; \sum_{j=1}^{m} T_{ij} = a_i \}$, meaning that $T$ has to be a joint distribution of $p$ and $q$. As directly optimizing Equation (1) can be time-consuming for large-scale problems, the entropic regularization $H(T) = -\sum_{ij} T_{ij} \ln T_{ij}$ is introduced in Cuturi [21], resulting in the widely used Sinkhorn algorithm, which solves discrete OT problems with reduced complexity.
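For reference, below is a minimal NumPy sketch of the Sinkhorn iterations; this is our illustration, with `eps` and the iteration count chosen arbitrarily, and production code should work in log space for numerical stability:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """Entropic-regularized OT: approximately solves Eq. (1) with the
    regularizer H(T) = -sum_ij T_ij ln T_ij by alternately rescaling the
    Gibbs kernel so the plan matches both marginals (Cuturi [21])."""
    K = np.exp(-C / eps)                # Gibbs kernel, shape (n, m)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)               # enforce column marginals: sum_i T_ij = b_j
        u = a / (K @ v)                 # enforce row marginals:    sum_j T_ij = a_i
    T = u[:, None] * K * v[None, :]     # transport plan T = diag(u) K diag(v)
    return T, float((T * C).sum())      # plan and transport cost <T, C>

# Toy usage: two random point clouds with normalized squared-distance costs.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
C /= C.max()                            # rescale costs to avoid underflow in exp
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)
T, cost = sinkhorn(a, b, C)
assert np.allclose(T.sum(axis=1), a, atol=1e-6)  # marginal constraint holds
```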