with heuristic or less adaptive solutions. Specifically, Yang et al. [14] use the average features of the samples as the representation of a base class and select the top-$k$ (e.g., $k=2$) closest base classes based on the Euclidean distance between the features of a novel sample and a base class. Despite the effectiveness of Yang et al. [14], it is questionable whether the Euclidean distance is the proper metric to measure the closeness between a base class and a novel sample, since viewing a novel sample and a base class as points in the same space may not be the best solution. Moreover, it is less sound to characterize a base class only by the unweighted average over all its samples when measuring its closeness to the novel sample: representing a base class in this way completely ignores the fact that each sample of a base class may contribute to the classification boundary differently. Finally, it may also be less effective to treat each of the top-$k$ base classes equally, as their contributions can also differ, not to mention the omission of the other base classes.
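For concreteness, the selection rule described above can be sketched as follows. This is our minimal illustration, not code from [14]; the function and variable names are ours, and we assume features have already been extracted:

```python
import numpy as np

def topk_base_classes(novel_feat, base_feats, k=2):
    """Select the k base classes whose unweighted mean features are
    closest (in Euclidean distance) to a novel sample's feature."""
    means = np.stack([f.mean(axis=0) for f in base_feats])  # (N_base, d) class means
    dists = np.linalg.norm(means - novel_feat, axis=1)      # (N_base,) distances
    return np.argsort(dists)[:k]                            # indices of the k closest
```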
To this end, this work develops a more adaptive distribution calibration method leveraging optimal transport (OT), a powerful tool for measuring the cost of transporting the mass in one distribution to match another given a specific point-to-point cost function. First, we formulate a distribution $P$ over the base classes and a distribution $Q$ over the labeled samples from the novel classes. With such a formulation, transferring the statistics from the base classes to the novel samples can be viewed as an OT problem between the two distributions, denoted as the high-level OT. By solving the high-level OT, the learned transport plan can be used as the similarity or closeness between novel samples and base classes. Since the high-level OT requires specifying the cost function between one base class and one novel sample, we further introduce a low-level OT problem to learn this cost automatically, where we formulate a base class as a distribution over its samples. In this way, the similarity between a novel sample and a base class is no longer computed by representing the base class as the unweighted average over all its samples and then taking the Euclidean distance; instead, the weights of the samples are considered in a principled way. In summary, the statistics of base classes can be better transferred to the novel samples, providing a more effective way to measure the similarity between them. Notably, even in the challenging cross-domain few-shot learning setting, our method can still effectively transfer statistics from the source domain to the target domain.
We refer to this adaptive distribution calibration method as a novel hierarchical OT method (H-OT) for few-shot learning, which is applicable to a range of semi-supervised and supervised tasks, such as few-shot classification [9] and domain adaptation [5]. Our contributions are summarized as follows: (1) We develop a new distribution calibration method for few-shot learning, which can be built on top of an arbitrary pre-trained feature extractor and operates at the feature level, without further costly fine-tuning. (2) We formulate the task of transferring statistics from base classes to novel classes in distribution calibration as an H-OT problem and tackle it with a principled solution. (3) We apply our method to few-shot classification and also explore its cross-domain generalization ability. Experiments on standard benchmarks demonstrate that introducing H-OT into distribution calibration methods learns an adaptive weight matrix, paving a new way to transfer the statistics of base classes to novel samples.
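To make the two-level formulation concrete, below is a schematic sketch of one plausible instantiation under our own assumptions, not the paper's exact algorithm: we use the POT library's `ot.sinkhorn` and `ot.dist`, take uniform weights for $P$, $Q$, and each class's samples, and read the high-level cost for each (base class, novel sample) pair off a low-level plan between the class's samples and the novel samples:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def h_ot_plan(base_feats, novel_feats, reg=0.1):
    """Low level: for each base class, an OT plan between its samples and the
    novel samples re-weights each sample's contribution; the resulting per-pair
    expected costs form the high-level cost matrix. High level: OT between P
    (base classes) and Q (novel samples) under that learned cost."""
    n_novel = novel_feats.shape[0]
    C_high = np.zeros((len(base_feats), n_novel))
    for i, X in enumerate(base_feats):               # X: (n_i, d) features of class i
        M = ot.dist(X, novel_feats)                  # pairwise squared Euclidean costs
        a = np.full(X.shape[0], 1 / X.shape[0])      # uniform weights over samples
        b = np.full(n_novel, 1 / n_novel)
        T_low = ot.sinkhorn(a, b, M, reg * M.max()) # reg scaled to cost magnitude
        # expected cost of serving novel sample j from class i under the plan
        C_high[i] = (T_low * M).sum(axis=0) * n_novel
    p = np.full(len(base_feats), 1 / len(base_feats))
    q = np.full(n_novel, 1 / n_novel)
    return ot.sinkhorn(p, q, C_high, reg)            # high-level plan = similarity matrix
```

In this reading, the returned high-level plan plays the role of the adaptive weight matrix: each entry weighs how much a base class's statistics should inform a particular novel sample, instead of a hard top-$k$ cutoff.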
2 Background
2.1 Optimal Transport Theory
Optimal transport (OT) is a powerful tool for the comparison of probability distributions, which has been widely used in various machine learning problems, such as generative models [15], text analysis [16, 17], adversarial robustness [18], and imbalanced classification [19]. Here we limit our discussion to OT for discrete distributions and refer the reader to Peyré and Cuturi [20] for more details. Denote $p = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $q = \sum_{j=1}^{m} b_j \delta_{y_j}$ as two $n$- and $m$-dimensional discrete probability distributions, respectively. In this case, $a \in \Delta^n$ and $b \in \Delta^m$, where $\Delta^m$ denotes the probability simplex of $\mathbb{R}^m$. The OT distance between $p$ and $q$ is defined as
$$\mathrm{OT}(p, q) = \min_{T \in \Pi(p,q)} \langle T, C \rangle, \qquad (1)$$
where $\langle \cdot, \cdot \rangle$ denotes the Frobenius dot-product; $C \in \mathbb{R}_{\geq 0}^{n \times m}$ is the transport cost matrix with element $C_{ij} = C(x_i, y_j)$; and $T \in \mathbb{R}_{> 0}^{n \times m}$ denotes the transport probability matrix such that $\Pi(p, q) := \{ T \mid \sum_{i=1}^{n} T_{ij} = b_j, \; \sum_{j=1}^{m} T_{ij} = a_i \}$, meaning that $T$ has to be a joint distribution of $p$ and $q$. As directly optimizing Equation (1) can be time-consuming for large-scale problems, the entropic regularization $H(T) = -\sum_{ij} T_{ij} \ln T_{ij}$ is introduced in Cuturi [21], resulting in the widely used Sinkhorn algorithm, which solves discrete OT problems with reduced complexity.
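For reference, below is a minimal NumPy sketch of the Sinkhorn iterations; this is our illustration, with `eps` and the iteration count chosen arbitrarily, and production code should work in log space for numerical stability:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """Entropic-regularized OT: approximately solves Eq. (1) with the
    regularizer H(T) = -sum_ij T_ij ln T_ij by alternately rescaling the
    Gibbs kernel so the plan matches both marginals (Cuturi [21])."""
    K = np.exp(-C / eps)                # Gibbs kernel, shape (n, m)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)               # enforce column marginals: sum_i T_ij = b_j
        u = a / (K @ v)                 # enforce row marginals:    sum_j T_ij = a_i
    T = u[:, None] * K * v[None, :]     # transport plan T = diag(u) K diag(v)
    return T, float((T * C).sum())      # plan and transport cost <T, C>

# Toy usage: two random point clouds with normalized squared-distance costs.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
C /= C.max()                            # rescale costs to avoid underflow in exp
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)
T, cost = sinkhorn(a, b, C)
assert np.allclose(T.sum(axis=1), a, atol=1e-6)  # marginal constraint holds
```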