Towards Practical Few-Shot Query Sets:
Transductive Minimum Description Length Inference
Ségolène Martin
Université Paris-Saclay, Inria,
CentraleSupélec, CVN
Malik Boudiaf
ÉTS Montreal
Emilie Chouzenoux
Université Paris-Saclay, Inria,
CentraleSupélec, CVN
Jean-Christophe Pesquet
Université Paris-Saclay, Inria,
CentraleSupélec, CVN
Ismail Ben Ayed
ÉTS Montreal
Abstract
Standard few-shot benchmarks are often built upon simplifying assumptions on the
query sets, which may not always hold in practice. In particular, for each task at
testing time, the classes effectively present in the unlabeled query set are known
a priori, and correspond exactly to the set of classes represented in the labeled
support set. We relax these assumptions and extend current benchmarks, so that the
query-set classes of a given task are unknown, but just belong to a much larger set of
possible classes. Our setting could be viewed as an instance of the challenging yet
practical problem of extremely imbalanced K-way classification, K being much
larger than the values typically used in standard benchmarks, and with potentially
irrelevant supervision from the support set. Expectedly, our setting incurs drops in
the performances of state-of-the-art methods. Motivated by these observations, we
introduce a
P
rim
A
l
D
ual Minimum
D
escription
LE
ngth (
PADDLE
) formulation,
which balances data-fitting accuracy and model complexity for a given few-shot
task, under supervision constraints from the support set. Our constrained MDL-like
objective promotes competition among a large set of possible classes, preserving
only the effective classes that better fit the data of a few-shot task. It is
hyperparameter-free, and can be applied on top of any base-class training. Furthermore,
we derive a fast block coordinate descent algorithm for optimizing our objective,
with convergence guarantee, and a linear computational complexity at each iteration.
Comprehensive experiments over the standard few-shot datasets and the more
realistic and challenging i-Nat dataset show highly competitive performances of
our method, more so when the numbers of possible classes in the tasks increase. Our
code is publicly available at https://github.com/SegoleneMartin/PADDLE.
1 Introduction
The performance of deep learning models is often seriously affected when tackling new tasks with
limited supervision, i.e., classes that were unobserved during training and for which we have only a
handful of labeled examples. Few-shot learning [1, 2, 3] focuses on this generalization challenge,
which occurs in a breadth of applications. In standard few-shot settings, a deep network is first trained
E. Chouzenoux acknowledges support from the European Research Council Starting Grant MAJORIS
ERC-2019-STG-850925.
The work of J.-C. Pesquet is supported by the ANR Chair in AI BRIDGEABLE.
The work of I. Ben Ayed is supported by the DATAIA Institute, and is part of his sabbatical-leave visit at
the Université Paris-Saclay.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.14545v1 [cs.LG] 26 Oct 2022
on a large-scale dataset including labeled instances sampled from an initial set of classes, commonly
referred to as the base classes. Subsequently, for novel classes, unobserved during the base training,
supervision is restricted to a limited number of labeled instances per class. During the test phase,
few-shot methods are evaluated over individual tasks, each including a small batch of unlabeled test
samples (the query set) and a few labeled instances per novel class (the support set).
The transductive approach has become increasingly popular in few-shot learning, and a large body
of recent methods focused on this setting, including, for instance, those based on graph regularization
[4, 5], optimal transport [6, 7], feature transformations [8, 9], information maximization [10, 11, 12]
and transductive batch normalization [13, 2], among other works [14, 15, 9, 16, 17]. Transductive
few-shot classifiers make joint predictions for the batch of query samples of each few-shot task,
independently of the other tasks. Unlike inductive inference, in which prediction is made for one
testing sample at a time, transduction accounts for the statistics of the query set of a task⁴, typically
yielding substantial improvements in performance. On standard benchmarks, the gap in classification
accuracy between transductive and inductive few-shot methods may reach 10%; see [11], for example.
This connects with a well-known fact in the classical literature on transductive inference [19, 20, 21],
which prescribes transduction as an effective way to mitigate the difficulty inherent to learning from
limited labels. It is worth mentioning that transductive methods inherently depend on the statistical
properties of the query sets. For instance, the recent studies in [22, 10] showed that variations in the
class balance within the query set may affect the performances of transductive methods.
Transductive few-shot classification occurs in a variety of practical scenarios, in which we naturally
have access to a batch of unlabeled samples at test time. In commonly used few-shot benchmarks,
the query set of each task is small (fewer than 100 examples), and is sampled from a limited number
of classes (typically 5). Those choices are relevant in practice: During test time, one typically has
access to small unlabeled batches of potentially correlated (non-i.i.d.) samples, e.g., smart-device
photos taken at a given time, video-stream sequences, or in pixel prediction tasks such as semantic
segmentation [12], where only a handful of classes appear in the testing batch. However, the standard
few-shot benchmarks are built upon further assumptions that may not always hold in practice: the
few classes effectively present in the unlabeled query set are assumed both to be known beforehand
and to match exactly the set of classes of the labeled support set. We relax these assumptions, and
extend current benchmarks so that the query-set classes are unknown and do not match exactly the
support-set classes, but just belong to a much larger set of possible classes. Specifically, we allow the
total number of classes represented in the support set to be higher than typical values, while keeping
the number of classes that are effectively present in the query set to be much smaller.
Our challenging yet practical setting raises several difficulties for state-of-the-art transductive few-
shot classifiers: (i) it is an instance of the problem of highly imbalanced classification; (ii) the labeled
support set includes “distraction” classes that may not actually be present in the test query samples;
(iii) we consider K-way classification tasks with K much larger than the typical values used in the
current benchmarks. We evaluated 7 of the best-performing state-of-the-art transductive few-shot
methods and, expectedly, observed drops in their performance in this challenging setting.
Motivated by these experimental observations, we introduce a Minimum Description Length (MDL)
inference, which balances data-fitting accuracy and model complexity for a given few-shot task,
subject to supervision constraints from the support set. The model-complexity term can be viewed as
a continuous relaxation of a discrete label cost, which penalizes the number of non-empty clusters
in the solution, fitting the data of a given task with as few unique labels as necessary. It encourages
competition among a large set of possible classes, preserving only those that better fit the task.
Our formulation is hyperparameter-free, and can be applied on top of any base-class training.
Furthermore, we derive a fast primal-dual block coordinate descent algorithm for optimizing our
objective, with convergence guarantee, and a linear computational complexity at each iteration
thanks to closed-form updates of the variables. We report comprehensive experiments and ablation
studies on mini-Imagenet, tiered-Imagenet, and the more realistic and challenging iNat dataset
[23] for fine-grained classification, with 908 classes and 227 ways at test time. Our method yields
competitive performances in comparison to the state-of-the-art, with gaps increasing substantially
with the numbers of possible classes in the tasks.
⁴ Note that the transductive setting is different from semi-supervised few-shot learning [18], which uses extra
data. The only difference between inductive and transductive inference is that predictions are made jointly for the
query set of a task, rather than one sample at a time.
2 Few-shot setting and task generation
Base training
Let $\mathcal{D}_{\mathrm{base}} = \{x_n, y_n\}_{n=1}^{|\mathcal{D}_{\mathrm{base}}|}$ denote the base dataset, where each $x_n \in \mathcal{X}$ is a
sample from some input space $\mathcal{X}$, $y_n \in \{0,1\}^{|\mathcal{Y}_{\mathrm{base}}|}$ the associated ground-truth label, and $\mathcal{Y}_{\mathrm{base}}$ the
set of base classes. Base training learns a feature extractor $f_\phi : \mathcal{X} \to \mathcal{Z}$, with parameters $\phi$ and $\mathcal{Z}$ a
lower-dimensional space. For this stage, an abundant few-shot literature adopts episodic training,
which views $\mathcal{D}_{\mathrm{base}}$ as a series of tasks (or episodes) so as to simulate testing time. Then, a meta-learner
is devised to produce the predictions. However, it has been widely established over recent years that
a basic training, followed by transfer-learning strategies, outperforms most meta-learning methods
[24, 25, 26, 4, 11]. Hence, we adopt a standard cross-entropy training in this work.
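As a toy illustration of this standard (non-episodic) cross-entropy training, the sketch below performs one gradient step for a linear classifier on raw inputs. The function and all of its names are illustrative stand-ins, not the paper's training code, which trains a deep backbone on the base classes.

```python
import numpy as np

def cross_entropy_step(theta, x, y_onehot, lr=0.1):
    """One gradient step of standard (non-episodic) cross-entropy training.

    Toy stand-in for base training of f_phi: `theta` holds the weights of a
    linear classifier; in the paper this would be a deep network trained on
    the base dataset D_base.
    """
    logits = x @ theta                                   # (batch, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)            # softmax
    grad = x.T @ (probs - y_onehot) / len(x)             # gradient of mean CE
    return theta - lr * grad
```

Iterating this step (or any stochastic variant of it) on the base classes is all the "meta-free" training stage amounts to; the learned feature extractor is then frozen for inference.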
Evaluation
Evaluation is carried out over few-shot tasks, each containing samples from $\mathcal{D}_{\mathrm{test}} = \{x_n, y_n\}_{n=1}^{|\mathcal{D}_{\mathrm{test}}|}$,
where $y_n \in \{0,1\}^{|\mathcal{Y}_{\mathrm{test}}|}$, with the constraint $\mathcal{Y}_{\mathrm{base}} \cap \mathcal{Y}_{\mathrm{test}} = \emptyset$, i.e., the test and base
classes are distinct. Each task includes a labelled support set $\mathcal{S} = \{x_n, y_n\}_{n \in \mathcal{I}_{\mathcal{S}}}$ and an unlabelled
query set $\mathcal{Q} = \{x_n\}_{n \in \mathcal{I}_{\mathcal{Q}}}$, both containing examples from classes in $\mathcal{Y}_{\mathrm{test}}$. Using the feature extractor
$f_\phi$ trained on the base data, the goal is to predict the classes of the unlabeled samples in $\mathcal{Q}$ for each
few-shot task, independently of the other tasks.
Task generation
For a given task, let $K$ denote the total number of possible classes that are
represented in the labeled support set $\mathcal{S}$, and $K_{\mathrm{eff}}$ the number of classes that effectively appear in
the unlabeled query set $\mathcal{Q}$ (i.e., classes represented by at least one sample). The standard task-generation
protocol assumes that the set of effective classes in $\mathcal{Q}$ matches exactly the set of classes in $\mathcal{S}$, i.e.,
$K = K_{\mathrm{eff}} \ll |\mathcal{Y}_{\mathrm{test}}|$, which amounts to knowing exactly the few classes that appear in the test samples.
Then, a fixed number of instances per class is sampled for each query set, forcing class balance. We
relax these assumptions, and propose a setting where the set of effective classes in $\mathcal{Q}$ is not known
exactly and belongs to a much larger set of possible classes. First, we allow the total number of
possible classes $K$ to be higher than typical values, with $K = |\mathcal{Y}_{\mathrm{test}}|$, while keeping the number of
effective classes in the query set typically much smaller: $K_{\mathrm{eff}} \ll |\mathcal{Y}_{\mathrm{test}}| = K$. For instance,
in our experiments, $K$ can go up to $227$. This, as expected, results in $K$-way problems that are
more challenging than the standard 5-way tasks used in the few-shot literature. Secondly, once the
effective classes are fixed, and given a budget of images, we sample the query set under the joint data
distribution (i.e., uniform draws among all available examples), so that the statistics of the sampled
query sets more faithfully reflect the natural distribution of classes. Figure 1 illustrates our framework
(base training, inference and task generation).
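A minimal sketch of this relaxed task-generation protocol follows. The helper name, its interface, and the `labels_test` index map are illustrative assumptions, not the paper's released sampler: every one of the $K$ possible classes gets labeled support shots, but only a small random subset of $K_{\mathrm{eff}}$ classes appears in the query set, which is drawn uniformly from the pooled examples rather than class-balanced.

```python
import numpy as np

def sample_task(labels_test, k_eff, n_shots, query_budget, rng=None):
    """Sample one few-shot task under the relaxed protocol.

    `labels_test` maps each test-class id to the indices of its images.
    All K = len(labels_test) classes receive `n_shots` support examples,
    but only `k_eff` randomly chosen classes effectively appear in the
    query set, sampled uniformly (hence naturally imbalanced).
    """
    rng = np.random.default_rng(rng)
    classes = sorted(labels_test)                  # all K possible classes
    # Support: n_shots labeled examples for every possible class.
    support = {k: rng.choice(labels_test[k], n_shots, replace=False)
               for k in classes}
    # Effective classes: a small, unknown subset of the K possible ones.
    effective = rng.choice(classes, size=k_eff, replace=False)
    # Query: uniform draws over the pooled remaining examples of the
    # effective classes, so per-class counts follow the data distribution.
    pool = np.concatenate([
        np.setdiff1d(labels_test[k], support[k]) for k in effective])
    query = rng.choice(pool, size=query_budget, replace=False)
    return support, set(effective.tolist()), query
```

Note that the sampler never reveals `effective` to the classifier at test time; only the support set and the unlabeled query indices are observed.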
3 Proposed few-shot inference formulation
For a given few-shot task, let $\mathcal{Q} = \{x_n\}_{n \in \mathcal{I}_{\mathcal{Q}}}$ and $\mathcal{S} = \{x_n, y_n\}_{n \in \mathcal{I}_{\mathcal{S}}}$ be two subsets of $\mathcal{D}_{\mathrm{test}}$ such
that $\mathcal{Q} \cap \mathcal{S} = \emptyset$. Let $N = |\mathcal{Q}| + |\mathcal{S}|$. Up to a reordering, we can suppose that $\mathcal{I}_{\mathcal{Q}} = \{1, \ldots, |\mathcal{Q}|\}$
and $\mathcal{I}_{\mathcal{S}} = \{|\mathcal{Q}| + 1, \ldots, N\}$. We denote by $z_n = f_\phi(x_n)$ the feature vector corresponding to the data
sample $x_n$, and by $u_n = (u_{n,k})_{1 \le k \le K} \in \{0,1\}^K$ the variable assigning the data to one of the possible
classes in $\{1, \ldots, K\}$, i.e., $u_{n,k} = 1$ if $x_n$ belongs to class $k$ and $0$ otherwise. We define the variable
giving the class proportions, $\hat{u} = (\hat{u}_k)_{1 \le k \le K} \in \Delta_K$, as
$$\hat{u}_k = \frac{1}{|\mathcal{Q}|} \sum_{n=1}^{|\mathcal{Q}|} u_{n,k}, \quad k \in \{1, \ldots, K\}, \qquad (1)$$
where $\Delta_K$ is the unit simplex of $\mathbb{R}^K$.
We propose to cast transductive few-shot inference as the minimization of an objective balancing
data-fitting accuracy and partition complexity, subject to supervision constraints (known labels) from
the support set $\mathcal{S}$. Our approach estimates jointly $U = (u_n)_{1 \le n \le N}$, which defines a partition of
Figure 1: Overview of the proposed framework: Base training, PADDLE inference and task genera-
tion. The example depicted in the right-hand side illustrates how the support and query classes do not
match exactly, unlike in standard few-shot settings. The support set includes “distraction” classes that
may not actually be present in the query set, e.g. classes “Alligator”, “Peacock” and “Butterfly”. All
possible classes ($K$ classes) are represented in the support set, but only an unknown subset among
these $K$ possible classes ($K_{\mathrm{eff}}$ effective classes) appear in the query set, with $K_{\mathrm{eff}} \ll K$.
$\mathcal{S} \cup \mathcal{Q}$, and the class prototypes $W = (w_k)_{1 \le k \le K} \in (\mathbb{R}^d)^K$, through the following problem:
$$\operatorname*{minimize}_{U,\, W} \quad \underbrace{\frac{1}{2} \sum_{k=1}^{K} \sum_{n=1}^{N} u_{n,k} \, \| w_k - z_n \|^2}_{\text{data-fitting accuracy}} \; \underbrace{-\, \lambda \sum_{k=1}^{K} \hat{u}_k \ln(\hat{u}_k)}_{\text{partition complexity}}, \qquad (2)$$
$$\text{s.t.} \quad u_n \in \Delta_K \;\; \forall n \in \{1, \ldots, |\mathcal{Q}|\}, \qquad u_n = y_n \;\; \forall n \in \{|\mathcal{Q}| + 1, \ldots, N\}. \qquad (\mathrm{C})$$
Note that, in the second line of (2), we have relaxed the integer constraints on the query assignments to
facilitate optimization.
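A sketch of evaluating objective (2) on (relaxed) soft assignments, assuming the partition-complexity term is weighted by a scalar `lam`; the helper and its interface are illustrative, not the paper's implementation, which optimizes this objective with a primal-dual block coordinate descent.

```python
import numpy as np

def paddle_objective(U, W, Z, n_query, lam=1.0):
    """Evaluate the objective of (2): data-fitting + partition complexity.

    U: (N, K) assignments (rows on the simplex for the query samples,
       one-hot for the support); W: (K, d) class prototypes;
    Z: (N, d) features; the first `n_query` rows of U/Z are the query set.
    """
    # Data-fitting: (1/2) * sum_k sum_n u_{n,k} ||w_k - z_n||^2
    sq_dists = ((Z[:, None, :] - W[None, :, :]) ** 2).sum(-1)  # (N, K)
    data_fit = 0.5 * (U * sq_dists).sum()
    # Class proportions over the query set, as in eq. (1).
    u_hat = U[:n_query].mean(axis=0)
    # Partition complexity: -sum_k u_hat_k ln(u_hat_k), the Shannon
    # entropy of the proportions, with the convention 0 * ln(0) = 0.
    logs = np.log(u_hat, where=u_hat > 0, out=np.zeros_like(u_hat))
    entropy = -(u_hat * logs).sum()
    return data_fit + lam * entropy
```

When the query assignments concentrate on a single class the second term vanishes, so minimizing the sum trades clustering quality against the number of classes actually used.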
Effect of each term
The purpose of objective (2) is to classify the data of a few-shot task with as
few unique labels as necessary. The data-fitting term has the same form as the standard $K$-means
objective for clustering, but is constrained with supervision from the support-set labels. This term
evaluates, within each class $k$, the deviation of the features from the class prototype $w_k$, thereby
encouraging consistency among samples of the same class. The partition-complexity term implicitly
penalizes the number of effective (non-empty) classes that appear in the solution ($K_{\mathrm{eff}} \le K$),
encouraging low-cardinality partitions of the query set $\mathcal{Q}$. This term is the Shannon entropy of the
class proportions within $\mathcal{Q}$: it reaches its minimum when all the samples of $\mathcal{Q}$ belong to a single
class, i.e., there exists $j \in \{1, \ldots, K\}$ such that $\hat{u}_j = 1$ and all other proportions vanish. It achieves
its maximum for perfectly balanced partitions of $\mathcal{Q}$, satisfying $\hat{u}_k = 1/K$ for all $k$. In practice,
this term promotes solutions that contain only a handful of effective classes among the larger set of
$K$ possible classes.
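The two extremes described above (all query mass on one class vs. a perfectly balanced partition) can be checked numerically; `partition_complexity` is an illustrative helper, not part of the released code.

```python
import numpy as np

def partition_complexity(u_hat):
    """Shannon entropy of the class proportions, i.e. the
    partition-complexity term, with the convention 0 * ln(0) = 0."""
    u = np.asarray(u_hat, dtype=float)
    logs = np.log(u, where=u > 0, out=np.zeros_like(u))
    return -(u * logs).sum()

K = 227  # e.g. the largest number of ways used in the paper's tasks
one_class = np.eye(K)[0]        # all query samples in a single class
balanced = np.full(K, 1.0 / K)  # perfectly balanced partition

assert np.isclose(partition_complexity(one_class), 0.0)       # minimum
assert np.isclose(partition_complexity(balanced), np.log(K))  # maximum
```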
Connection to Minimum Description Length (MDL)
The objective in (2) could be viewed as a partially-supervised instantiation of the general MDL principle.
Originating in information theory, MDL is widely used in statistical model selection [27]. Assume that
we want to describe some input data $Z = (z_n)_{1 \le n \le |\mathcal{Q}|}$ with a statistical model $\mathcal{M}$, among a family
of possible models. MDL prescribes that the best model corresponds to the shortest description of the
data, according to some coding scheme underlying the model. It balances data-fitting accuracy and
model complexity by minimizing $L(Z|\mathcal{M}) + \lambda L(\mathcal{M})$ w.r.t. $\mathcal{M}$. $L(Z|\mathcal{M})$ measures the code length
of the prediction of $Z$ made by $\mathcal{M}$, while $L(\mathcal{M})$ encourages simpler models (Occam's razor [27]),
e.g., models with fewer parameters. For instance, in unsupervised clustering, it is common to minimize
a discrete label cost as a measure of model complexity [28, 29, 30]. Such a label cost is the number of
effective (non-empty)