
on a large-scale dataset including labeled instances sampled from an initial set of classes, commonly
referred to as the base classes. Subsequently, for novel classes, unobserved during the base training,
supervision is restricted to a limited number of labeled instances per class. During the test phase,
few-shot methods are evaluated over individual tasks, each including a small batch of unlabeled test
samples (the query set) and a few labeled instances per novel class (the support set).
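For concreteness, a single task in this evaluation protocol can be sketched as follows (a purely illustrative structure; the sizes, shot counts, and random features are assumptions, not tied to any particular benchmark):

import numpy as np

# Illustrative few-shot task: N novel classes ("ways"), S labeled shots per class
# in the support set, and Q unlabeled query samples to be classified jointly.
# All names and sizes below are assumptions for illustration only.
N, S, Q, D = 5, 1, 75, 640            # ways, shots, query size, feature dimension

rng = np.random.default_rng(0)
support_features = rng.normal(size=(N * S, D))   # features of the labeled support set
support_labels = np.repeat(np.arange(N), S)      # S labels for each of the N novel classes
query_features = rng.normal(size=(Q, D))         # unlabeled query set, predicted jointly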
The transduction approach has become increasingly popular in few-shot learning, and a large body of recent methods has focused on this setting, including, for instance, those based on graph regularization [4, 5], optimal transport [6, 7], feature transformations [8, 9], information maximization [10, 11, 12] and transductive batch normalization [13, 2], among other works [14, 15, 9, 16, 17]. Transductive
few-shot classifiers make joint predictions for the batch of query samples of each few-shot task,
independently of the other tasks. Unlike inductive inference, in which prediction is made for one
testing sample at a time, transduction accounts for the statistics of the query set of a task⁴, typically yielding substantial improvements in performance. On standard benchmarks, the gap in classification accuracy between transductive and inductive few-shot methods may reach 10%; see [11], for example.
This connects with a well-known fact in classical literature on transductive inference [19, 20, 21],
which prescribes transduction as an effective way to mitigate the difficulty inherent to learning from
limited labels. It is worth mentioning that transductive methods inherently depend on the statistical
properties of the query sets. For instance, recent studies [22, 10] showed that variations in the class balance within the query set may affect the performance of transductive methods.
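To make this distinction concrete, the toy sketch below contrasts per-sample inductive prediction with a simple transductive refinement that re-weights posteriors by class proportions estimated on the query batch. It only illustrates how query-set statistics (here, class balance) enter the prediction; it is not the method of any cited work:

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def inductive_predict(logits):
    # Each query sample is classified independently from its own logits.
    return logits.argmax(axis=1)

def transductive_predict(logits, n_iters=10):
    # Toy transduction: alternately estimate the class proportions of the query
    # batch and re-weight every posterior by them, so that the prediction for
    # one sample depends on the statistics of the whole query set.
    probs = softmax(logits)
    for _ in range(n_iters):
        marginals = probs.mean(axis=0)                       # query-set statistic
        probs = softmax(logits + np.log(marginals + 1e-12))  # prior re-weighting
    return probs.argmax(axis=1)

With a severely imbalanced query set, the estimated marginals shift the decisions for every sample, which illustrates why the class balance of the query set matters for transductive methods.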
Transductive few-shot classification occurs in a variety of practical scenarios, in which we naturally
have access to a batch of unlabeled samples at test time. In commonly used few-shot benchmarks,
the query set of each task is small (less than 100 examples), and is sampled from a limited number
of classes (typically 5). Those choices are relevant in practice: During test time, one typically has
access to small unlabeled batches of potentially correlated (non-i.i.d.) samples, e.g., smart-device
photos taken at a given time, video-stream sequences, or in pixel prediction tasks such as semantic
segmentation [12], where only a handful of classes appear in the testing batch. However, the standard
few-shot benchmarks are built upon further assumptions that may not always hold in practice: the
few classes effectively present in the unlabeled query set are assumed both to be known beforehand
and to match exactly the set of classes of the labeled support set. We relax these assumptions, and
extend current benchmarks so that the query-set classes are unknown and do not match exactly the
support-set classes, but just belong to a much larger set of possible classes. Specifically, we allow the
total number of classes represented in the support set to be higher than typical values, while keeping the number of classes effectively present in the query set much smaller.
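The sketch below illustrates one way such relaxed tasks could be sampled, with a support set covering K candidate classes but a query set drawn from a much smaller, undisclosed subset of them; the specific counts and the uniform query split are assumptions, not the exact protocol of our benchmarks:

import numpy as np

def sample_relaxed_task(features_by_class, K=20, shots=5, k_eff=5, q_size=75, rng=None):
    # Support set: 'shots' labeled samples for each of K candidate classes.
    # Query set: samples from only k_eff << K of those classes (unknown at test time).
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(features_by_class), size=K, replace=False)

    support_x, support_y = [], []
    for y, c in enumerate(classes):
        idx = rng.choice(len(features_by_class[c]), size=shots, replace=False)
        support_x.append(features_by_class[c][idx])
        support_y += [y] * shots

    effective = rng.choice(K, size=k_eff, replace=False)      # hidden from the model
    counts = rng.multinomial(q_size, np.ones(k_eff) / k_eff)  # per-class query counts
    query_x, query_y = [], []
    for y, n in zip(effective, counts):
        idx = rng.choice(len(features_by_class[classes[y]]), size=n, replace=False)
        query_x.append(features_by_class[classes[y]][idx])
        query_y += [int(y)] * int(n)

    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))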
Our challenging yet practical setting raises several difficulties for state-of-the-art transductive few-
shot classifiers: (i) it is an instance of the problem of highly imbalanced classification; (ii) the labeled
support set includes “distraction” classes that may not actually be present in the test query samples;
(iii) we consider K-way classification tasks with K much larger than the typical values used in the current benchmarks. We evaluated 7 of the best-performing state-of-the-art transductive few-shot methods and, as expected, observed drops in their performance in this challenging setting.
Motivated by these experimental observations, we introduce a Minimum Description Length (MDL)
inference, which balances data-fitting accuracy and model complexity for a given few-shot task,
subject to supervision constraints from the support set. The model-complexity term can be viewed as
a continuous relaxation of a discrete label cost, which penalizes the number of non-empty clusters
in the solution, fitting the data of a given task with as few unique labels as necessary. It encourages
competition among a large set of possible classes, preserving only those that best fit the task.
Our formulation is hyper-parameter free, and could be applied on top of any base-class training.
Furthermore, we derive a fast primal-dual block coordinate descent algorithm for optimizing our
objective, with a convergence guarantee and linear computational complexity at each iteration
thanks to closed-form updates of the variables. We report comprehensive experiments and ablation
studies on mini-ImageNet, tiered-ImageNet, and the more realistic and challenging iNat dataset [23] for fine-grained classification, with 908 classes and 227 ways at test time. Our method yields competitive performance in comparison to the state of the art, with gaps increasing substantially with the number of possible classes in the tasks.
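To give a flavour of the MDL inference introduced above, a schematic objective over soft assignments u_{qk} of query samples to the K candidate classes might read as follows; this is a sketch written from the description in this section (with the discrete label cost that our formulation relaxes into a continuous term), not the exact objective of our method:

\[
\min_{u}\;\; -\sum_{q \in \mathcal{Q}} \sum_{k=1}^{K} u_{qk}\,\log p_k(\mathbf{x}_q)
\;+\; \Bigl|\bigl\{\, k \;:\; \textstyle\sum_{q \in \mathcal{Q}} u_{qk} > 0 \,\bigr\}\Bigr|
\quad \text{s.t.}\quad u_q \in \Delta_K \;\;\forall q \in \mathcal{Q}, \qquad u_s = y_s \;\;\forall s \in \mathcal{S},
\]

where the first term is the data-fitting cost under class models p_k, the second is a discrete label cost counting the non-empty classes, Delta_K denotes the probability simplex, S and Q the support and query sets, and y_s the one-hot label of support sample s. In an MDL interpretation, both terms can be read as code lengths, in line with the hyper-parameter-free nature of the formulation.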
⁴ Note that the transductive setting is different from semi-supervised few-shot learning [18], which uses extra data. The only difference between inductive and transductive inference is that predictions are made jointly for the query set of a task, rather than one sample at a time.