
2 Related work
Learning from target models.
Learning from an expert model, i.e., the target model, has shown its effectiveness across various domains [30, 35, 65, 52]. As a follow-up, recent papers demonstrate that meta-learning can also benefit from such target models [58, 62]. However, training independent task-specific target models
is highly expensive due to the large space of task distributions in meta-learning. To this end, recent work suggests pre-training a global encoder on the whole meta-training set and fine-tuning target models on each task [32]; however, this approach is limited to specific domains and still incurs substantial computation, e.g., it takes more than 6.5 GPU hours to pre-train only 10% of the target models, while ours requires 2 GPU hours for the entire meta-learning process (ProtoNet [45] with ResNet-12 [34]) on the same GPU.
Another relevant recent work is bootstrapped meta-learning [11], which generates the target model from the meta-model by further updating the parameters of the task-specific solver for some number of steps on the query dataset. While the bootstrapped target models can be obtained efficiently, this approach is specialized to gradient-based meta-learning schemes, e.g., MAML [10]. In this paper, we suggest an efficient and more generic way to generate the target model during meta-training.
Learning with momentum networks.
The idea of temporal ensembling, i.e., the momentum network, has become an essential component of recent semi-/self-supervised learning algorithms [3, 5].
For example, Mean Teacher [50] first showed that the momentum network improves the performance of semi-supervised image classification, and recent advanced approaches [2, 46] adopted this idea to achieve state-of-the-art performance. Also, in self-supervised learning methods that enforce invariance to data augmentation, momentum networks are widely utilized as target networks [19, 16] to prevent collapse by providing smoother changes in the representations. In meta-learning, a concurrent work [6] used stochastic weight averaging [23] (an approach similar to the momentum network) to learn a low-rank representation. In this paper, we empirically demonstrate that the momentum network shows better adaptation performance compared to the original meta-model, which motivates us to utilize it for generating the target model in a compute-efficient manner.
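As a concrete illustration, a momentum network is typically maintained as an exponential moving average (EMA) of the main model's parameters. A minimal sketch follows; the function name `ema_update` and the momentum coefficient 0.999 are our illustrative choices, not details from the paper:

```python
def ema_update(ema_params, params, momentum=0.999):
    """Update each momentum-network parameter as an exponential moving
    average of the corresponding main-model parameter.

    With momentum close to 1, the momentum network changes slowly,
    providing the smoother targets discussed above."""
    return [momentum * e + (1.0 - momentum) * p
            for e, p in zip(ema_params, params)]

# Usage: after every training step of the main model,
#   ema_params = ema_update(ema_params, params)
```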
3 Problem setup and evaluation protocols
In this section, we formally describe the meta-learning setup under consideration, and the S/Q and S/T protocols studied in prior works.
Problem setup: Meta-learning.
Let $p(\tau)$ be a distribution of tasks. The goal of meta-learning is to train a meta-model $f_\theta$, parameterized by the meta-model parameter $\theta$, which can transfer its knowledge to help train a solver for a new task. More formally, we consider some adaptation subroutine $\mathrm{Adapt}(\cdot,\cdot)$ which uses both the information transferred from $\theta$ and the task-specific dataset (which we call the support set) $S_\tau$ to output a task-specific solver $\phi_\tau = \mathrm{Adapt}(\theta, S_\tau)$. For example, the model-agnostic meta-learning algorithm (MAML; [10]) uses the adaptation subroutine of taking a fixed number of SGD steps on $S_\tau$, starting from the initial parameter $\theta$. In this paper, we aim to give a general meta-learning framework that can be used in conjunction with any adaptation subroutine, instead of designing a method specialized for a specific one.
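As a toy illustration of such an adaptation subroutine, the sketch below performs a fixed number of SGD steps on a support set, starting from the meta-parameter, for a one-parameter linear regression solver. The model, learning rate, and step count are illustrative choices of ours, not the paper's setup:

```python
import numpy as np

def adapt(theta, support_set, lr=0.1, num_steps=5):
    """MAML-style Adapt(theta, S): run a fixed number of gradient steps
    on the support set, starting from the meta-parameter theta.

    Toy solver: phi(x) = w * x with squared loss, so the adapted
    task-specific solver is just the scalar weight w."""
    w = float(theta)
    xs, ys = support_set
    for _ in range(num_steps):
        # Gradient of the mean squared error (1/n) sum_i (w*x_i - y_i)^2 w.r.t. w
        grad = np.mean(2.0 * (w * xs - ys) * xs)
        w -= lr * grad
    return w
```

For a task with true relation $y = 2x$, starting from $\theta = 0$ the adapted weight moves toward 2 within a few steps, mirroring how $\phi_\tau$ specializes $\theta$ to the task.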
The objective is to learn a good meta-model parameter $\theta$ from a set of tasks sampled from $p(\tau)$ (or sometimes the task distribution itself), such that the expected loss of the task-specific adaptations is small, i.e., $\min_\theta \mathbb{E}_{\tau \sim p(\tau)}[\ell_\tau(\mathrm{Adapt}(\theta, S_\tau))]$, where $\ell_\tau(\cdot)$ denotes the test loss on task $\tau$. To train such a meta-model, we need a mechanism to evaluate and optimize $\theta$ (e.g., via gradient descent). For this purpose, existing methods take one of two approaches: the S/Q protocol or the S/T protocol.
S/Q protocol.
The majority of existing meta-learning frameworks (e.g., [55, 34]) split the task-specific training data into two sets and use them for different purposes. One is the support set $S_\tau$, which is used to perform the adaptation subroutine. The other is the query set $Q_\tau$, which is used to evaluate the performance of the adapted parameter and to compute the gradient with respect to $\theta$. In
other words, given the task datasets $(S_1, Q_1), (S_2, Q_2), \ldots, (S_N, Q_N)$,² the S/Q protocol solves
$$\min_\theta \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(\mathrm{Adapt}(\theta, S_{\tau_i}),\, Q_{\tau_i}\big), \qquad (1)$$
where $\mathcal{L}(\phi, Q)$ denotes the empirical loss of a solver $\phi$ on the dataset $Q$.
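A minimal sketch of how the empirical S/Q objective in Eq. (1) could be evaluated, with the adaptation subroutine and the query loss passed in as generic callables; this skeleton is our illustration, not the paper's implementation:

```python
def meta_objective(theta, tasks, adapt, loss):
    """Empirical S/Q objective: adapt on each support set, evaluate the
    adapted solver on the corresponding query set, average over tasks.

    `tasks` is a list of (support, query) pairs; `adapt(theta, support)`
    returns a task-specific solver phi, and `loss(phi, query)` returns
    the empirical query loss L(phi, Q)."""
    total = 0.0
    for support, query in tasks:
        phi = adapt(theta, support)   # task-specific solver phi_tau
        total += loss(phi, query)     # empirical loss L(phi_tau, Q_tau)
    return total / len(tasks)
```

In practice this quantity is then differentiated with respect to $\theta$ (through the adaptation subroutine) to update the meta-model.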
² Here, while we assumed a static batch of tasks for notational simplicity, the expression readily extends to the case of a stream of tasks drawn from $p(\tau)$.