
Figure 2: Key idea of our approach. On the server side,
we jointly learn a large teacher model and a small student
model in a meta-training framework. On the client side,
the client first performs an adaptation stage: the teacher
model is adapted to the task, and the adapted teacher then
guides the adaptation of the
student model via distillation. The adapted student model
is then used for final deployment. Different stages (meta-
training, adaptation, deployment) of this pipeline involve
different levels of computational resources.
during meta-training is the same as the one used by the
client for the final deployment. In this paper, we challenge
this basic assumption of existing meta-learning solutions.
We propose a new problem setting that takes into account
the different levels of available computational resources
during meta-training and meta-testing.
Our problem setting is motivated by a practical scenario
(shown in Fig. 1) consisting of a server and many clients.
For example, the server can be a cloud vendor that pro-
vides pretrained image classification models (possibly via
web API). On the server side, the cloud vendor may have
a large image dataset with many object classes. The cloud
vendor typically has access to significant computational re-
sources to train very large models. We also have clients
who are interested in solving some application-specific im-
age classification problems. Each client may only be inter-
ested in recognizing a handful of object classes that are po-
tentially not covered by the training dataset from the server
side. For example, one client might be a medical doctor in-
terested in recognizing different tumors in medical images,
while another client might be a retail owner interested in
classifying different merchandise in a store. Because of the
cost of acquiring labeled images, each client may only have
a small number of labeled examples for the target applica-
tion. Due to privacy concerns, clients may not want to send
their data to the cloud vendor. In this case, a natural so-
lution is for a client to re-use a pretrained model provided
by the cloud vendor and perform few-shot learning to adapt
the pretrained model to the new object classes for the target
application.
At first glance, the scenario in Fig. 1 is a classic meta-
learning problem. The cloud vendor can perform meta-
training on the server to obtain a global model. On the client
side, the client performs two steps. The first step (called
adaptation) is to adapt the global model from the server
side to the target application. For example, the adaptation
step in MAML performs a few gradient updates on the few-
shot data. After the adaptation, the second step (called de-
ployment) is to deploy the adapted model for the end ap-
plication. The combination of these two steps (adaptation
and deployment) is commonly known as “meta-testing” in
meta-learning. In this paper, we make a distinction between
adaptation and deployment since this distinction is impor-
tant for motivating our problem setting. Our key observa-
tion is that the available computing resources are vastly dif-
ferent in these different stages. The meta-training stage is
done on a server or the cloud with significant computing re-
sources. The adaptation stage is often done on a client's
local machine with moderate computing power (e.g., a
desktop or laptop). For deployment, we may only have access to
very limited computing power if the model is deployed on
an edge device. If we want to use classic meta-learning in
this case, we have to choose a small model architecture to
make sure that the final model can be deployed on the edge
device. Unfortunately, previous work [27, 34] has shown
that a small model may not have enough capacity to fully
exploit the large amount of data available during meta-
training, so the learned model may not adapt effectively
to a new task.
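To make the adaptation step concrete: in MAML, adaptation is just a few gradient updates of the meta-learned initialization on the client's few-shot support set. The following is a minimal sketch with a toy linear regression model; the MSE objective, step counts, and all variable names are illustrative assumptions, not the exact setup used in the paper or in MAML's original formulation.

```python
import numpy as np

def adapt(theta, X, y, lr=0.1, steps=50):
    """MAML-style adaptation: a few gradient updates on few-shot data.

    theta : meta-learned initialization of a linear model y_hat = X @ theta
    X, y  : the client's small labeled support set
    """
    theta = theta.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)  # gradient of MSE loss
        theta -= lr * grad
    return theta

# Toy few-shot task: 10 labeled examples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta0 = np.zeros(3)              # stand-in for the meta-learned init
theta_task = adapt(theta0, X, y)  # task-specific parameters for deployment
```

The point of the sketch is only that adaptation is cheap relative to meta-training: a handful of gradient steps on a handful of examples, feasible on a client's local machine.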
In this paper, we propose a new approach called task-
specific meta distillation to solve this problem. The key
idea of our approach is illustrated in Fig. 2. During meta-
training, we simultaneously learn a large teacher model and
a small student model. Since the teacher model has a larger
capacity, it can better adapt to a new task. During meta-
training, these two models are jointly learned in a way that
the teacher model can effectively guide the adaptation of the
student model. During the adaptation step of meta-testing,
we first adapt the teacher model to the target task, then use
the adapted teacher to guide the adaptation of the student
model via knowledge distillation. Finally, the adapted stu-
dent model is used for the final deployment. In this paper,
we apply our proposed approach to improve few-shot im-
age classification with MAML, but our technique is generally
applicable to other meta-learning tasks beyond few-shot
learning.
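The adaptation stage of this pipeline can be sketched as two phases: first adapt the teacher to the task, then adapt the student against a mixture of the ground-truth labels and the adapted teacher's predictions. The toy same-size linear models, the MSE distillation loss, and the mixing weight `alpha` below are illustrative assumptions, not the paper's exact formulation (in practice the teacher is a much larger network than the student).

```python
import numpy as np

def grad_mse(theta, X, targets):
    """Gradient of the mean-squared error of a linear model X @ theta."""
    return 2.0 * X.T @ (X @ theta - targets) / len(targets)

def adapt_with_distillation(teacher, student, X, y,
                            lr=0.1, steps=50, alpha=0.5):
    """Two-phase adaptation: teacher first, then distillation-guided student.

    alpha balances the ground-truth loss against the distillation loss.
    """
    # Phase 1: adapt the large teacher to the few-shot task.
    t = teacher.copy()
    for _ in range(steps):
        t -= lr * grad_mse(t, X, y)
    # Phase 2: adapt the small student, guided by the adapted teacher's
    # predictions (soft targets) in addition to the ground-truth labels.
    soft_targets = X @ t
    s = student.copy()
    for _ in range(steps):
        g = alpha * grad_mse(s, X, y) + (1 - alpha) * grad_mse(s, X, soft_targets)
        s -= lr * g
    return t, s  # only s is deployed; t is discarded after adaptation

# Toy few-shot task: 20 labeled examples, 4 features
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0])
teacher0, student0 = np.zeros(4), np.zeros(4)
teacher_task, student_task = adapt_with_distillation(teacher0, student0, X, y)
```

Note the resource profile this implies: both models are needed during adaptation on the client's local machine, but only the small adapted student reaches the edge device.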
The contributions of this work are manifold. First, pre-
vious work in meta-learning has largely overlooked the is-
sue of the computational resource gap at different stages
of meta-learning. This issue poses challenges in the real-
world adoption of meta-learning applications. In this pa-