
META-ENSEMBLE PARAMETER LEARNING
Zhengcong Fei, Shuman Tian, Junshi Huang∗, Xiaoming Wei, Xiaolin Wei
Meituan
Beijing, China
{name}@meituan.com
ABSTRACT
Ensembles of machine learning models yield improved performance as well as
robustness. However, their memory requirements and inference costs can be pro-
hibitively high. Knowledge distillation is an approach that allows a single model
to efficiently capture the approximate performance of an ensemble, but it scales
poorly because re-training is required whenever new teacher models are introduced.
In this paper, we study whether a meta-learning strategy can be used to directly
predict the parameters of a single model with performance comparable to that of
an ensemble. To this end, we introduce WeightFormer, a Transformer-based model
that predicts student network weights layer by layer in a single forward pass,
conditioned on the teacher model parameters. The properties of WeightFormer are
investigated on the CIFAR-10, CIFAR-100, and ImageNet datasets with the
VGGNet-11, ResNet-50, and ViT-B/32 architectures, where we demonstrate that our
method achieves classification performance close to that of an ensemble and
outperforms both a single network and standard knowledge distillation. More
encouragingly, we show that WeightFormer can even exceed the average ensemble
with minor fine-tuning. Importantly, our task, together with the proposed model
and results, can potentially lead to a new, more efficient, and scalable paradigm
for learning the parameters of ensemble networks.
1 INTRODUCTION
As machine learning models are being deployed ever more widely in practice, memory cost and
inference efficiency become increasingly important (Bucilua et al., 2006; Polino et al., 2018). En-
semble methods, which train several independent models to form a decision, are well known to yield
both improved performance and reliable estimations (Perrone & Cooper, 1992; Drucker et al., 1994;
Opitz & Maclin, 1999; Dietterich, 2000; Sagi & Rokach, 2018). Despite these useful properties, using
ensembles can be computationally prohibitive. Obtaining predictions in real-time applications is
often expensive even for a single model, and the hardware requirements for serving an ensemble
scale linearly with the number of teacher models (Buizza & Palmer, 1998; Bonab & Can, 2019). As a
result, over the past several years the area of knowledge distillation has gained increasing attention
(Hinton et al., 2015; Freitag et al., 2017; Malinin et al., 2019; Lin et al., 2020; Park et al., 2021; Zhao
et al., 2022). Broadly speaking, distillation methods aim to train a single student model that
approximates the behavior of a teacher ensemble at a low computational cost. In the simplest
and most frequently used form of distillation (Hinton et al., 2015), the student model is trained to
capture the average prediction of the ensemble; in the case of image classification, for example, this
reduces to minimizing the KL divergence between the soft labels of the student model and the teacher models.
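As a concrete illustration, the following is a minimal sketch of this standard distillation objective in PyTorch, in which the student matches the averaged, temperature-softened teacher predictions; it is not the implementation used in this paper, and the function name, temperature value, and tensor arguments are hypothetical.

    import torch
    import torch.nn.functional as F

    def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=4.0):
        # Average the teachers' temperature-softened predictions to form the soft target.
        teacher_probs = torch.stack(
            [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
        ).mean(dim=0)
        # Student log-probabilities at the same temperature.
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # KL(teacher || student), with the usual T^2 rescaling of the gradient magnitude.
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

In practice, this soft-label term is typically combined with the ordinary cross-entropy loss on the ground-truth labels.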
When optimizing the parameters for a new ensemble, the typical knowledge distillation process
disregards information on the teacher models' parameters as well as past experience from distilling
teacher models. However, leveraging this training information can be the key to reducing the high
computational demands. To make progress in this direction, we propose a new task, referred to as Meta-
Ensemble Parameter Learning, in which the parameters of the distilled student model are directly pre-
dicted by a weight prediction network. The main idea is to use deep learning models to learn the
parameter distillation process and generate an entire student model by producing all of its
weights in a single pass. This can reduce the overall computation cost in cases where the tasks
∗Corresponding author.