META-ENSEMBLE PARAMETER LEARNING
Zhengcong Fei, Shuman Tian, Junshi Huang, Xiaoming Wei, Xiaolin Wei
Meituan
Beijing, China
{name}@meituan.com
ABSTRACT
Ensembles of machine learning models yield improved performance as well as robustness. However, their memory requirements and inference costs can be prohibitively high. Knowledge distillation is an approach that allows a single model to efficiently capture the approximate performance of an ensemble, but it scales poorly because re-training is required whenever new teacher models are introduced. In this paper, we study whether a meta-learning strategy can be used to directly predict the parameters of a single model with performance comparable to that of an ensemble. To this end, we introduce WeightFormer, a Transformer-based model that predicts student network weights layer by layer in a single forward pass, conditioned on the teacher model parameters. The properties of WeightFormer are investigated on the CIFAR-10, CIFAR-100, and ImageNet datasets for the model structures VGGNet-11, ResNet-50, and ViT-B/32, where it is demonstrated that our method achieves approximately the classification performance of an ensemble and outperforms both the single network and standard knowledge distillation. More encouragingly, we show that WeightFormer can further exceed the average ensemble with minor fine-tuning. Importantly, our task, together with the model and results, can potentially lead to a new, more efficient, and scalable paradigm for ensemble network parameter learning.
1 INTRODUCTION
As machine learning models are deployed ever more widely in practice, memory cost and inference efficiency become increasingly important (Bucilua et al., 2006; Polino et al., 2018). Ensemble methods, which train several independent models and combine their decisions, are well known to yield both improved performance and reliable estimates (Perrone & Cooper, 1992; Drucker et al., 1994; Opitz & Maclin, 1999; Dietterich, 2000; Sagi & Rokach, 2018). Despite these useful properties, using ensembles can be computationally prohibitive. Obtaining predictions in real-time applications is often expensive even for a single model, and the hardware requirements for serving an ensemble scale linearly with the number of teacher models (Buizza & Palmer, 1998; Bonab & Can, 2019). As a result, over the past several years the area of knowledge distillation has gained increasing attention (Hinton et al., 2015; Freitag et al., 2017; Malinin et al., 2019; Lin et al., 2020; Park et al., 2021; Zhao et al., 2022). Broadly speaking, distillation methods aim to train a single student model that approximates the behavior of a teacher ensemble at a much lower computational cost. In the simplest and most frequently used form of distillation (Hinton et al., 2015), the student model is trained to capture the average prediction of the ensemble; e.g., in the case of image classification, this reduces to minimizing the KL divergence between the soft labels of the student model and the teacher models.
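As a concrete illustration (not code from the paper), the standard soft-label distillation loss can be written as follows; the temperature T and the PyTorch framework are assumptions commonly used with Hinton-style distillation, not details specified in this section.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student distributions.
    # The conventional T*T factor keeps gradient magnitudes comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)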
When optimizing the parameters for a new ensemble model, the typical knowledge distillation process disregards the teacher model parameters themselves as well as past experience from distilling previous teacher models. However, leveraging this training information can be the key to reducing the high computational demands. To progress in this direction, we propose a new task, referred to as Meta-Ensemble Parameter Learning, in which the parameters of the distilled student model are directly predicted by a weight prediction network. The main idea is to use deep learning models to learn the parameter distillation process and generate an entire student model by producing all of its weights in a single pass. This can reduce the overall computation cost in cases where the tasks
or ensemble models update frequently. It is important to highlight that meta-ensemble parameter learning has, to our knowledge, not been previously investigated. Figure 1 depicts various information transfer paradigms, including model ensemble, knowledge distillation, and meta-ensemble parameter learning; the dotted line represents the training flow.

Figure 1: Illustration of different knowledge induction frameworks: (a) model ensemble, (b) knowledge distillation, and (c) meta-ensemble parameter learning.
To address this task, we introduce WeightFormer, a model that directly predicts the distilled student model parameters. Our architecture takes inspiration from the Transformer (Vaswani et al., 2017) and incorporates two key novelties to imitate the characteristics of model ensembling, i.e., cross-layer information flow and a shift consistency constraint. With these techniques, we evaluate the classification performance obtained by the predicted parameters for the conventional convolutional architectures VGGNet-11 and ResNet-50 and the transformer architecture ViT-B/32 (Dosovitskiy et al., 2020), on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively. Experimental results show that the predicted models, produced by our method in one forward pass, approach the performance of the average ensemble and significantly outperform regular knowledge distillation models. Moreover, with fine-tuning, WeightFormer can even exceed the average ensemble, which suggests that it can cope well with the variability and complexity of application scenarios.
Overall, our framework and results pave the way toward a new and significantly more efficient paradigm for ensemble model parameter learning. The contributions of this paper are summarized as follows: i) We introduce the novel task of directly predicting the parameters of a distilled student model from the parameters of multiple teacher neural networks, which encourages exploiting past ensemble training experience to improve performance as well as reduce computational demand; ii) We design WeightFormer, a simple and effective baseline equipped with cross-layer information flow and a shift consistency constraint, to track progress on the model weight generation task. Experimentally, our approach performs surprisingly well and robustly across different model architectures and datasets; iii) We show that WeightFormer can be transferred to weight generation for unseen teacher models in a single forward pass, and that more competitive results can be obtained with additional fine-tuning data. Moreover, to improve reproducibility and foster new research in this field, we will publicly release the source code and trained models.
2 TASK FORMULATION
Here we consider the problem of distilling a neural network from several trained neural networks, also known as the teacher-student paradigm (Hinton et al., 2015), on the image classification task (Rokach, 2010). It essentially aims to train a single student model that captures the mean decision of an ensemble, allowing higher performance to be achieved at a far lower computation cost. This problem can be formalized as finding optimal parameters $\tilde{w}$ for a target neural network $\tilde{f}$, given a set of neural networks $F = \{f_1, \dots, f_N\}$ parameterized by $W = \{w_1, \dots, w_N\}$, w.r.t. a loss function on the dataset $D = \{(x_i, y_i)\}_{i=1}^{M}$ of input images $x_i$ and ground-truth labels $y_i$:

$$\min_{\tilde{w}} \; \sum_{i=1}^{M} \mathrm{KL}\Big( \tilde{p}(x_i) \,\Big\|\, \frac{1}{N} \sum_{n=1}^{N} p_n(x_i) \Big), \qquad (1)$$
where the optimization objective is the Kullback-Leibler divergence, denoted $\mathrm{KL}(\cdot \,\|\, \cdot)$, between the mean soft labels from the teacher models and the predictions from the student model; $p_n(\cdot)$ is the output distribution of the $n$-th network and $\tilde{w}$ are the resulting parameters of the ensemble distillation model. Here we assume that all teacher models share the same network architecture and leave the ensemble learning of heterogeneous models as future work. Despite the progress in memory saving for the distilled ensemble model $\tilde{f}$, obtaining $\tilde{w}$ remains a bottleneck in large-scale machine learning pipelines. In particular, with the growing size of networks, the classical process of obtaining ensemble parameters, retraining from scratch, is becoming computationally unsustainable.

Figure 2: Overview of WeightFormer for the generation of one layer's weights. The Transformer-based weight generator receives the concatenated weight matrices of the teacher models, along with model id and position information, and produces the corresponding layer weights. Once generated, the predicted student model is used to compute the loss on the training set, whose gradients are then used to update the weights of WeightFormer. "[cross]" is a special token placed at the beginning of all weight matrices to model the cross-layer information flow. The right part illustrates the process of shift consistency, where the predicted layer weights should be consistent with shifted input models (see the light brown token).
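To make the objective in Eq. (1) concrete, the following is a minimal sketch of the ensemble distillation loss. PyTorch, the tensor shapes, and the function name are assumptions for illustration; the paper does not prescribe a particular implementation.

import torch

def ensemble_distillation_loss(student_logits, teacher_logits_list):
    # student_logits: (batch, num_classes); teacher_logits_list: N tensors of the same shape.
    student_probs = torch.softmax(student_logits, dim=-1)
    teacher_mean = torch.stack(
        [torch.softmax(t, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    # KL(p_student || mean_n p_n), summed over classes and averaged over the batch, as in Eq. (1).
    kl = (student_probs * (student_probs.log() - teacher_mean.log())).sum(dim=-1)
    return kl.mean()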
In this paper, we highlight that the knowledge gained from preceding ensemble training is also important for parameter optimization and propose a new task, named Meta-Ensemble Parameter Learning, in which the parameters of the distilled ensemble model are directly predicted by a deep learning network. Formally, the task aims to generate the parameters $\tilde{w}$ of the target model $\tilde{f}$ in a single forward pass using a specific weight generation network $g_\theta$, parameterized by $\theta$:

$$\tilde{w} = g([w_1, \dots, w_N]; \theta). \qquad (2)$$

This task is constrained to a dataset $D$, so $\tilde{w}$ is the predicted parameter set for which the test-set performance of $\tilde{f}(x; \tilde{w})$ approximates the performance of the model ensemble while maintaining training efficiency and scalability. In this manner, we can even distill unseen teacher models and achieve competitive performance without any training cost.
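As an illustration only (the class, its architecture, and the helper names below are hypothetical placeholders, not the WeightFormer model introduced in Section 3), the interface implied by Eq. (2) and the training signal from Eq. (1) could be sketched as follows.

import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    # Placeholder g_theta: maps the N flattened teacher weight vectors to one student weight vector.
    def __init__(self, num_teachers, weight_dim):
        super().__init__()
        self.mix = nn.Linear(num_teachers * weight_dim, weight_dim)

    def forward(self, teacher_weights):          # teacher_weights: (N, weight_dim)
        return self.mix(teacher_weights.flatten())  # predicted student weights: (weight_dim,)

# Training-loop sketch: predict student weights in one forward pass, evaluate the
# (functional) student on a batch, and backpropagate the distillation loss of Eq. (1)
# into theta. `functional_student` is an assumed helper that runs the architecture
# with externally supplied weights.
#
# w_student = generator(torch.stack([w.flatten() for w in teacher_weight_list]))
# student_logits = functional_student(images, w_student)
# teacher_logits_list = [teacher(images) for teacher in teachers]
# loss = ensemble_distillation_loss(student_logits, teacher_logits_list)
# loss.backward(); optimizer.step(); optimizer.zero_grad()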
3 METHODOLOGY
In this section, we describe our approach, dubbed WeightFormer, which serves as an effective solution for meta-ensemble parameter learning based on the Transformer structure. For simplicity, we describe the prediction of CNN models consisting of a set of convolutional layers and two fully-connected logits layers, as well as of self-attention layers in Transformers. Please note that most common parametric layers can be predicted by WeightFormer, as shown in the experiments.
3.1 REPRESENTATION OF MODEL PARAMETERS
For the weight matrices in the different layers of the teacher models, given the convolutional kernel size $k$ and the input/output channel numbers $n_{\mathrm{input}}$ / $n_{\mathrm{output}}$, we encode a $k \times k \times n_{\mathrm{input}} \times n_{\mathrm{output}}$ convolutional kernel as $n_{\mathrm{output}}$ tokens with weight slices of dimensionality $k^2 \times n_{\mathrm{input}}$, and the $n_{\mathrm{input}} \times n_{\mathrm{output}}$ weights of a fully-connected logits layer as $n_{\mathrm{output}}$ tokens of dimensionality $n_{\mathrm{input}}$ (Zhmoginov et al., 2022). For the parameters of a self-attention layer, which includes $h$ projection
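A minimal sketch of the convolutional and fully-connected weight-to-token encodings described above is given below; the tensor layouts and function names are assumptions for illustration and may differ from the paper's implementation.

import torch

def conv_to_tokens(kernel):
    # kernel: (n_output, n_input, k, k) -> n_output tokens of dimensionality k*k*n_input
    n_output = kernel.shape[0]
    return kernel.reshape(n_output, -1)

def fc_to_tokens(weight):
    # weight: (n_output, n_input) -> n_output tokens of dimensionality n_input
    return weight

# Example: a 3x3 kernel with 64 input and 128 output channels yields 128 tokens of size 576.
tokens = conv_to_tokens(torch.randn(128, 64, 3, 3))   # shape: (128, 576)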