META-ENSEMBLE PARAMETER LEARNING
Zhengcong Fei, Shuman Tian, Junshi Huang, Xiaoming Wei, Xiaolin Wei
Meituan
Beijing, China
{name}@meituan.com
ABSTRACT
Ensembles of machine learning models yield improved performance as well as robustness. However, their memory requirements and inference costs can be prohibitively high. Knowledge distillation is an approach that allows a single model to efficiently capture the approximate performance of an ensemble, but it scales poorly because re-training is required whenever new teacher models are introduced. In this paper, we study whether a meta-learning strategy can be used to directly predict the parameters of a single model with performance comparable to that of an ensemble. To this end, we introduce WeightFormer, a Transformer-based model that predicts student network weights layer by layer in a single forward pass, conditioned on the teacher model parameters. The properties of WeightFormer are investigated on the CIFAR-10, CIFAR-100, and ImageNet datasets for the model structures VGGNet-11, ResNet-50, and ViT-B/32, where it is demonstrated that our method achieves approximately the classification performance of an ensemble and outperforms both the single network and standard knowledge distillation. More encouragingly, we show that WeightFormer can further exceed the average ensemble with minor fine-tuning. Importantly, our task, together with the model and results, can potentially lead to a new, more efficient, and scalable paradigm for ensemble network parameter learning.
1 INTRODUCTION
As machine learning models are deployed ever more widely in practice, memory cost and inference efficiency become increasingly important (Bucilua et al., 2006; Polino et al., 2018). Ensemble methods, which train several independent models and combine their decisions, are well known to yield both improved performance and reliable estimates (Perrone & Cooper, 1992; Drucker et al., 1994; Opitz & Maclin, 1999; Dietterich, 2000; Sagi & Rokach, 2018). Despite these useful properties, using ensembles can be computationally prohibitive. Obtaining predictions in real-time applications is often expensive even for a single model, and the hardware requirements for serving an ensemble scale linearly with the number of teacher models (Buizza & Palmer, 1998; Bonab & Can, 2019). As a result, over the past several years the area of knowledge distillation has gained increasing attention (Hinton et al., 2015; Freitag et al., 2017; Malinin et al., 2019; Lin et al., 2020; Park et al., 2021; Zhao et al., 2022). Broadly speaking, distillation methods aim to train a single student model that approximates the behavior of a teacher ensemble at a much lower computational cost. In the simplest and most frequently used form of distillation (Hinton et al., 2015), the student model is trained to capture the average prediction of the ensemble; e.g., in the case of image classification, this reduces to minimizing the KL divergence between the soft labels of the student model and the teacher models.
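As a concrete illustration (not code from the paper), the standard soft-label distillation loss can be written as follows; the temperature T and the PyTorch framework are assumptions commonly used with Hinton-style distillation, not details specified in this section.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student distributions.
    # The conventional T*T factor keeps gradient magnitudes comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)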
When optimizing the parameters for a new ensemble model, the typical knowledge distillation process disregards the teacher model parameters themselves as well as past experience from distilling previous teacher models. However, leveraging this training information can be the key to reducing the high computational demands. To progress in this direction, we propose a new task, referred to as Meta-Ensemble Parameter Learning, in which the parameters of the distilled student model are directly predicted by a weight prediction network. The main idea is to use deep learning models to learn the parameter distillation process and generate an entire student model by producing all of its weights in a single pass. This can reduce the overall computation cost in cases where the tasks
or ensemble models update frequently. It is important to highlight that meta-ensemble parameter learning has, to our knowledge, not been previously investigated. Figure 1 depicts various information transfer paradigms, including model ensemble, knowledge distillation, and meta-ensemble parameter learning; the dotted line represents the training flow.

Figure 1: Illustration of different knowledge induction frameworks: (a) model ensemble, (b) knowledge distillation, and (c) meta-ensemble parameter learning.
To address this task, we introduce WeightFormer, a model that directly predicts the distilled student model parameters. Our architecture takes inspiration from the Transformer (Vaswani et al., 2017) and incorporates two key novelties to imitate the characteristics of model ensembling, i.e., cross-layer information flow and a shift consistency constraint. With these techniques, we evaluate the classification performance obtained by the predicted parameters for the conventional convolutional architectures VGGNet-11 and ResNet-50 and the transformer architecture ViT-B/32 (Dosovitskiy et al., 2020), on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively. Experimental results show that the predicted models, produced by our method in one forward pass, approach the performance of the average ensemble and significantly outperform regular knowledge distillation models. Moreover, with fine-tuning, WeightFormer can even exceed the average ensemble, which suggests that it can cope well with the variability and complexity of application scenarios.
Overall, our framework and results pave the way toward a new and significantly more efficient paradigm for ensemble model parameter learning. The contributions of this paper are summarized as follows: i) We introduce the novel task of directly predicting the parameters of a distilled student model from the parameters of multiple teacher neural networks, which encourages exploiting past ensemble training experience to improve performance as well as reduce computational demand; ii) We design WeightFormer, a simple and effective baseline equipped with cross-layer information flow and a shift consistency constraint, to track progress on the model weight generation task. Experimentally, our approach performs surprisingly well and robustly across different model architectures and datasets; iii) We show that WeightFormer can be transferred to weight generation for unseen teacher models in a single forward pass, and that more competitive results can be obtained with additional fine-tuning data. Moreover, to improve reproducibility and foster new research in this field, we will publicly release the source code and trained models.
2 TASK FORMULATION
Here we consider the problem of distilling a neural network from several trained neural networks, also known as the teacher-student paradigm (Hinton et al., 2015), on the image classification task (Rokach, 2010). It essentially aims to train a single student model that captures the mean decision of an ensemble, allowing higher performance to be achieved at a far lower computation cost. This problem can be formalized as finding optimal parameters $\tilde{w}$ for a target neural network $\tilde{f}$, given a set of neural networks $F = \{f_1, \dots, f_N\}$ parameterized by $W = \{w_1, \dots, w_N\}$, w.r.t. a loss function on the dataset $D = \{(x_i, y_i)\}_{i=1}^{M}$ of input images $x_i$ and ground-truth labels $y_i$:

$$\min_{\tilde{w}} \; \sum_{i=1}^{M} \mathrm{KL}\Big( \tilde{p}(x_i) \,\Big\|\, \frac{1}{N} \sum_{n=1}^{N} p_n(x_i) \Big), \qquad (1)$$
where the optimization objective is the Kullback-Leibler divergence, denoted $\mathrm{KL}(\cdot \,\|\, \cdot)$, between the mean soft labels from the teacher models and the predictions from the student model; $p_n(\cdot)$ is the output distribution of the $n$-th network and $\tilde{w}$ are the resulting parameters of the ensemble distillation model. Here we assume that all teacher models share the same network architecture and leave the ensemble learning of heterogeneous models as future work. Despite the progress in memory saving for the distilled ensemble model $\tilde{f}$, obtaining $\tilde{w}$ remains a bottleneck in large-scale machine learning pipelines. In particular, with the growing size of networks, the classical process of obtaining ensemble parameters, retraining from scratch, is becoming computationally unsustainable.

Figure 2: Overview of WeightFormer for the generation of one layer's weights. The Transformer-based weight generator receives the concatenated weight matrices of the teacher models, along with model id and position information, and produces the corresponding layer weights. Once generated, the predicted student model is used to compute the loss on the training set, whose gradients are then used to update the weights of WeightFormer. "[cross]" is a special token placed at the beginning of all weight matrices to model the cross-layer information flow. The right part illustrates the process of shift consistency, where the predicted layer weights should be consistent with shifted input models (see the light brown token).
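To make the objective in Eq. (1) concrete, the following is a minimal sketch of the ensemble distillation loss. PyTorch, the tensor shapes, and the function name are assumptions for illustration; the paper does not prescribe a particular implementation.

import torch

def ensemble_distillation_loss(student_logits, teacher_logits_list):
    # student_logits: (batch, num_classes); teacher_logits_list: N tensors of the same shape.
    student_probs = torch.softmax(student_logits, dim=-1)
    teacher_mean = torch.stack(
        [torch.softmax(t, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    # KL(p_student || mean_n p_n), summed over classes and averaged over the batch, as in Eq. (1).
    kl = (student_probs * (student_probs.log() - teacher_mean.log())).sum(dim=-1)
    return kl.mean()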
In this paper, we highlight that the knowledge gained from preceding ensemble training is also important for parameter optimization and propose a new task, named Meta-Ensemble Parameter Learning, in which the parameters of the distilled ensemble model are directly predicted by a deep learning network. Formally, the task aims to generate the parameters $\tilde{w}$ of the target model $\tilde{f}$ in a single forward pass using a specific weight generation network $g_\theta$, parameterized by $\theta$:

$$\tilde{w} = g([w_1, \dots, w_N]; \theta). \qquad (2)$$

This task is constrained to a dataset $D$, so $\tilde{w}$ is the predicted parameter set for which the test-set performance of $\tilde{f}(x; \tilde{w})$ approximates the performance of the model ensemble while maintaining training efficiency and scalability. In this manner, we can even distill unseen teacher models and achieve competitive performance without any training cost.
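As an illustration only (the class, its architecture, and the helper names below are hypothetical placeholders, not the WeightFormer model introduced in Section 3), the interface implied by Eq. (2) and the training signal from Eq. (1) could be sketched as follows.

import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    # Placeholder g_theta: maps the N flattened teacher weight vectors to one student weight vector.
    def __init__(self, num_teachers, weight_dim):
        super().__init__()
        self.mix = nn.Linear(num_teachers * weight_dim, weight_dim)

    def forward(self, teacher_weights):          # teacher_weights: (N, weight_dim)
        return self.mix(teacher_weights.flatten())  # predicted student weights: (weight_dim,)

# Training-loop sketch: predict student weights in one forward pass, evaluate the
# (functional) student on a batch, and backpropagate the distillation loss of Eq. (1)
# into theta. `functional_student` is an assumed helper that runs the architecture
# with externally supplied weights.
#
# w_student = generator(torch.stack([w.flatten() for w in teacher_weight_list]))
# student_logits = functional_student(images, w_student)
# teacher_logits_list = [teacher(images) for teacher in teachers]
# loss = ensemble_distillation_loss(student_logits, teacher_logits_list)
# loss.backward(); optimizer.step(); optimizer.zero_grad()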
3 METHODOLOGY
In this section, we describe our approach, dubbed WeightFormer, which serves as an effective solution for meta-ensemble parameter learning based on the Transformer structure. For simplicity, we describe the prediction of CNN models consisting of a set of convolutional layers and two fully-connected logits layers, as well as of self-attention layers in Transformers. Please note that most common parametric layers can be predicted by WeightFormer, as shown in the experiments.
3.1 REPRESENTATION OF MODEL PARAMETERS
For the weight matrices in the different layers of the teacher models, given the convolutional kernel size $k$ and the input/output channel numbers $n_{\mathrm{input}}$ / $n_{\mathrm{output}}$, we encode a $k \times k \times n_{\mathrm{input}} \times n_{\mathrm{output}}$ convolutional kernel as $n_{\mathrm{output}}$ tokens with weight slices of dimensionality $k^2 \times n_{\mathrm{input}}$, and the $n_{\mathrm{input}} \times n_{\mathrm{output}}$ weights of a fully-connected logits layer as $n_{\mathrm{output}}$ tokens of dimensionality $n_{\mathrm{input}}$ (Zhmoginov et al., 2022). For the parameters of a self-attention layer, which includes $h$ projection
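A minimal sketch of the convolutional and fully-connected weight-to-token encodings described above is given below; the tensor layouts and function names are assumptions for illustration and may differ from the paper's implementation.

import torch

def conv_to_tokens(kernel):
    # kernel: (n_output, n_input, k, k) -> n_output tokens of dimensionality k*k*n_input
    n_output = kernel.shape[0]
    return kernel.reshape(n_output, -1)

def fc_to_tokens(weight):
    # weight: (n_output, n_input) -> n_output tokens of dimensionality n_input
    return weight

# Example: a 3x3 kernel with 64 input and 128 output channels yields 128 tokens of size 576.
tokens = conv_to_tokens(torch.randn(128, 64, 3, 3))   # shape: (128, 576)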