Parameter-Efficient Masking Networks
Yue Bai1*, Huan Wang1,4, Xu Ma1, Yitian Zhang1, Zhiqiang Tao3, Yun Fu1,2,4
1Department of Electrical and Computer Engineering, Northeastern University
2Khoury College of Computer Science, Northeastern University
3School of Information, Rochester Institute of Technology
4AInnovation Labs, Inc.
Project Homepage: https://yueb17.github.io/PEMN
Abstract
A deeper network structure generally handles more complicated non-linearity and performs more competitively. Nowadays, advanced network designs often contain a large number of repetitive structures (e.g., Transformer). They empower the network capacity to a new level but also inevitably increase the model size, which is unfriendly to either model restoring or transferring. In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning diverse masks, and we introduce the Parameter-Efficient Masking Networks (PEMN). This also naturally leads to a new paradigm for model compression to diminish the model size. Concretely, motivated by the repetitive structures in modern neural networks, we utilize one randomly initialized layer, accompanied by different masks, to convey different feature mappings and represent repetitive network modules. Therefore, the model can be expressed as one layer with a bunch of masks, which significantly reduces the model storage cost. Furthermore, we enhance our strategy by learning masks for a model filled by padding a given random weight vector. In this way, our method can further lower the space complexity, especially for models without many repetitive architectures. We validate the potential of PEMN learning masks on random weights with limited unique values and test its effectiveness for a new compression paradigm based on different network architectures. Code is available at https://github.com/yueb17/PEMN.
1 Introduction
Deep neural networks have emerged in many application fields and achieved state-of-the-art performances [9, 18, 34]. Along with the data explosion in this era, huge amounts of data are gathered to build network models with higher capacity [4, 8, 30]. In addition, researchers also pursue a unified network framework to deal with multi-modal and multi-task problems as a powerful intelligent model [30, 42]. All these trending topics inevitably require even larger and deeper network models to tackle diverse data flows, raising new challenges for compressing and transmitting models, especially for mobile systems.
Despite the success of recent years with promising task performances, advanced neural networks suffer from their growing size, which causes inconvenience for both model storage and transferring. To reduce the model size of a given network architecture, neural network pruning is a typical technique [26, 24, 13]. Pruning approaches remove redundant weights using designed criteria, and the pruning operation can be conducted on both pretrained models (conventional pruning: [14, 13]) and randomly initialized models (pruning at initialization: [25, 37]). Another promising direction is to obtain sparse networks by dynamic sparse training [10, 29], which jointly optimizes network architectures and weights to find good sparse networks. Basically, these methods commonly demand regular training, and the final weights are updated automatically by optimization algorithms such as SGD.
*Corresponding author: bai.yue@northeastern.edu
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Now that the trained weights have such a great representative capacity, one may wonder what the potential of random and fixed weights is, or whether it is possible to achieve the same performance with random weights. If we consider a whole network, the answer is obviously negative, as a random network cannot provide informative and distinguishable outputs. However, picking a subnetwork from a random dense network makes it possible, as the feature mapping varies with changes of the subnetwork structure. The question then becomes: what is the representative potential of random and fixed weights when selecting subnetwork structures? The pioneering work LTH [11] shows that a winning ticket exists in a random network with good trainability, but it cannot be used directly without further training. Supermasks [44] enhances the winning ticket and enables it to be used directly. The recent work Popup [31] significantly improves the subnetwork capacity relative to its dense counterpart by learning the masks using backpropagation. Following this insightful perspective, we further ask: what is the maximum representative potential of a set of random weights? In this work, we first make a thorough exploration of this scientific question and propose our Parameter-Efficient Masking Networks (PEMN). Then, leveraging PEMN, we naturally introduce a new network compression paradigm that combines a set of fixed random weights with corresponding learned masks to represent the whole network.
We start with network architectures that follow the recent popular design style, i.e., building a small-scale encoding module and stacking it to obtain a deep neural network [9, 33, 32]. Based on this observation, we naturally propose the One-layer strategy, which uses one module as a prototype and copies its parameters into the other repetitive structures. More generally, we further provide two versions, max-layer padding (MP) and random weight padding (RP), to handle diverse network structures. Specifically, MP chooses the layer with the largest number of parameters as the prototype and uses the first portion of the prototype's parameters to fill the other layers. RP even breaks the constraint of the network architecture: it samples a random vector of a certain length as the prototype, which is copied several times to fill all layers according to their different lengths. RP is architecture-agnostic and can be seen as the most general strategy in our work. The three strategies go from specific to general and gradually reduce the number of unique parameters. We first employ these strategies to randomly initialize the network. Then, we learn different masks to explore the potential of the random weights and positively answer the scientific question above. Leveraging this, we propose a new network compression paradigm that uses a set of random weights with a bunch of masks to represent a network model instead of restoring sparse weights for all layers (see Fig. 1). We conduct comprehensive experiments to explore the representative potential of random weights and test the model compression performance to validate our paradigm. We summarize our contributions as below:
• We scientifically explore the representative potential of fixed random weights with limited unique values and introduce our Parameter-Efficient Masking Networks (PEMN), which leverages learning different masks to represent different feature mappings.
• A novel network compression paradigm is naturally proposed by fully utilizing the representative capacity of random weights. We represent and restore a network based on a given random vector with a bunch of masks instead of retaining all the sparse weights.
• Extensive experimental results explore the potential of random weights using our PEMN and test the compression performance of our new paradigm. We expect our work to inspire more interesting explorations in this direction.
2 Related Works
2.1 Sparse Network Training
Our work is related to sparse network training. Conventional pruning techniques finetune the pruned network from pretrained models [13, 14] with various pruning criteria for different applications [26, 15, 16, 38, 36, 40, 41]. Instead of pruning a pretrained model, pruning-at-initialization approaches [37] attempt to find winning tickets from random weights. Gradient information is used to build pruning criteria in [25, 35]. Different from the pruning methods above, sparse network training can also be conducted in a dynamic fashion. To name a few, Rigging the Lottery [10] edits the network connections and jointly updates the learnable weights. Dynamic Sparse Reparameterization [29] modifies the parameter budget among the whole network dynamically. Sparse Networks from Scratch [7] proposes a momentum-based approach to adaptively grow weights and empirically verifies its effectiveness.
[Figure 1 schematic: left panel, Conventional Sparse Network (several sparse weight matrices, optimized weights, pruning using criteria, feature mappings); right panel, Parameter-Efficient Masking Networks (random fixed weights, one layer with several masks, selecting using masks, feature mappings).]
Figure 1: Comparison of different ways to represent a neural network. Different feature mappings are shown as blue rectangles. Squares with different color patches inside represent the parameters of different layers. Left is the conventional fashion, where weights are optimized and sparse structures are decided by certain criteria. Right is our PEMN representation of a network, where the prototype weights are fixed and repetitively used to fill the whole network, and different masks are learned to deliver different feature mappings. Following this line, we explore the representative potential of random weights and propose a novel paradigm to achieve model compression by combining a set of random weights and a bunch of masks.
Most sparse network training methods achieve network sparsity by keeping necessary weights and removing others, which reduces the cost of model storage and transferring. In our work, we propose a novel model compression paradigm that leverages the representative potential of random weights accompanied by subnetwork selection.
2.2 Random Network Selection
Our work inherits the research line of exploring the representative capacity of random networks. The potential of randomly initialized networks was pioneeringly explored by the Lottery Ticket Hypothesis [11] and further investigated by [28, 2, 39]. It articulates that there exists a winning-ticket subnetwork in a random dense network; this subnetwork can be trained in isolation and achieves results comparable to its dense counterpart. Moreover, the potential of the winning ticket is further explored in Supermasks [44], which surprisingly discovers that a subnetwork can be identified from a dense network to obtain reasonable performance without training. It extends the potential of the subnetwork from good trainability to direct usability. More recently, the representative capacity of subnetworks is enhanced by the Popup algorithm proposed by [31]: based on random dense initialization, a learnable mask is optimized to obtain a subnetwork with promising results. Instead of considering a network with random weights, a network with the same shared parameters can also deliver representative capacity to some extent, as investigated by Weight Agnostic Neural Networks [12], which also inspires this research direction. We are highly motivated by these studies to investigate the representative potential of random weights with limited unique values by learning various masks.
2.3 Weight Sharing
Our study is also related to several recent works on weight sharing. This strategy has been explored and analyzed in convolutional neural networks for efficiency [19, 43]. In addition, several works propose efficient transformer architectures using weight-sharing strategies [22, 5, 1]. There are two main differences between these works and our study: 1) They follow the regular optimization strategy to learn the weights in a recurrent fashion, which is closer to a recurrent neural network; our work follows a different setting, using fixed repetitive random weights to fill the whole network and employing different masks to represent different feature mappings. 2) They mainly conduct cross-layer weight sharing for repetitive transformer structures; in our work, we explore the potential of a random weight vector of limited length as a much smaller repetitive granularity to fill the whole network, which is more challenging than the cross-layer sharing strategy.
[Figure 2 schematic: four panels, Regular, One-Layer, Max-Layer Padding, and Random Vector Padding, each showing how the randomly initialized prototype is copied to fill the network's layers.]
Figure 2: Illustrations of different strategies in PEMN to represent network structures. Compared with the regular fashion where all parameters are randomly initialized, we provide three parameter-efficient strategies, One-layer, Max-layer padding (MP), and Random vector padding (RP), to fully explore the representative capacity of random weights.
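To make the padding strategies concrete, below is a minimal PyTorch-style sketch of Max-layer padding and Random vector padding. It is an illustration under our own naming (fill_from_prototype, max_layer_padding, the prototype length of 4096), not the authors' released implementation; see the project repository for the official code.

```python
import torch

def fill_from_prototype(model, prototype):
    """Fill every parameter tensor by repeating a 1-D prototype vector (RP).

    The prototype is tiled and truncated to each layer's size, so the whole
    network only stores len(prototype) unique random values.
    """
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            reps = -(-n // prototype.numel())      # ceil division
            filled = prototype.repeat(reps)[:n]    # tile, then truncate to fit
            p.copy_(filled.view_as(p))

def max_layer_padding(model):
    """MP: use the largest layer's random weights as the prototype."""
    largest = max(model.parameters(), key=lambda p: p.numel())
    fill_from_prototype(model, largest.detach().flatten().clone())

# Random vector padding (RP): the prototype is an arbitrary-length random vector.
# model = ...  # any torch.nn.Module
# fill_from_prototype(model, torch.randn(4096))
```

In this sketch, `fill_from_prototype(model, torch.randn(4096))` realizes RP with a 4096-value prototype, while the One-layer strategy would instead copy one prototype block's parameters into every repeated block of the architecture.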
3 Parameter-Efficient Masking Networks
3.1 Instinctive Motivation
An overparameterized, randomly initialized neural network benefits network optimization and leads to higher performance. Inevitably, the trained network contains redundant parameters and can be further compressed, which is the premise of conventional neural network pruning. On the other hand, the network redundancy also ensures that a large random network contains a huge number of possible subnetworks; thus, carefully selecting a specific subnetwork can yield promising performance. This point of view has been proved by [31, 44], which demonstrate the representative potential of certain subset combinations of a given set of random weights. Following this line, we naturally ask a question: what is the maximum representative potential of a set of random weights? Or, in other words: can we use random weights with limited unique values to represent a usable network? We answer this question positively and introduce our Parameter-Efficient Masking Networks (PEMN). Moreover, leveraging two facts, 1) unlike a trained network whose weight values cannot be predicted, we can access the random weights before we select the subnetwork, and 2) the selected subnetwork can be efficiently represented by a bunch of masks, we can drastically reduce the network storage size and establish a new paradigm for network compression.
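As a rough, illustrative calculation of why this paradigm saves space (our own back-of-the-envelope sketch, not a number reported in the paper): a dense float32 model stores 4 bytes per weight, whereas PEMN only needs the prototype values plus roughly one mask bit per parameter position, ignoring any further compression of the mask.

```python
def pemn_storage_bytes(num_params, proto_len, bytes_per_weight=4):
    """Rough storage comparison (illustrative only).

    Dense model: every weight stored, e.g. float32 -> 4 bytes each.
    PEMN: one random prototype of proto_len weights plus a binary
    mask with one bit per parameter position.
    """
    dense = num_params * bytes_per_weight
    pemn = proto_len * bytes_per_weight + num_params // 8
    return dense, pemn

# Example: an 11M-parameter model with a 4096-value prototype (hypothetical sizes).
dense, pemn = pemn_storage_bytes(11_000_000, 4096)
print(f"dense: {dense/1e6:.1f} MB, PEMN: {pemn/1e6:.2f} MB")
# dense: 44.0 MB, PEMN: 1.39 MB (0.016 MB prototype + 1.375 MB mask)
```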
3.2 Sparse Selection
We follow [31] to conduct the sparse network selection. We start from a randomly initialized neural network consisting of $L$ layers. Each layer $l \in \{1, 2, \ldots, L\}$ computes
$$I_{l+1} = \sigma(F[I_l; w_l]), \quad (1)$$
where $I_l$ and $I_{l+1}$ are the input and output of layer $l$, $\sigma$ is the activation, and $F$ represents the encoding layer, such as a convolutional or linear layer, with parameters $w_l = \{w_l^1, w_l^2, \ldots, w_l^{d_l}\}$, where $d_l$ is the parameter dimension of layer $l$. To perform the sparse selection, all the weights $w = \{w_1, w_2, \ldots, w_L\}$ are fixed and denoted as $\tilde{w}$. To pick the fixed weights for the subnetwork, each weight $w_l^j$ is assigned a learnable element-wise score $s_l^j$ to indicate its importance in the network. Eq. 1 is rewritten as
$$I_{l+1} = \sigma(F[I_l; w_l \odot h(s_l)]), \quad (2)$$
where $s_l = \{s_l^1, s_l^2, \ldots, s_l^{d_l}\}$ is the score vector and $h(\cdot)$ is the indicator function that creates the mask: it outputs 1 when the value of $s_l^j$ belongs to the top $K\%$ highest scores and 0 otherwise, where $K$ is the predefined sparse selection ratio. By optimizing $s$ with fixed $w$, a subset of the original dense network is selected as the subnetwork.
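For concreteness, the following sketch implements the masked forward pass of Eq. 2 for a single linear layer in the spirit of the edge-popup algorithm of [31]. The straight-through backward pass, the score initialization, and the class interface are assumptions of ours, not a reproduction of the official PEMN code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """h(s): 1 for the top-K% scores, 0 otherwise; straight-through backward."""

    @staticmethod
    def forward(ctx, scores, k):
        flat = scores.flatten()
        num_kept = max(1, int(k * flat.numel()))
        threshold = flat.topk(num_kept).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients to the scores unchanged.
        return grad_output, None


class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, k=0.5):
        super().__init__()
        # Fixed random weights w_l (never updated).
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01,
                                   requires_grad=False)
        # Learnable per-weight scores s_l.
        self.scores = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.k = k

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.k)  # h(s_l)
        return F.linear(x, self.weight * mask)            # w_l elementwise-masked
```

During training, only `scores` receives gradients while the fixed weights $\tilde{w}$ never change, which is exactly what allows the layer to be stored as a prototype vector plus a binary mask.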