Parameter-Efficient Masking Networks
Yue Bai1*, Huan Wang1,4, Xu Ma1, Yitian Zhang1, Zhiqiang Tao3, Yun Fu1,2,4
1Department of Electrical and Computer Engineering, Northeastern University
2Khoury College of Computer Science, Northeastern University
3School of Information, Rochester Institute of Technology
4AInnovation Labs, Inc.
Project Homepage: https://yueb17.github.io/PEMN
Abstract
A deeper network structure generally handles more complicated non-linearity and performs more competitively. Nowadays, advanced network designs often contain a large number of repetitive structures (e.g., Transformer). They empower the network capacity to a new level but also inevitably increase the model size, which is unfriendly to either model restoring or transferring. In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning diverse masks, and we introduce the Parameter-Efficient Masking Networks (PEMN). This also naturally leads to a new paradigm for model compression to diminish the model size. Concretely, motivated by the repetitive structures in modern neural networks, we utilize one randomly initialized layer, accompanied by different masks, to convey different feature mappings and represent repetitive network modules. Therefore, the model can be expressed as one layer with a bunch of masks, which significantly reduces the model storage cost. Furthermore, we enhance our strategy by learning masks for a model filled by padding a given random weight vector. In this way, our method can further lower the space complexity, especially for models without many repetitive architectures. We validate the potential of PEMN learning masks on random weights with limited unique values and test its effectiveness for a new compression paradigm based on different network architectures. Code is available at https://github.com/yueb17/PEMN.
1 Introduction
Deep neural networks have emerged in many application fields and achieved state-of-the-art performances [9, 18, 34]. Along with the data explosion in this era, huge amounts of data are gathered to build network models with higher capacity [4, 8, 30]. In addition, researchers also pursue a unified network framework to deal with multi-modal and multi-task problems as a powerful intelligent model [30, 42]. All these trending topics inevitably require even larger and deeper network models to tackle diverse data flows, raising new challenges for compressing and transmitting models, especially for mobile systems.
Despite the success of recent years with promising task performances, advanced neural networks suffer from their growing size, which causes inconvenience for both model storage and transferring. To reduce the model size of a given network architecture, neural network pruning is a typical technique [26, 24, 13]. Pruning approaches remove redundant weights using designed criteria, and the pruning operation can be conducted on both pretrained models (conventional pruning: [14, 13]) and randomly initialized models (pruning at initialization: [25, 37]). Another promising direction is to obtain sparse networks by dynamic sparse training [10, 29], which jointly optimizes network architectures and weights to find good sparse networks. Basically, these methods commonly demand regular training, and the final weights are updated automatically by optimization algorithms such as SGD.
*Corresponding author: bai.yue@northeastern.edu
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Now that the trained weights have such a great representative capacity, one may wonder what the potential of random and fixed weights is, or whether it is possible to achieve the same performance with random weights. If we consider a whole network, the answer is obviously negative, as a random network cannot provide informative and distinguishable outputs. However, picking a subnetwork from a random dense network makes it possible, as the feature mapping varies with changes of the subnetwork structure. The question then becomes: what is the representative potential of random and fixed weights when selecting subnetwork structures? The pioneering work LTH [11] shows that a winning ticket exists in a random network with good trainability, but it cannot be used directly without further training. Supermasks [44] enhances the winning ticket and enables it to be used directly. The recent work Popup [31] significantly improves the subnetwork capacity relative to its dense counterpart by learning the masks using backpropagation. Following this insightful perspective, we further ask: what is the maximum representative potential of a set of random weights? In this work, we first make a thorough exploration of this scientific question and propose our Parameter-Efficient Masking Networks (PEMN). Then, leveraging PEMN, we naturally introduce a new network compression paradigm that combines a set of fixed random weights with corresponding learned masks to represent the whole network.
We start with network architectures that follow the recent popular design style, i.e., building a small-scale encoding module and stacking it to obtain a deep neural network [9, 33, 32]. Based on this observation, we naturally propose the One-layer strategy, which uses one module as a prototype and copies its parameters into the other repetitive structures. More generally, we further provide two versions, max-layer padding (MP) and random weight padding (RP), to handle diverse network structures. Specifically, MP chooses the layer with the largest number of parameters as the prototype and uses the first portion of the prototype's parameters to fill the other layers. RP even breaks the constraint of the network architecture: it samples a random vector of a certain length as the prototype, which is copied several times to fill all layers according to their different lengths. RP is architecture-agnostic and can be seen as the most general strategy in our work. The three strategies go from specific to general and gradually reduce the number of unique parameters. We first employ these strategies to randomly initialize the network. Then, we learn different masks to explore the potential of the random weights and positively answer the scientific question above. Leveraging this, we propose a new network compression paradigm that uses a set of random weights with a bunch of masks to represent a network model instead of restoring sparse weights for all layers (see Fig. 1). We conduct comprehensive experiments to explore the representative potential of random weights and test the model compression performance to validate our paradigm. We summarize our contributions as below:
• We scientifically explore the representative potential of fixed random weights with limited unique values and introduce our Parameter-Efficient Masking Networks (PEMN), which leverages learning different masks to represent different feature mappings.
• A novel network compression paradigm is naturally proposed by fully utilizing the representative capacity of random weights. We represent and restore a network based on a given random vector with a bunch of masks instead of retaining all the sparse weights.
• Extensive experimental results explore the potential of random weights using our PEMN and test the compression performance of our new paradigm. We expect our work to inspire more interesting explorations in this direction.
2 Related Works
2.1 Sparse Network Training
Our work is related to sparse network training. Conventional pruning techniques finetune the pruned network from pretrained models [13, 14] with various pruning criteria for different applications [26, 15, 16, 38, 36, 40, 41]. Instead of pruning a pretrained model, pruning-at-initialization approaches [37] attempt to find winning tickets from random weights. Gradient information is used to build pruning criteria in [25, 35]. Different from the pruning methods above, sparse network training can also be conducted in a dynamic fashion. To name a few, Rigging the Lottery [10] edits the network connections and jointly updates the learnable weights. Dynamic Sparse Reparameterization [29] modifies the parameter budget among the whole network dynamically. Sparse Networks from Scratch [7] proposes a momentum-based approach to adaptively grow weights and empirically verifies its effectiveness.
[Figure 1 schematic: left panel, Conventional Sparse Network (several sparse weight matrices, optimized weights, pruning using criteria, feature mappings); right panel, Parameter-Efficient Masking Networks (random fixed weights, one layer with several masks, selecting using masks, feature mappings).]
Figure 1: Comparison of different ways to represent a neural network. Different feature mappings are shown as blue rectangles. Squares with different color patches inside represent the parameters of different layers. Left is the conventional fashion, where weights are optimized and sparse structures are decided by certain criteria. Right is our PEMN representation of a network, where the prototype weights are fixed and repetitively used to fill the whole network, and different masks are learned to deliver different feature mappings. Following this line, we explore the representative potential of random weights and propose a novel paradigm to achieve model compression by combining a set of random weights and a bunch of masks.
Most sparse network training methods achieve network sparsity by keeping necessary weights and removing others, which reduces the cost of model storage and transferring. In our work, we propose a novel model compression paradigm that leverages the representative potential of random weights accompanied by subnetwork selection.
2.2 Random Network Selection
Our work inherits the research line of exploring the representative capacity of random networks. The potential of randomly initialized networks was pioneeringly explored by the Lottery Ticket Hypothesis [11] and further investigated by [28, 2, 39]. It articulates that there exists a winning-ticket subnetwork in a random dense network; this subnetwork can be trained in isolation and achieves results comparable to its dense counterpart. Moreover, the potential of the winning ticket is further explored in Supermasks [44], which surprisingly discovers that a subnetwork can be identified from a dense network to obtain reasonable performance without training. It extends the potential of the subnetwork from good trainability to direct usability. More recently, the representative capacity of subnetworks is enhanced by the Popup algorithm proposed by [31]: based on random dense initialization, a learnable mask is optimized to obtain a subnetwork with promising results. Instead of considering a network with random weights, a network with the same shared parameters can also deliver representative capacity to some extent, as investigated by Weight Agnostic Neural Networks [12], which also inspires this research direction. We are highly motivated by these studies to investigate the representative potential of random weights with limited unique values by learning various masks.
2.3 Weight Sharing
Our study is also related to several recent works on weight sharing. This strategy has been explored and analyzed in convolutional neural networks for efficiency [19, 43]. In addition, several works propose efficient transformer architectures using weight-sharing strategies [22, 5, 1]. There are two main differences between these works and our study: 1) They follow the regular optimization strategy to learn the weights in a recurrent fashion, which is closer to a recurrent neural network; our work follows a different setting, using fixed repetitive random weights to fill the whole network and employing different masks to represent different feature mappings. 2) They mainly conduct cross-layer weight sharing for repetitive transformer structures; in our work, we explore the potential of a random weight vector of limited length as a much smaller repetitive granularity to fill the whole network, which is more challenging than the cross-layer sharing strategy.
[Figure 2 schematic: four panels, Regular, One-Layer, Max-Layer Padding, and Random Vector Padding, each showing how the randomly initialized prototype is copied to fill the network's layers.]
Figure 2: Illustrations of different strategies in PEMN to represent network structures. Compared with the regular fashion where all parameters are randomly initialized, we provide three parameter-efficient strategies, One-layer, Max-layer padding (MP), and Random vector padding (RP), to fully explore the representative capacity of random weights.
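To make the padding strategies concrete, below is a minimal PyTorch-style sketch of Max-layer padding and Random vector padding. It is an illustration under our own naming (fill_from_prototype, max_layer_padding, the prototype length of 4096), not the authors' released implementation; see the project repository for the official code.

```python
import torch

def fill_from_prototype(model, prototype):
    """Fill every parameter tensor by repeating a 1-D prototype vector (RP).

    The prototype is tiled and truncated to each layer's size, so the whole
    network only stores len(prototype) unique random values.
    """
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            reps = -(-n // prototype.numel())      # ceil division
            filled = prototype.repeat(reps)[:n]    # tile, then truncate to fit
            p.copy_(filled.view_as(p))

def max_layer_padding(model):
    """MP: use the largest layer's random weights as the prototype."""
    largest = max(model.parameters(), key=lambda p: p.numel())
    fill_from_prototype(model, largest.detach().flatten().clone())

# Random vector padding (RP): the prototype is an arbitrary-length random vector.
# model = ...  # any torch.nn.Module
# fill_from_prototype(model, torch.randn(4096))
```

In this sketch, `fill_from_prototype(model, torch.randn(4096))` realizes RP with a 4096-value prototype, while the One-layer strategy would instead copy one prototype block's parameters into every repeated block of the architecture.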
3 Parameter-Efficient Masking Networks
3.1 Instinctive Motivation
An overparameterized, randomly initialized neural network benefits network optimization and leads to higher performance. Inevitably, the trained network contains redundant parameters and can be further compressed, which is the premise of conventional neural network pruning. On the other hand, the network redundancy also ensures that a large random network contains a huge number of possible subnetworks; thus, carefully selecting a specific subnetwork can yield promising performance. This point of view has been proved by [31, 44], which demonstrate the representative potential of certain subset combinations of a given set of random weights. Following this line, we naturally ask a question: what is the maximum representative potential of a set of random weights? Or, in other words: can we use random weights with limited unique values to represent a usable network? We answer this question positively and introduce our Parameter-Efficient Masking Networks (PEMN). Moreover, leveraging two facts, 1) unlike a trained network whose weight values cannot be predicted, we can access the random weights before we select the subnetwork, and 2) the selected subnetwork can be efficiently represented by a bunch of masks, we can drastically reduce the network storage size and establish a new paradigm for network compression.
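As a rough, illustrative calculation of why this paradigm saves space (our own back-of-the-envelope sketch, not a number reported in the paper): a dense float32 model stores 4 bytes per weight, whereas PEMN only needs the prototype values plus roughly one mask bit per parameter position, ignoring any further compression of the mask.

```python
def pemn_storage_bytes(num_params, proto_len, bytes_per_weight=4):
    """Rough storage comparison (illustrative only).

    Dense model: every weight stored, e.g. float32 -> 4 bytes each.
    PEMN: one random prototype of proto_len weights plus a binary
    mask with one bit per parameter position.
    """
    dense = num_params * bytes_per_weight
    pemn = proto_len * bytes_per_weight + num_params // 8
    return dense, pemn

# Example: an 11M-parameter model with a 4096-value prototype (hypothetical sizes).
dense, pemn = pemn_storage_bytes(11_000_000, 4096)
print(f"dense: {dense/1e6:.1f} MB, PEMN: {pemn/1e6:.2f} MB")
# dense: 44.0 MB, PEMN: 1.39 MB (0.016 MB prototype + 1.375 MB mask)
```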
3.2 Sparse Selection
We follow [31] to conduct the sparse network selection. We start from a randomly initialized neural network consisting of $L$ layers. Each layer $l \in \{1, 2, \ldots, L\}$ computes
$$I_{l+1} = \sigma(F[I_l; w_l]), \quad (1)$$
where $I_l$ and $I_{l+1}$ are the input and output of layer $l$, $\sigma$ is the activation, and $F$ represents the encoding layer, such as a convolutional or linear layer, with parameters $w_l = \{w_l^1, w_l^2, \ldots, w_l^{d_l}\}$, where $d_l$ is the parameter dimension of layer $l$. To perform the sparse selection, all the weights $w = \{w_1, w_2, \ldots, w_L\}$ are fixed and denoted as $\tilde{w}$. To pick the fixed weights for the subnetwork, each weight $w_l^j$ is assigned a learnable element-wise score $s_l^j$ to indicate its importance in the network. Eq. 1 is rewritten as
$$I_{l+1} = \sigma(F[I_l; w_l \odot h(s_l)]), \quad (2)$$
where $s_l = \{s_l^1, s_l^2, \ldots, s_l^{d_l}\}$ is the score vector and $h(\cdot)$ is the indicator function that creates the mask: it outputs 1 when the value of $s_l^j$ belongs to the top $K\%$ highest scores and 0 otherwise, where $K$ is the predefined sparse selection ratio. By optimizing $s$ with fixed $w$, a subset of the original dense network is selected as the subnetwork.
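For concreteness, the following sketch implements the masked forward pass of Eq. 2 for a single linear layer in the spirit of the edge-popup algorithm of [31]. The straight-through backward pass, the score initialization, and the class interface are assumptions of ours, not a reproduction of the official PEMN code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMask(torch.autograd.Function):
    """h(s): 1 for the top-K% scores, 0 otherwise; straight-through backward."""

    @staticmethod
    def forward(ctx, scores, k):
        flat = scores.flatten()
        num_kept = max(1, int(k * flat.numel()))
        threshold = flat.topk(num_kept).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass gradients to the scores unchanged.
        return grad_output, None


class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, k=0.5):
        super().__init__()
        # Fixed random weights w_l (never updated).
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01,
                                   requires_grad=False)
        # Learnable per-weight scores s_l.
        self.scores = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.k = k

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.k)  # h(s_l)
        return F.linear(x, self.weight * mask)            # w_l elementwise-masked
```

During training, only `scores` receives gradients while the fixed weights $\tilde{w}$ never change, which is exactly what allows the layer to be stored as a prototype vector plus a binary mask.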