Given that trained weights have such great representative capacity, one may wonder about the potential of random and fixed weights: is it possible to achieve the same performance with random weights? If we consider a whole network, the answer is obviously negative, as a random network cannot provide informative and distinguishable outputs. However, picking a subnetwork from a random dense network makes it possible, since the feature mapping varies with the structure of the chosen subnetwork. The question then becomes: what is the representative potential of random and fixed weights when subnetwork structures can be selected? The pioneering work on the Lottery Ticket Hypothesis (LTH) [11] shows that winning tickets with good trainability exist in random networks, but they cannot be used directly without further training. Supermasks [44] enhance the winning tickets and enable them to be used directly. The recent work Popup [31] significantly improves the capacity of the subnetwork relative to its dense counterpart by learning the masks via backpropagation. Following this insightful perspective, we ask a further question: what is the maximum representative potential of a set of random weights? In our work, we first make a thorough exploration of this scientific question and propose our Parameter-Efficient Masking Networks (PEMN). Then, leveraging PEMN, we naturally introduce a new network compression paradigm that combines a set of fixed random weights with corresponding learned masks to represent the whole network.
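As a point of reference for how a mask can select a subnetwork from fixed random weights, below is a minimal PyTorch-style sketch in the spirit of edge-popup-style score learning [31]. The class name MaskedLinear, the keep_ratio argument, and the initialization scale are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer with frozen random weights; only per-weight scores are
    trained, and the top-scoring connections form the selected subnetwork."""

    def __init__(self, in_features, out_features, keep_ratio=0.5):
        super().__init__()
        # Fixed random weights: never updated during training.
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Learnable scores from which the binary mask is derived.
        self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.keep_ratio = keep_ratio

    def forward(self, x):
        k = max(1, int(self.scores.numel() * self.keep_ratio))
        threshold = torch.topk(self.scores.flatten(), k).values.min()
        hard_mask = (self.scores >= threshold).float()
        # Straight-through estimator: the forward pass uses the hard mask,
        # while gradients flow to the scores as if the mask were identity.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)
```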
We start from a recently popular network design style, i.e., building a small-scale encoding module and stacking it to obtain a deep neural network [9, 33, 32]. Based on this observation, we naturally propose the One-layer strategy, which uses one module as a prototype and copies its parameters into the other repeated structures. More generally, we further provide two variants, max-layer padding (MP) and random weight padding (RP), to handle diverse network structures. Specifically, MP chooses the layer with the largest number of parameters as the prototype and fills every other layer with the first portion of the prototype's parameters. RP further removes the constraint of the network architecture: it samples a random vector of a certain length as the prototype and copies it as many times as needed to fill all layers according to their sizes. RP is architecture-agnostic and can be seen as the most general strategy in our work. The three strategies move from specific to general and gradually reduce the number of unique parameters. We first employ these strategies to randomly initialize the network. Then, we learn different masks to explore the potential of the random weights and answer the scientific question above positively. Building on this, we propose a new network compression paradigm that uses a set of random weights with a set of masks to represent a network model instead of storing sparse weights for all layers (see Fig. 1). We conduct comprehensive experiments to explore the representative potential of random weights and to evaluate the model compression performance of our paradigm.
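To make the padding strategies concrete, the following sketch shows one possible way to realize random weight padding (RP): a single random prototype vector is repeated and truncated to fill every layer, after which the weights are frozen so that only masks are learned. The function name random_weight_padding and the prototype_len and seed arguments are hypothetical; this is a sketch of the idea rather than the paper's actual code.

```python
import torch
import torch.nn as nn

def random_weight_padding(model: nn.Module, prototype_len: int = 4096, seed: int = 0):
    """RP-style initialization: tile one random prototype vector to fill every
    layer, so the whole network contains at most `prototype_len` unique values."""
    g = torch.Generator().manual_seed(seed)
    prototype = 0.01 * torch.randn(prototype_len, generator=g)

    with torch.no_grad():
        for param in model.parameters():
            n = param.numel()
            repeats = -(-n // prototype_len)           # ceiling division
            filled = prototype.repeat(repeats)[:n]     # copy the prototype, then truncate
            param.copy_(filled.view_as(param))
            param.requires_grad_(False)                # weights stay fixed; only masks are trained
    return prototype
```

With the One-layer or MP strategies, the prototype would instead be the parameters of a chosen module or of the largest layer, but the same tiling logic applies.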
We summarize our contributions as follows:
• We scientifically explore the representative potential of fixed random weights with limited unique values and introduce our Parameter-Efficient Masking Networks (PEMN), which learns different masks to represent different feature mappings.
• A novel network compression paradigm is naturally proposed by fully utilizing the representative capacity of random weights. We represent and restore a network from a given random vector together with a set of masks instead of retaining all the sparse weights (see the reconstruction sketch after this list).
• Extensive experimental results explore the potential of random weights using our PEMN and evaluate the compression performance of our new paradigm. We hope our work can inspire more interesting explorations in this direction.
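As referenced in the second contribution above, the sketch below illustrates how a layer could be restored under this paradigm from only the shared random prototype and that layer's binary mask; the helper name rebuild_layer and the shapes used are purely illustrative.

```python
import torch

def rebuild_layer(prototype: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Reconstruct one layer's weights from the shared random prototype and the
    layer's binary mask (both treated as 1-D tensors for simplicity)."""
    n = mask.numel()
    repeats = -(-n // prototype.numel())      # ceiling division
    weights = prototype.repeat(repeats)[:n]   # RP-style tiling of the prototype
    return weights * mask                     # keep only the selected connections

# Hypothetical usage: the stored artifacts are just a seed, the prototype
# length, and one compact binary mask per layer.
g = torch.Generator().manual_seed(0)
prototype = 0.01 * torch.randn(1024, generator=g)
mask = (torch.rand(64 * 64) > 0.5).float()    # stand-in for a learned mask
layer_weight = rebuild_layer(prototype, mask).view(64, 64)
```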
2 Related Works
2.1 Sparse Network Training
Our work is related to sparse network training. Conventional pruning techniques finetune the pruned network from pretrained models [13, 14], with various pruning criteria for different applications [26, 15, 16, 38, 36, 40, 41]. Instead of pruning a pretrained model, pruning-at-initialization approaches [37] attempt to find winning tickets directly from random weights. Gradient information is used to build pruning criteria in [25, 35]. Different from the pruning methods above, sparse network training can also be conducted in a dynamic fashion. To name a few, Rigging the Lottery [10] edits the network connections and jointly updates the learnable weights. Dynamic Sparse Reparameterization [29] modifies the parameter budget across the whole network dynamically. Sparse Networks from Scratch [7] proposes a momentum-based approach to adaptively grow weights and empirically