
the existence of strong lottery tickets (SLTs), which assumes that we have to approximate the target parameters by pruning the sparse random source network. To the best of our knowledge, we are the first to provide experimental and theoretical evidence for the feasibility of this case.
Background, Notation, and Proof-Setup Let $\mathbf{x} = (x_1, x_2, \dots, x_d) \in [a_1, b_1]^d$ be a bounded $d$-dimensional input vector, where $a_1, b_1 \in \mathbb{R}$ with $a_1 < b_1$. $f: [a_1, b_1]^d \to \mathbb{R}^{n_L}$ is a fully-connected feed-forward neural network with architecture $(n_0, n_1, \dots, n_L)$, i.e., depth $L$ and $n_l$ neurons in Layer $l$. Every layer $l \in \{1, 2, \dots, L\}$ computes neuron states $\mathbf{x}^{(l)} = \phi(\mathbf{h}^{(l)})$, $\mathbf{h}^{(l)} = W^{(l)} \mathbf{x}^{(l-1)} + \mathbf{b}^{(l)}$. $\mathbf{h}^{(l)}$ is called the pre-activation, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix, and $\mathbf{b}^{(l)}$ is the bias vector. We also write $f(\mathbf{x}; \theta)$ to emphasize the dependence of the neural network on its parameters $\theta = (W^{(l)}, \mathbf{b}^{(l)})_{l=1}^{L}$. For simplicity, we restrict ourselves to the common ReLU activation function $\phi(x) = \max\{x, 0\}$, but most of our results can be easily extended to more general activation functions as in (Burkholz, 2022b;a). In addition to fully-connected layers, we also consider convolutional layers. For a convenient notation, without loss of generality, we flatten the weight tensors so that $W^{(l)}_T \in \mathbb{R}^{c_l \times c_{l-1} \times k_l}$, where $c_l$, $c_{l-1}$, and $k_l$ are the output channels, input channels, and filter dimension, respectively. For instance, a 2-dimensional convolution on image data would result in $k_l = k'_{1,l} k'_{2,l}$, where $k'_{1,l}, k'_{2,l}$ define the filter size.
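For concreteness, the layer computation above could be sketched as follows; this is a minimal illustration with function and variable names of our own choosing (not code from the paper), assuming NumPy arrays for the weights and biases.

```python
import numpy as np

def relu(h):
    # ReLU activation phi(x) = max{x, 0}
    return np.maximum(h, 0.0)

def forward(x, weights, biases):
    # weights[l-1]: W^(l) of shape (n_l, n_{l-1}); biases[l-1]: b^(l) of shape (n_l,)
    x_l = x
    for W, b in zip(weights, biases):
        h_l = W @ x_l + b   # pre-activation h^(l)
        x_l = relu(h_l)     # neuron states x^(l) = phi(h^(l))
    return x_l
```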
We distinguish three kinds of neural networks: a target network $f_T$, a source network $f_S$, and a subnetwork $f_P$ of $f_S$. $f_T$ is approximated or exactly represented by $f_P$, which is obtained by masking the parameters of the source $f_S$. $f_S$ is said to contain an SLT if this subnetwork does not require further training after obtaining the mask (by pruning). We assume that $f_T$ has depth $L$ and parameters $W^{(l)}_T$, $\mathbf{b}^{(l)}_T$, $n_{T,l}$, $m_{T,l}$, which denote the weight, bias, number of neurons, and number of nonzero parameters of the weight matrix in Layer $l \in \{1, 2, \dots, L\}$. Note that this implies $m_l \le n_l n_{l-1}$. Similarly, $f_S$ has depth $L+1$ with parameters $(W^{(l)}_S, \mathbf{b}^{(l)}_S, n_{S,l}, m_{S,l})_{l=0}^{L}$. Note that $l$ ranges from $0$ to $L$ for the source network, while it only ranges from $1$ to $L$ for the target network. The extra source network layer $l = 0$ accounts for an extra layer that we need in our construction to prove existence.
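Obtaining $f_P$ from $f_S$ amounts to an elementwise mask on the source parameters; a minimal sketch (hypothetical names, same NumPy setup as before; how the mask is constructed is precisely what the existence results address):

```python
import numpy as np

def masked_forward(x, weights, biases, masks):
    # f_P(x) = f_S(x; W * S): entries with mask value 0 are pruned and stay zero
    x_l = x
    for W, b, S in zip(weights, biases, masks):
        x_l = np.maximum((W * S) @ x_l + b, 0.0)  # elementwise mask, then ReLU
    return x_l
```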
ER Networks Even though common, the terminology 'random network' is imprecise with respect to the random distribution from which a graph is drawn. In line with general graph theory, we therefore use the term Erdős-Rényi (ER) (Erdős et al., 1960) network in the following. An ER neural network $f_{ER} \in ER(\mathbf{p})$ is characterized by layerwise sparsity ratios $p_l$. An ER source $f_{ER}$ is defined as a subnetwork of a complete source network using a binary mask $S^{(l)}_{ER} \in \{0,1\}^{n_l \times n_{l-1}}$ or $S^{(l)}_{ER} \in \{0,1\}^{n_l \times n_{l-1} \times k_l}$ for every layer. The mask entries are drawn from independent Bernoulli distributions with layerwise success probability $p_l > 0$, i.e., $s^{(l)}_{ij,ER} \sim \mathrm{Ber}(p_l)$. The random pruning is performed initially with negligible computational overhead and the mask stays fixed during training. Note that $p_l$ is also the expected density of that layer. The overall expected density of the network is given as $p = \frac{\sum_l m_l p_l}{\sum_k m_k} = 1 - \text{sparsity}$. In case of uniform sparsity, $p_l = p$, we also write $ER(p)$ instead of $ER(\mathbf{p})$. An ER network is defined as $f_{ER} = f_S(\mathbf{x}; W \cdot S_{ER})$. In contrast to conventional SLT existence proofs (Ramanujan et al., 2020), we refer to $f_{ER} \in ER(\mathbf{p})$ as the source network and show that the SLT is contained within this ER network. The SLT is then defined by the mask $S_P$, which is a subnetwork of $S_{ER}$, i.e., a zero entry $s_{ij,ER} = 0$ implies $s_{ij,P} = 0$, but the converse is not true. We skip the subscripts if the nature of the mask is clear from the context. In the following analysis of expressiveness in ER networks, we continue to use $S_{ER}$ and $S_P$ to denote a random ER source network and a sparse subnetwork within the ER network, respectively.
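To illustrate the definition, sampling layerwise Bernoulli masks and computing the resulting expected density could look as follows; a sketch with hypothetical function names, assuming NumPy and the flattened weight shapes introduced above.

```python
import numpy as np

def sample_er_masks(layer_shapes, p_layers, rng=None):
    # Draw S^(l)_ER with independent Bernoulli(p_l) entries for every layer
    rng = np.random.default_rng() if rng is None else rng
    return [(rng.random(shape) < p_l).astype(np.uint8)
            for shape, p_l in zip(layer_shapes, p_layers)]

def expected_density(layer_shapes, p_layers):
    # Overall expected density p = sum_l m_l p_l / sum_k m_k
    m = np.array([np.prod(shape) for shape in layer_shapes], dtype=float)
    return float(np.sum(m * np.asarray(p_layers)) / np.sum(m))
```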
Sparsity Ratios There are plenty of reasonable choices for the layerwise sparsity ratios and thus ER probabilities $p_l$. Our theory applies to all of them. The optimal choice for a given source network architecture depends on the target network and thus on the solution to a learning problem, which is usually unknown a priori in practice. To demonstrate that our theory holds for different approaches, we investigate the following layerwise sparsity ratios in experiments. The simplest baseline is a globally uniform choice $p_l = p$. Liu et al. (2021) have compared this choice in extensive experiments with their main proposal, ERK, which assigns $p_l \propto \frac{n_{in} + n_{out}}{n_{in} n_{out}}$ to a linear layer and $p_l \propto \frac{c_l + c_{l-1} + k_l}{c_l c_{l-1} k_l}$ (Mocanu et al., 2017) to a convolutional layer. In addition, we propose a pyramidal and a balanced approach, which are visualized in Appendix A.15.
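As an illustration of the ERK choice, the sketch below computes layerwise densities proportional to (sum of dimensions)/(product of dimensions) and rescales them to a global density $p$. The function name and the simplification of clipping at 1 without redistributing the excess are our own assumptions, not the reference implementation of Liu et al. (2021).

```python
import numpy as np

def erk_ratios(layer_shapes, p):
    # Raw ERK score per layer: (n_in + n_out) / (n_in * n_out) for linear shapes
    # (n_out, n_in), and (c_l + c_{l-1} + k_l) / (c_l * c_{l-1} * k_l) for
    # flattened convolutional shapes (c_l, c_{l-1}, k_l).
    scores = np.array([np.sum(shape) / np.prod(shape) for shape in layer_shapes])
    m = np.array([np.prod(shape) for shape in layer_shapes], dtype=float)
    # Common scale factor so that sum_l m_l p_l / sum_l m_l is approximately p
    eps = p * m.sum() / np.sum(m * scores)
    return np.minimum(1.0, eps * scores)
```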
Pyramidal: This method emulates a property of pruned networks that are obtained by IMP (Frankle & Carbin, 2019), i.e., the layer densities decay with increasing depth of the network. For a network of depth $L$, we use $p_l = (p_1)^l$ with $p_l \in (0,1)$ so that $\frac{\sum_{l=1}^{L} p_l m_l}{\sum_{l=1}^{L} m_l} = p$. Given the architecture, we use a polynomial equation solver (Harris et al., 2020) to obtain $p_1$ for the first layer such that $p_1 \in (0,1)$.
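This reduces to finding the root $p_1 \in (0,1)$ of the polynomial $\sum_{l=1}^{L} m_l p_1^l - p \sum_{l=1}^{L} m_l = 0$; a sketch of this step using NumPy's root finder (function names are ours, not the paper's):

```python
import numpy as np

def pyramidal_p1(m, p):
    # Root of m_L * x^L + ... + m_1 * x - p * sum(m) = 0 that lies in (0, 1);
    # m is the list of per-layer parameter counts m_1, ..., m_L.
    L = len(m)
    coeffs = [m[L - 1 - i] for i in range(L)] + [-p * sum(m)]
    roots = np.roots(coeffs)
    real = roots[np.isreal(roots)].real
    valid = real[(real > 0.0) & (real < 1.0)]
    if valid.size == 0:
        raise ValueError("no p1 in (0, 1) for this architecture and density")
    return float(valid[0])

def pyramidal_ratios(m, p):
    # Layerwise densities p_l = (p_1)^l for l = 1, ..., L
    p1 = pyramidal_p1(m, p)
    return [p1 ** l for l in range(1, len(m) + 1)]
```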
Balanced: The second layerwise sparsity method aims to maintain the same number of parameters in every layer for a given network sparsity $p$ and source network architecture. Each neuron has a similar in- and out-degree on average. Every layer has $x = \frac{p}{L} \sum_{l=1}^{L} m_l$ nonzero parameters. Such an ER network can be realized with $p_l = x / m_l$. In case $x \ge m_l$, we set $p_l = 1$.
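A sketch of the balanced assignment (again with hypothetical names; the cap at 1 follows the rule stated above, and the excess is not redistributed):

```python
import numpy as np

def balanced_ratios(m, p):
    # Every layer gets x = (p / L) * sum_l m_l nonzero parameters, so p_l = x / m_l,
    # capped at 1 whenever x >= m_l.
    m = np.asarray(m, dtype=float)
    x = p / len(m) * m.sum()
    return np.minimum(1.0, x / m)
```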