can be dropped iteratively.
Dynamic sparse training, on the other hand, trains a
sparse neural network from scratch and requires only a
fraction of the parameters of the large, dense model, and
therefore also only a fraction of the gradients, optimizer
variables, and FLOPs needed to process them, while achieving
comparable results. Since the set of connections is not known
upfront, however, a well-suited set of connections has to
be found while the weights are being optimized. Randomly
connected networks fall short in accuracy compared
to pruned models, especially at high sparsity levels. Dynamic
sparse training addresses this by training the connectivity
together with the parameters: at set intervals during training,
weights are removed using pruning-like criteria and new
weights are inserted using dedicated heuristics.
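To make the overall procedure concrete, the following Python sketch shows a generic drop-and-grow schedule for a single layer. The toy layer shape, learning rate, rewiring interval, and the random insertion step are illustrative placeholders, not the settings of any specific method.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_sparse_layer(shape, density):
    """Random dense weights masked down to the target density."""
    w = rng.normal(scale=0.1, size=shape)
    mask = (rng.random(shape) < density).astype(w.dtype)
    return w * mask, mask

def drop_and_grow(w, mask, k):
    """Drop the k active weights with the smallest magnitude and re-insert
    k connections elsewhere (randomly here, as a placeholder for the
    insertion heuristics discussed in the text)."""
    active = np.flatnonzero(mask)
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:k]]
    mask.ravel()[drop] = 0.0
    w.ravel()[drop] = 0.0
    inactive = np.flatnonzero(mask == 0)
    grow = rng.choice(inactive, size=k, replace=False)
    mask.ravel()[grow] = 1.0  # newly inserted connections start at zero weight
    return w, mask

# Toy schedule: masked gradient steps with a rewiring step every 100 iterations.
w, mask = init_sparse_layer((256, 128), density=0.1)
for step in range(1000):
    grad = rng.normal(size=w.shape)  # stand-in for a real backward pass
    w -= 0.01 * grad * mask          # only active connections are updated
    if step % 100 == 0:
        w, mask = drop_and_grow(w, mask, k=int(0.1 * mask.sum()))
```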
Finding the best weights to remove and insert is difficult:
it often either causes significant overhead, which negates
the gains of dynamic sparse training, or fails to achieve
results close to already highly optimized pruning techniques.
A common limitation of most approaches is that weights are
only redistributed locally, on a per-layer basis. The target
sparsity of each layer is chosen heuristically at network
initialization, for example by making larger layers sparser
and distributing the culled weights to smaller layers,
which likely need them more. However, not all layers of a
network require the same density, even if their parameter
tensors have the same shape. This problem becomes more
apparent at very high sparsity levels (>90%),
where small differences in the density of a layer can make a
large difference in the accuracy of the network.
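As an illustration of such an initialization-time heuristic, the sketch below assigns each layer a density that shrinks with layer size and then rescales the densities to meet a global parameter budget. The particular scaling rule is a stand-in chosen for illustration, not the exact allocation used by any of the methods discussed.

```python
import numpy as np

def heuristic_layer_densities(shapes, global_density):
    """Assign each layer a density proportional to
    (fan_in + fan_out) / (fan_in * fan_out), so larger layers end up sparser,
    then rescale so the total number of active weights matches the budget."""
    sizes = np.array([np.prod(s) for s in shapes], dtype=float)
    raw = np.array([(s[0] + s[1]) / (s[0] * s[1]) for s in shapes])
    budget = global_density * sizes.sum()           # total active weights
    densities = raw * budget / (raw * sizes).sum()  # sum(d_i * n_i) == budget
    return np.clip(densities, 0.0, 1.0)             # clipping may shift the budget slightly

# Example: three layers of different shapes at 10% overall density.
# Small layers receive a much higher density than large ones.
print(heuristic_layer_densities([(784, 300), (300, 300), (300, 10)], 0.10))
```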
We propose Global Gradient-based Redistribution (GGR),
a method that dynamically redistributes weights across layers
during training by inserting weights into the layers with the
largest gradients; a sketch of this redistribution step follows
the list of contributions below. Our main contributions are:
• We propose a global gradient-based weight redistribution technique that adjusts layer densities dynamically during training.
• We combine existing dynamic sparse training approaches to achieve the most robust and consistent method across a variety of configurations.
• We present ablation studies on various global weight redistribution schemes, comparing our proposed approach to current state-of-the-art metrics.
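The following sketch illustrates the core idea of the global redistribution step: gradient magnitudes of inactive connections are pooled across all layers, and the insertion budget is given to the globally highest-scoring candidates, so per-layer densities can shift during training. The function below is a simplified illustration of this idea, not the complete algorithm.

```python
import numpy as np

def global_gradient_growth(masks, grads, n_grow):
    """Activate the n_grow inactive connections with the globally largest
    gradient magnitude, pooled across all layers (masks and grads are dicts
    of same-shaped arrays keyed by layer name). Illustrative sketch only."""
    scores, owners, positions = [], [], []
    for name, mask in masks.items():
        inactive = np.flatnonzero(mask == 0)
        scores.append(np.abs(grads[name].ravel()[inactive]))
        owners.extend([name] * len(inactive))
        positions.append(inactive)
    scores = np.concatenate(scores)
    positions = np.concatenate(positions)
    for idx in np.argsort(scores)[-n_grow:]:  # globally largest gradients
        masks[owners[idx]].ravel()[positions[idx]] = 1.0
    return masks
```

Selecting insertions from a single global pool is what distinguishes this scheme from per-layer growth: a layer whose inactive connections consistently receive large gradients gradually becomes denser, at the expense of layers whose candidates score poorly.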
2. Related Work
Researchers have proposed many ways to reduce the overpa-
rameterization of neural networks. Typically, a network is
first trained densely until convergence before starting an
iterative pruning process. Pruning removes the weights with
the lowest impact on the loss (e.g., the weights with the
smallest magnitude), and the remaining weights are briefly
retrained to compensate for the loss of parameters (Han et al.,
2016b; Molchanov et al., 2017; Zhu & Gupta, 2018). Training a
small, dense network with fewer neurons instead, however,
does not reach the same accuracy as pruning a converged
network.
The lottery ticket hypothesis suggests that when a working
subnetwork is found (e.g., with pruning), the same subnet-
work can achieve the same accuracy when trained from
scratch (Frankle & Carbin, 2019). According to the authors
of the lottery ticket hypothesis, the advantage of overpa-
rameterization is that during training, the network can search
for working subnetworks among a large number of possible
candidates. Training such a found subnetwork, the winning
ticket, to the same accuracy is difficult, however, and often
requires initializing the subnetwork with weights extracted
from a dense network trained for thousands of optimizer
steps (Zhou et al., 2019). Nevertheless, the hypothesis shows
that at least simpler architectures can be trained as a sparse
neural network from scratch. The challenge remains to find
a combination of connections that achieves high accuracy
without having to train the network densely first to extract
the winning ticket.
The concept of training a randomly initialized sparse neural
network from scratch was introduced in 2018 with two
different approaches. Deep Rewiring (DeepR) (Bellec et al.,
2018) proposes to drop weights that would flip their sign dur-
ing the optimizer update step. The same number of weights
is then inserted randomly in the same layer and initialized
with a random sign. Sparse Evolutionary Training (SET)
(Mocanu et al., 2018a), on the other hand, makes use of
a simple and well-established weight removal approach from
pruning research. Following the intuition that weights with
a smaller magnitude contribute less to the output of a neuron,
SET drops a fixed number of weights with the lowest mag-
nitude at predefined intervals and inserts the same number
randomly. RigL (Evci et al., 2020) improved upon the criteria
introduced in SET: while it also removes the weights with
the lowest magnitude, it inserts the set of weights with the
highest gradient magnitudes instead of selecting them randomly.
This way, the connections that could have the greatest impact
on the result of the current mini-batch are inserted first.
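For concreteness, the sketch below shows one such per-layer drop/grow step in the style of SET and RigL as described above: the smallest-magnitude active weights are removed and, in the RigL-style variant shown here, the inactive positions with the largest gradient magnitudes are activated in their place. It follows the textual description only and omits details such as update schedules and how the dense gradients are obtained.

```python
import numpy as np

def rigl_style_update(w, mask, grad, k):
    """One per-layer drop/grow step: remove the k smallest-magnitude active
    weights, then activate the k inactive positions whose gradients have the
    largest magnitude."""
    active = np.flatnonzero(mask)
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:k]]
    mask.ravel()[drop] = 0.0
    w.ravel()[drop] = 0.0

    inactive = np.flatnonzero(mask == 0)
    grow = inactive[np.argsort(np.abs(grad.ravel()[inactive]))[-k:]]
    mask.ravel()[grow] = 1.0  # newly grown weights start at zero here
    return w, mask
```

Replacing the gradient-based selection of `grow` with a random draw from `inactive` recovers a SET-style update; the drop criterion is the same in both cases.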
The common limitation of all three approaches is that they
use a predefined density per layer. This has the advantage
that the network can be designed to meet a FLOP target
at inference time. However, for deep neural networks, selecting
the right density for each layer is difficult, and performance is
likely left on the table.
Dynamic sparse reparameterization (DSR) (Mostafa &