subspace that intersects the linearly connected sublevel set containing the current matching IMP
solution. IMP fails to find matching solutions when this information is not present. (Section 3.1)
• We show that, at higher sparsities, networks trained from rewind points that yield matching solu-
tions are not only stable to SGD noise (Frankle et al., 2020), but also exhibit a stronger form of
robustness to perturbations: any two networks separated by a distance equal to that of successive
pruned rewind points (adjacent circles in Fig. 1) will train to the same linearly connected mode.
This answers our second question of how SGD extracts information from the mask: two pruned
rewind points at successive sparsity levels can navigate back to the same linearly connected mode,
yielding matching solutions (adjacent triangles in Fig. 1), because of this robustness of SGD. (Section 3.2)
• We develop an approximate theory, based on Larsen et al. (2022), that determines how aggres-
sively one can prune a trained solution (triangle in Fig. 1) in one iteration in terms of the length
of the pruning projection (distance to the associated × in Fig. 1) and the pruning ratio. This theory
reveals that the Hessian eigenspectrum of the solution governs the maximal pruning ratio at that
level; flatter error landscapes allow more aggressive pruning. Our theory quantitatively matches
the behavior of random pruning, while we show that magnitude pruning can slightly outperform it. (Section 3.3)
• We explain the superior performance of magnitude pruning by showing that this method identifies
flatter directions in the error landscape compared with random pruning, thereby allowing more
aggressive pruning per iteration. Together, these results provide new quantitative connections
between error landscape geometry and feasible IMP hyperparameters. They also answer our
third question of why we need iterative pruning: one-shot pruning to high sparsities is prohibited
by the sharpness of the error landscape. (Section 3.3)
• We show that a fundamental role of retraining is to reequilibrate the weights of the network,
i.e. find networks with new small weights amenable to further pruning. Successful retraining
strategies such as weight and learning rate (LR) rewinding (Renda et al., 2020) both do this, while
finetuning (FT), an unsuccessful retraining strategy, does not. This answers our fourth question
regarding why retraining allows further magnitude pruning. (Section 3.4)
Overall, our results provide significant geometric insights into IMP, including when this method
breaks down and why. We hope this understanding will lead to new pruning strategies that match the
performance of IMP but with fewer resources. Indeed, a simple adaptive rule for choosing the prun-
ing ratio at each level leads to nontrivial improvements (Appendix F). Furthermore, understanding
the role of weight reequilibration in retraining may inspire methods to achieve this property faster.
2 PROBLEM SETUP, NOTATION, AND DEFINITIONS
Let $w \in \mathbb{R}^D$ be the weights of a network and let $\mathcal{E}(w)$ be its test error on a classification task.
Definition 2.1 (Linear Connectivity). Two weights $w, w'$ are $\varepsilon$-linearly connected if $\forall \gamma \in [0, 1]$,
$$\mathcal{E}(\gamma \cdot w + (1 - \gamma) \cdot w') \;\le\; \gamma \cdot \mathcal{E}(w) + (1 - \gamma) \cdot \mathcal{E}(w') + \varepsilon. \tag{2.1}$$
We define the error barrier between $w$ and $w'$ as the smallest $\varepsilon$ for which this is true. We say two
weights are linearly mode connected if the error barrier between them is small (i.e., less than the
standard deviation across training runs).
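In practice, the error barrier can be estimated by evaluating $\mathcal{E}$ on a discrete grid of $\gamma$ values. The following is a minimal sketch, assuming a hypothetical `test_error` callable that returns $\mathcal{E}(w)$ for a flat weight vector; neither the function name nor the grid resolution is prescribed by the paper.

```python
import numpy as np

def error_barrier(w, w_prime, test_error, num_points=21):
    """Estimate the smallest eps in Eq. (2.1): the largest amount by which the
    error along the linear path between w and w' exceeds the linear
    interpolation of the endpoint errors."""
    e_w, e_wp = test_error(w), test_error(w_prime)
    barrier = 0.0
    for gamma in np.linspace(0.0, 1.0, num_points):
        e_path = test_error(gamma * w + (1.0 - gamma) * w_prime)
        e_chord = gamma * e_w + (1.0 - gamma) * e_wp
        barrier = max(barrier, e_path - e_chord)
    return barrier
```

Under this sketch, two weights would be declared linearly mode connected when `error_barrier` falls below the run-to-run standard deviation of $\mathcal{E}$.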
Sparse subnetworks. Given a dense network with weights $w$, a sparse subnetwork has weights
$m \odot w$, where $m \in \{0, 1\}^D$ is a binary mask and $\odot$ is the element-wise product. The sparsity of a
mask $m$ is the fraction of zeros, $(1 - \eta) \in [0, 1]$. Such a mask also defines an $\eta D$-dimensional axial
subspace spanned by the coordinate axis vectors associated with weights not zeroed out by the mask.
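This bookkeeping can be made concrete in a few lines of numpy; the helper names below are illustrative rather than taken from the paper, and weights are treated as flat vectors.

```python
import numpy as np

def apply_mask(m, w):
    """Subnetwork weights m ⊙ w (element-wise product)."""
    return m * w

def sparsity(m):
    """Fraction of zeroed weights, i.e. 1 - eta."""
    return 1.0 - float(m.mean())

def axial_subspace_axes(m):
    """Indices of the coordinate axes spanning the eta*D-dimensional axial
    subspace, i.e. the weights the mask keeps."""
    return np.flatnonzero(m)
```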
Notation for training. For an iteration $t$, weights $w$, and number of steps $t'$, let $\mathcal{A}_t(w, t')$ be the
output of training the weights $w$ for $t'$ steps starting with the algorithm state (e.g., learning rate
schedule) at time $t$. In this notation, if $w_0$ denotes the randomly initialized weights, then ordinary
training for $T$ steps produces the final weights $w_T = \mathcal{A}_0(w_0, T)$.
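As a sketch of how this operator might look in code, assuming hypothetical `train_steps` and `algorithm_state_at` helpers that are placeholders rather than part of the paper's implementation:

```python
def A(t, w, t_prime, train_steps, algorithm_state_at):
    """Train weights w for t_prime steps, resuming the training algorithm's
    state (e.g. learning-rate schedule position) from iteration t."""
    state = algorithm_state_at(t)  # optimizer / LR-schedule state at step t
    return train_steps(w, state, t_prime)

# Ordinary training of randomly initialized weights w_0 for T steps:
#   w_T = A(0, w_0, T, train_steps, algorithm_state_at)
```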
Iterative magnitude pruning (IMP). IMP with Weight Rewinding (IMP-WR) is described in Al-
gorithm 1 (Frankle et al., 2020). Each pruning iteration is called a pruning level, and $m^{(L)}$ is the
mask obtained after $L$ levels of pruning. $\tau$ denotes the rewind step, $w_\tau$ the rewind point, and $1 - \alpha$
denotes a fixed pruning ratio, i.e. the fraction of weights removed. The algorithm is depicted schemat-
ically in Fig. 1. The axial subspace associated with mask $m^{(L)}$ is a colored subspace, the pruned
rewind point $m^{(L)} \odot w_\tau$ is the circle in this subspace, the level $L$ solution $w^{(L)}$ obtained from