Unmasking the Lottery Ticket Hypothesis:
What’s Encoded in a Winning Ticket’s Mask?
Mansheej Paul1,2, Feng Chen1, Brett W. Larsen1,
Jonathan Frankle3,4, Surya Ganguli1,2, and Gintare Karolina Dziugaite5,6
1Stanford, 2Meta AI, 3MosaicML, 4Harvard, 5Google Research, Brain Team, 6Mila; McGill
Equal contribution. Correspondence to: mansheej@stanford.edu; gkdz@google.com.
ABSTRACT
Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that require less compute and memory but can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state of the art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking a fraction of smallest magnitude weights, rewinding unmasked weights back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed, i.e. why can't we prune to very high sparsities in one shot? We develop answers to these questions in terms of the geometry of the error landscape. First, we find that—at higher sparsities—pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey to the rewind point the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. This analysis yields a new quantitative link between IMP performance and the Hessian eigenspectrum. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune. Overall, these results make progress toward demystifying the existence of winning tickets by revealing the fundamental role of error landscape geometry in the algorithms used to find them.
1 INTRODUCTION
Recent empirical advances in deep learning are rooted in massively scaling both the size of networks and the amount of data they are trained on (Kaplan et al., 2020; Hoffmann et al., 2022). This scale, however, comes at considerable resource costs, leading to immense interest in pruning models (Blalock et al., 2020) and datasets (Paul et al., 2021; Sorscher et al., 2022), or both (Paul et al., 2022). For example, when training or deploying machine learning systems on edge devices, memory footprints and computational demands must remain small. This motivates the search for sparse trainable networks that could potentially work within these limitations.

However, finding highly sparse, trainable networks is challenging. A state of the art—albeit computationally intensive—algorithm for finding such networks is iterative magnitude pruning (IMP) (Frankle et al., 2020). For example, on a ResNet-50 trained on ImageNet, IMP can reduce the number of weights by an order of magnitude without losing any test accuracy (Fig. 11). IMP works by starting with a dense network that is usually pretrained for a very short amount of time. The weights of this starting network are called the rewind point.
IMP then repeatedly (1) trains this network to convergence; (2) prunes the trained network by computing a mask that zeros out a fraction (typically about 20%) of the smallest magnitude weights; (3) rewinds the nonzero weights back to their values at the rewind point, and then commences the next iteration by training the masked network to convergence. Each successive iteration yields a mask with higher sparsity. The final mask applied to the rewind point constitutes a highly sparse trainable subnetwork called a winning ticket if it trains to the same accuracy as the full network, i.e. is matching. See Appendix A for an extended discussion of related work.

Figure 1: Error landscape of IMP. At iteration L, IMP trains the network from a pruned rewind point (circles), on an α^L D-dimensional axial subspace (colored planes), to a level L pruned solution (triangles). The smallest (1 − α) fraction of weights are then pruned, yielding the level L projection (×s), whose zero weights form a sparsity mask corresponding to an α^(L+1) D-dimensional axial subspace. This mask, when applied to the rewind point, defines the level L + 1 initialization. Thus IMP moves through a sequence of nested axial subspaces of increasing sparsity. We find that when IMP finds a sequence of matching pruned solutions (triangles), there is no error barrier on the piecewise linear path between them. Thus the key information contained in an IMP mask is the identity of an axial subspace that intersects the connected matching sublevel set containing a well-performing overparameterized network.
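To make step (2) above concrete, here is a minimal PyTorch-style sketch of the global magnitude-pruning step. This is our illustration, not the authors' released code; the function name, the flattened-weight representation, and the 20% default are assumptions.

```python
import torch

def magnitude_prune_mask(weights: torch.Tensor, old_mask: torch.Tensor,
                         prune_frac: float = 0.2) -> torch.Tensor:
    """Zero out the smallest-magnitude weights that survive the current mask.

    `weights`: flattened vector of trained weights.
    `old_mask`: current binary mask (1 = kept, 0 = already pruned).
    `prune_frac`: fraction of the *remaining* weights to remove per iteration.
    """
    surviving = old_mask.bool()
    magnitudes = weights[surviving].abs()
    k = max(1, int(prune_frac * magnitudes.numel()))   # number of weights to drop
    threshold = torch.kthvalue(magnitudes, k).values   # k-th smallest surviving magnitude
    new_mask = old_mask.clone()
    new_mask[surviving & (weights.abs() <= threshold)] = 0.0
    return new_mask
```

Whether pruning is applied globally across layers or per layer, and how ties at the threshold are broken, are implementation details this sketch does not pin down.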
Although IMP produces highly sparse matching networks, it is extremely resource intensive. Moreover, the principles underlying when and how IMP finds winning tickets remain quite mysterious.
For these reasons, the goal of this work is to develop a scientific understanding of the principles and
mechanisms governing the success or failure of IMP. By identifying such mechanisms, we hope to
facilitate the design of improved network pruning algorithms.
The operation of IMP raises four foundational puzzles. First, the mask that determines which
weights to prune is identified based on small magnitude weights at the end of training. However, at
the next iteration, this mask is applied to the rewind point found early in training. Precisely what
information from the end of training does the mask provide to the rewind point? Second, how does
SGD starting from the masked rewind point extract and use this information? The mask indeed
provides actionable information beyond that stored in the network weights at the rewind point—
using a random mask or even pruning the smallest magnitude weights at this point leads to higher
error after training (Frankle et al., 2021). Third, why are we forced to prune only a small fraction
of weights at each iteration? Training and then pruning a large fraction of weights in one shot does
not perform as well as iteratively pruning a small fraction and then retraining (Frankle & Carbin,
2018). Why does pruning a larger fraction in one iteration destroy the actionable information in
the mask? Fourth, why does retraining allow us to prune more weights? A variant of IMP that
uses a different retraining strategy (learning rate rewinding) also successfully identifies matching
subnetworks while another variant (finetuning) fails (Renda et al., 2020). What differentiates a
successful retraining strategy from an unsuccessful one?
Understanding IMP through error landscape geometry. In this work, we provide insights into
these questions by isolating important aspects of the error landscape geometry that govern IMP
performance (Fig. 1). We do so through extensive empirical investigations on a range of benchmark
datasets (CIFAR-10, CIFAR-100, and ImageNet) and modern network architectures (ResNet-20,
ResNet-18, and ResNet-50). Our contributions are as follows:
We find that, at higher sparsities, successive matching sparse solutions at level L and L + 1 (adjacent pairs of triangles in Fig. 1) are linearly mode connected (there is no error barrier on the line connecting them in weight space). However, the dense solution and the sparsest matching solution may not be linearly mode connected. (Section 3.1)
The transition from matching to nonmatching sparsity coincides with the failure of this successive-level linear mode connectivity. Together, these answer our first question of what information from the end of training is conveyed by the mask to the rewind point: it is the identity of a sparser axial subspace that intersects the linearly connected sublevel set containing the current matching IMP solution. IMP fails to find matching solutions when this information is not present. (Section 3.1)
We show that, at higher sparsities, networks trained from rewind points that yield matching solutions are not only stable to SGD noise (Frankle et al., 2020), but also exhibit a stronger form of robustness to perturbations: any two networks separated by a distance equal to that of successive pruned rewind points (adjacent circles in Fig. 1) will train to the same linearly connected mode. This answers our second question of how SGD extracts information from the mask: two pruned rewind points at successive sparsity levels can navigate back to the same linearly connected mode, yielding matching solutions (adjacent triangles in Fig. 1), because of SGD robustness. (Section 3.2)
We develop an approximate theory, based on Larsen et al. (2022), that determines how aggressively one can prune a trained solution (triangle in Fig. 1) in one iteration, in terms of the length of the pruning projection (distance to the associated × in Fig. 1) and the pruning ratio. This theory reveals that the Hessian eigenspectrum of the solution governs the maximal pruning ratio at that level; flatter error landscapes allow more aggressive pruning (a quadratic-approximation sketch follows after this list). Our theory quantitatively matches random pruning, while we show that magnitude pruning can slightly outperform it. (Section 3.3)
We explain the superior performance of magnitude pruning by showing that this method identifies flatter directions in the error landscape compared with random pruning, thereby allowing more aggressive pruning per iteration. Together, these results provide new quantitative connections between error landscape geometry and feasible IMP hyperparameters. They also answer our third question of why we need iterative pruning: one-shot pruning to high sparsities is prohibited by the sharpness of the error landscape. (Section 3.3)
We show that a fundamental role of retraining is to re-equilibrate the weights of the network, i.e. find networks with new small weights amenable to further pruning. Successful retraining strategies such as weight and learning rate (LR) rewinding (Renda et al., 2020) both do this, while finetuning (FT), an unsuccessful retraining strategy, does not. This answers our fourth question regarding why retraining allows further magnitude pruning. (Section 3.4)
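As referenced in the one-shot-pruning contribution above, the following sketch illustrates, under a standard second-order Taylor assumption (not the exact bound of Larsen et al. (2022)), how the Hessian at a trained solution controls the error increase caused by a pruning projection δ = m ⊙ w − w. The function name is ours, and `loss_fn` is assumed to map a flat parameter vector to the loss.

```python
import torch

def pruning_error_increase(loss_fn, w: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Quadratic estimate of E(m ⊙ w) − E(w) via a Hessian-vector product.

    E(w + δ) ≈ E(w) + g·δ + 0.5 δᵀHδ with δ = m ⊙ w − w. At a well-trained
    solution the gradient term is near zero, so the curvature term δᵀHδ
    dominates: flatter directions (smaller curvature) tolerate larger
    one-shot pruning projections.
    """
    delta = (mask * w - w).detach()                     # pruning projection
    w = w.detach().requires_grad_(True)
    loss = loss_fn(w)
    grad = torch.autograd.grad(loss, w, create_graph=True)[0]
    hvp = torch.autograd.grad(grad @ delta, w)[0]       # H δ without forming H explicitly
    return grad.detach() @ delta + 0.5 * (delta @ hvp)
```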
Overall, our results provide significant geometric insights into IMP, including when this method
breaks down and why. We hope this understanding will lead to new pruning strategies that match the
performance of IMP but with fewer resources. Indeed, a simple adaptive rule for choosing the pruning ratio at each level leads to nontrivial improvements (Appendix F). Furthermore, understanding the role of weight re-equilibration in retraining may inspire methods to achieve this property faster.
2 PROBLEM SETUP, NOTATION, AND DEFINITIONS
Let w ∈ R^D be the weights of a network and let E(w) be its test error on a classification task.
Definition 2.1 (Linear Connectivity). Two weights w, w′ are ε-linearly connected if, for all γ ∈ [0, 1],
E(γ·w + (1 − γ)·w′) ≤ γ·E(w) + (1 − γ)·E(w′) + ε.   (2.1)
We define the error barrier between w and w′ as the smallest ε for which this is true. We say two weights are linearly mode connected if the error barrier between them is small (i.e., less than the standard deviation across training runs).
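As a concrete illustration of Definition 2.1, the sketch below (ours; `error_fn` is an assumed callable mapping a flat weight vector to test error) evaluates the error along the segment between two weight vectors on a discrete grid of γ and returns the resulting error barrier.

```python
import torch

def error_barrier(error_fn, w: torch.Tensor, w_prime: torch.Tensor,
                  num_points: int = 21) -> float:
    """Nonnegative error barrier between w and w_prime on a grid of γ ∈ [0, 1].

    Returns the largest excess of the path error above the linear interpolation
    of the endpoint errors; the two weights are "linearly mode connected" if
    this value is below the run-to-run standard deviation of dense training.
    """
    e_w, e_wp = error_fn(w), error_fn(w_prime)
    barrier = 0.0
    for gamma in torch.linspace(0.0, 1.0, num_points):
        e_mid = error_fn(gamma * w + (1.0 - gamma) * w_prime)
        chord = gamma * e_w + (1.0 - gamma) * e_wp      # interpolated endpoint error
        barrier = max(barrier, float(e_mid - chord))    # excess error above the chord
    return barrier
```

For networks with batch normalization, running statistics are commonly recomputed at each interpolated point before evaluating the error, a practical detail this sketch omits.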
Sparse subnetworks. Given a dense network with weights w, a sparse subnetwork has weights m ⊙ w, where m ∈ {0, 1}^D is a binary mask and ⊙ is the element-wise product. The sparsity of a mask m is the fraction of zeros, (1 − η) ∈ [0, 1]. Such a mask also defines an ηD-dimensional axial subspace spanned by the coordinate axis vectors associated with weights not zeroed out by the mask.
Notation for training. For an iteration t, weights w, and number of steps t′, let A_t(w, t′) be the output of training the weights w for t′ steps, starting with the algorithm state (e.g. learning rate schedule) at time t. In this notation, if w_0 denotes the randomly initialized weights, then ordinary training for T steps produces the final weights w_T = A_0(w_0, T).
Iterative magnitude pruning (IMP). IMP with Weight Rewinding (IMP-WR) is described in Algorithm 1 (Frankle et al., 2020). Each pruning iteration is called a pruning level, and m(L) is the mask obtained after L levels of pruning. τ denotes the rewind step, w_τ the rewind point, and 1 − α denotes a fixed pruning ratio, i.e. the fraction of weights removed. The algorithm is depicted schematically in Fig. 1. The axial subspace associated with mask m(L) is a colored subspace, the pruned rewind point m(L) ⊙ w_τ is the circle in this subspace, the level L solution w(L) obtained from training is the triangle also in this subspace, and the level L projection obtained as m(L+1) ⊙ w(L) is the cross in the next, lower-dimensional axial subspace. Note w(L) = A_τ(m(L) ⊙ w_τ, T − τ).

Algorithm 1: Iterative Magnitude Pruning-Weight Rewinding (IMP-WR) (Frankle et al., 2020)
1: Initialize a dense network w_0 ∈ R^d and a pruning mask m(0) = 1^d.
2: Train w_0 for τ steps to w_τ. (Phase 1: Pre-Training)
3: for L ∈ {0, ..., L_max − 1} do (Phase 2: Mask Search)
4:   Train the pruned network m(L) ⊙ w_τ to obtain a level L solution w(L).
5:   Prune a fraction 1 − α of the smallest magnitude nonzero weights after training:
     let m(L+1)[i] = 0 if weight i is pruned, otherwise let m(L+1)[i] = m(L)[i].
6: Train the final network m(L_max) ⊙ w_τ. Measure its accuracy. (Phase 3: Sparse Training)
We also study two variants of IMP with different retraining strategies in Section 3.4: IMP with LR rewinding (IMP-LRR) and IMP with finetuning (IMP-FT) (Renda et al., 2020). In IMP-LRR, the level L − 1 solution, and not the rewind point, is used as the initialization for level L retraining, and the entire LR schedule is repeated: w(L) = A_0(m(L) ⊙ w(L−1), T). IMP-FT is similar to IMP-LRR, but instead of repeating the entire LR schedule, we continue training at the final low LR for the same number of steps: w(L) = A_T(m(L) ⊙ w(L−1), T). See Appendix A for a discussion.
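The three retraining strategies can be summarized in one sketch. This is our schematic rendering of Algorithm 1 and its variants, not the authors' code: `train(w, start_step, num_steps)` stands in for A_t(w, t′), `magnitude_prune_mask` for the pruning step in line 5 (e.g. the sketch given earlier), and the handling of level 0 for LRR/FT is simplified.

```python
import torch

def imp(train, magnitude_prune_mask, w0, tau, T, alpha, L_max, variant="WR"):
    """IMP with weight rewinding ("WR"), LR rewinding ("LRR"), or finetuning ("FT").

    train(w, start_step, num_steps) plays the role of A_t(w, t'): it resumes the
    training algorithm state (e.g. the LR schedule) at `start_step` and trains
    the masked weights for `num_steps` steps.
    """
    w_tau = train(w0, 0, tau)                          # Phase 1: pre-train to the rewind point w_tau
    mask = torch.ones_like(w0)                         # m(0): all-ones mask
    w_prev = None
    for L in range(L_max):                             # Phase 2: mask search
        if variant == "WR" or w_prev is None:
            w_L = train(mask * w_tau, tau, T - tau)    # rewind weights and LR schedule to step tau
        elif variant == "LRR":
            w_L = train(mask * w_prev, 0, T)           # keep weights, replay the full LR schedule
        else:  # "FT"
            w_L = train(mask * w_prev, T, T)           # keep weights, continue at the final low LR
        mask = magnitude_prune_mask(w_L, mask, prune_frac=1 - alpha)  # prune fraction 1 - alpha
        w_prev = w_L
    return train(mask * w_tau, tau, T - tau), mask     # Phase 3: train the final sparse network
```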
Definition 2.2. A sparse network m ⊙ w is ε-matching (in error) if it achieves accuracy within ε of that of a trained dense network w_T: E(m ⊙ w) ≤ E(w_T) + ε.
Since we always set ε as the standard deviation of the error of independently trained dense networks, we drop the ε from our notation and simply use matching.
Definition 2.3. A (sparse or dense) network w_τ is stable (at iteration τ) if, with high probability, training a pair of networks from initialization w_τ using the same algorithm but with different randomness (e.g., different minibatch order, data augmentation, random seed, etc.) produces final weights that are linearly mode connected. The first iteration τ at which a network is stable is called the onset of linear mode connectivity (LMC).
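A sketch of the stability test of Definition 2.3, reusing the `error_barrier` helper sketched above; the `seed` argument to `train` and the noise threshold are assumptions of this illustration.

```python
def is_stable(train, error_fn, w_tau, tau, T, run_to_run_std, seeds=(0, 1)) -> bool:
    """Train two copies from the same weights w_tau with different SGD randomness
    (minibatch order, augmentation, etc.) and check whether the resulting
    solutions are linearly mode connected."""
    w_a = train(w_tau, tau, T - tau, seed=seeds[0])
    w_b = train(w_tau, tau, T - tau, seed=seeds[1])
    return error_barrier(error_fn, w_a, w_b) < run_to_run_std
```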
Frankle et al. (2020) provide evidence that sparse subnetworks are matching if and only if they are
stable. Finally, playing a central role in our work is the concept of an LCS-set.
Definition 2.4. An ε-linearly connected sublevel set (LCS-set) of a network w is the set of all weights w′ that achieve the same error as w up to ε, i.e. E(w′) ≤ E(w) + ε, and are linearly mode connected to w, i.e. there are no error barriers on the line connecting w and w′ in weight space.
Note that an LCS-set is star convex; not all pairs of points in the set are linearly mode connected. As with matching, we drop ε from the notation for LCS-set.
3 RESULTS
In the next 4 subsections, we provide geometric insights into the 4 main questions raised above.
3.1 PRUNING MASKS IDENTIFY AXIAL SUBSPACES THAT INTERSECT MATCHING LCS-SETS.
First, we elucidate what useful information the mask found at the end of training at level L provides to the rewind point at level L + 1. We find that when an iteration of IMP from level L to L + 1 finds a matching subnetwork, the axial subspace m(L+1) obtained by pruning the level L solution, w(L), intersects the LCS-set of this solution. By the definition of LCS-set, all the points in this intersection are matching solutions in the sparser m(L+1) subspace and are linearly connected to w(L). We also find that the network w(L+1) found by SGD is in fact one of these solutions. Conversely, when IMP from level L to L + 1 does not find a matching subnetwork, the solution w(L+1) does not lie in the LCS-set of w(L), suggesting that the axial subspace m(L+1) does not intersect this set. Thus, we hypothesize that a round of IMP finds a matching subnetwork if and only if the sparse axial subspace found by pruning intersects the LCS-set of the current matching solution.
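The measurement behind Fig. 2 can be sketched as follows (our illustration, not the authors' code): given the sequence of IMP solutions w(0), ..., w(Lmax), compute the error barrier between every pair of levels; near-zero barriers between successive matching levels are the signature described above, while the (0, Lmax) entry need not be small.

```python
def barrier_matrix(error_fn, solutions):
    """Pairwise error barriers between IMP solutions at all pruning levels.

    `solutions` is the list [w(0), ..., w(Lmax)]; entry (i, j) is the error
    barrier on the segment between the level-i and level-j solutions, computed
    with the `error_barrier` helper sketched in Section 2.
    """
    n = len(solutions)
    return [[error_barrier(error_fn, solutions[i], solutions[j]) for j in range(n)]
            for i in range(n)]
```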
Figure 2: Error Connectivity of IMP. Left: The error along a piecewise linear path connecting each level L solution w(L) to the level L + 1 solution w(L+1), for 3 different rewind steps. Purple curves: rewind step 0; orange curves: early rewind steps (250 for CIFAR-10 and 1250 for ImageNet); green curves: rewind steps that produce matching subnetworks (2000 for CIFAR-10 and 5000 for ImageNet). Middle: Zoomed-in plot on the green curve shows that networks on the piecewise linear path between matching IMP solutions are also matching. Right column: The maximal error barrier between all pairs of solutions at different levels (from the matching rewind step). The dark blue regions indicate solutions in a matching linearly connected mode. See Fig. 12 for CIFAR-100/ResNet-18 results.

Figure 3: Two-dimensional slices of the error landscape spanned by 3 points at each level L: (1) the solution w(L) (grey triangle), (2) its level L projection m(L+1) ⊙ w(L) (orange ×) onto the axial subspace m(L+1) (purple dashed line), and (3) the level L + 1 solution w(L+1) (orange triangle) found by retraining with the new mask m(L+1). The axial subspace m(L+1) is obtained by 20% magnitude pruning on w(L). The dotted black contour outlines the LCS-set of w(L). Column 2 shows a low sparsity level where the projection remains within the LCS-set; column 3 shows a higher sparsity level where the projection is outside, but w(L+1) returns to the LCS-set. Column 4 shows a higher sparsity level, at which IMP fails to find a matching solution: both the projection and the retrained solution lie outside the LCS-set. See Figs. 13 and 14 for additional results.

Figs. 2 and 3 present evidence for this hypothesis. The left and center columns of Fig. 2 show that in a ResNet-50 (ResNet-20) trained on ImageNet (CIFAR-10), for rewind steps at initialization (blue curve) or early in training (red curve), successive IMP solutions w(L) and w(L+1) are neither matching nor linearly mode connected. However, at a later rewind point (green curve), successive matching solutions are linearly mode connected. Fig. 3 visualizes two-dimensional slices of the error landscape containing the level L solution, its pruning projection, and the level L + 1 solution. We find that at early pruning levels, the projected network, m(L+1) ⊙ w(L), remains in the LCS-set of w(L). Thus the m(L+1) axial subspace intersects this set. As L increases, the projections leave the LCS-set of w(L), which also shrinks in size. However, the axial subspace m(L+1) still intersects the LCS-set of w(L), since w(L+1) lies in this set. Conversely, at the sparsity level when matching breaks down, the axial subspace no longer intersects the LCS-set.

In summary, when IMP succeeds, i.e. w(L) and w(L+1) are both matching, the mask m(L+1) defines an axial subspace that intersects the LCS-set of w(L). When IMP fails, this intersection also