Unmasking the Lottery Ticket Hypothesis:
What’s Encoded in a Winning Ticket’s Mask?
Mansheej Paul1,2, Feng Chen1, Brett W. Larsen1,
Jonathan Frankle3,4, Surya Ganguli1,2, and Gintare Karolina Dziugaite5,6
1Stanford, 2Meta AI, 3MosaicML, 4Harvard, 5Google Research, Brain Team, 6Mila; McGill
Equal contribution. Correspondence to: mansheej@stanford.edu; gkdz@google.com.
ABSTRACT
Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that require less compute and memory but can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state of the art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking a fraction of smallest magnitude weights, rewinding unmasked weights back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed, i.e. why can't we prune to very high sparsities in one shot? We develop answers to these questions in terms of the geometry of the error landscape. First, we find that—at higher sparsities—pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey to the rewind point the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. This analysis yields a new quantitative link between IMP performance and the Hessian eigenspectrum. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune. Overall, these results make progress toward demystifying the existence of winning tickets by revealing the fundamental role of error landscape geometry in the algorithms used to find them.
1 INTRODUCTION
Recent empirical advances in deep learning are rooted in massively scaling both the size of networks and the amount of data they are trained on (Kaplan et al., 2020; Hoffmann et al., 2022). This scale, however, comes at considerable resource costs, leading to immense interest in pruning models (Blalock et al., 2020) and datasets (Paul et al., 2021; Sorscher et al., 2022), or both (Paul et al., 2022). For example, when training or deploying machine learning systems on edge devices, memory footprints and computational demands must remain small. This motivates the search for sparse trainable networks that could potentially work within these limitations.

However, finding highly sparse, trainable networks is challenging. A state of the art—albeit computationally intensive—algorithm for finding such networks is iterative magnitude pruning (IMP) (Frankle et al., 2020). For example, on a ResNet-50 trained on ImageNet, IMP can reduce the number of weights by an order of magnitude without losing any test accuracy (Fig. 11). IMP works by starting with a dense network that is usually pretrained for a very short amount of time. The weights of this starting network are called the rewind point.
IMP then repeatedly (1) trains this network to convergence; (2) prunes the trained network by computing a mask that zeros out a fraction (typically about 20%) of the smallest magnitude weights; (3) rewinds the nonzero weights back to their values at the rewind point, and then commences the next iteration by training the masked network to convergence. Each successive iteration yields a mask with higher sparsity. The final mask applied to the rewind point constitutes a highly sparse trainable subnetwork called a winning ticket if it trains to the same accuracy as the full network, i.e. is matching. See Appendix A for an extended discussion of related work.

Figure 1: Error landscape of IMP. At iteration L, IMP trains the network from a pruned rewind point (circles), on an α^L D-dimensional axial subspace (colored planes), to a level L pruned solution (triangles). The smallest (1 − α) fraction of weights are then pruned, yielding the level L projection (×s), whose zero weights form a sparsity mask corresponding to an α^(L+1) D-dimensional axial subspace. This mask, when applied to the rewind point, defines the level L + 1 initialization. Thus IMP moves through a sequence of nested axial subspaces of increasing sparsity. We find that when IMP finds a sequence of matching pruned solutions (triangles), there is no error barrier on the piecewise linear path between them. Thus the key information contained in an IMP mask is the identity of an axial subspace that intersects the connected matching sublevel set containing a well-performing overparameterized network.
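To make step (2) above concrete, here is a minimal PyTorch-style sketch of the global magnitude-pruning step. This is our illustration, not the authors' released code; the function name, the flattened-weight representation, and the 20% default are assumptions.

```python
import torch

def magnitude_prune_mask(weights: torch.Tensor, old_mask: torch.Tensor,
                         prune_frac: float = 0.2) -> torch.Tensor:
    """Zero out the smallest-magnitude weights that survive the current mask.

    `weights`: flattened vector of trained weights.
    `old_mask`: current binary mask (1 = kept, 0 = already pruned).
    `prune_frac`: fraction of the *remaining* weights to remove per iteration.
    """
    surviving = old_mask.bool()
    magnitudes = weights[surviving].abs()
    k = max(1, int(prune_frac * magnitudes.numel()))   # number of weights to drop
    threshold = torch.kthvalue(magnitudes, k).values   # k-th smallest surviving magnitude
    new_mask = old_mask.clone()
    new_mask[surviving & (weights.abs() <= threshold)] = 0.0
    return new_mask
```

Whether pruning is applied globally across layers or per layer, and how ties at the threshold are broken, are implementation details this sketch does not pin down.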
Although IMP produces highly sparse matching networks, it is extremely resource intensive. Moreover, the principles underlying when and how IMP finds winning tickets remain quite mysterious.
For these reasons, the goal of this work is to develop a scientific understanding of the principles and
mechanisms governing the success or failure of IMP. By identifying such mechanisms, we hope to
facilitate the design of improved network pruning algorithms.
The operation of IMP raises four foundational puzzles. First, the mask that determines which
weights to prune is identified based on small magnitude weights at the end of training. However, at
the next iteration, this mask is applied to the rewind point found early in training. Precisely what
information from the end of training does the mask provide to the rewind point? Second, how does
SGD starting from the masked rewind point extract and use this information? The mask indeed
provides actionable information beyond that stored in the network weights at the rewind point—
using a random mask or even pruning the smallest magnitude weights at this point leads to higher
error after training (Frankle et al., 2021). Third, why are we forced to prune only a small fraction
of weights at each iteration? Training and then pruning a large fraction of weights in one shot does
not perform as well as iteratively pruning a small fraction and then retraining (Frankle & Carbin,
2018). Why does pruning a larger fraction in one iteration destroy the actionable information in
the mask? Fourth, why does retraining allow us to prune more weights? A variant of IMP that
uses a different retraining strategy (learning rate rewinding) also successfully identifies matching
subnetworks while another variant (finetuning) fails (Renda et al., 2020). What differentiates a
successful retraining strategy from an unsuccessful one?
Understanding IMP through error landscape geometry. In this work, we provide insights into
these questions by isolating important aspects of the error landscape geometry that govern IMP
performance (Fig. 1). We do so through extensive empirical investigations on a range of benchmark
datasets (CIFAR-10, CIFAR-100, and ImageNet) and modern network architectures (ResNet-20,
ResNet-18, and ResNet-50). Our contributions are as follows:
We find that, at higher sparsities, successive matching sparse solutions at level L and L + 1 (adjacent pairs of triangles in Fig. 1) are linearly mode connected (there is no error barrier on the line connecting them in weight space). However, the dense solution and the sparsest matching solution may not be linearly mode connected. (Section 3.1)
The transition from matching to nonmatching sparsity coincides with the failure of this successive-level linear mode connectivity. Together, these answer our first question of what information from the end of training is conveyed by the mask to the rewind point: it is the identity of a sparser axial subspace that intersects the linearly connected sublevel set containing the current matching IMP solution. IMP fails to find matching solutions when this information is not present. (Section 3.1)
We show that, at higher sparsities, networks trained from rewind points that yield matching solutions are not only stable to SGD noise (Frankle et al., 2020), but also exhibit a stronger form of robustness to perturbations: any two networks separated by a distance equal to that of successive pruned rewind points (adjacent circles in Fig. 1) will train to the same linearly connected mode. This answers our second question of how SGD extracts information from the mask: two pruned rewind points at successive sparsity levels can navigate back to the same linearly connected mode, yielding matching solutions (adjacent triangles in Fig. 1), because of SGD robustness. (Section 3.2)
We develop an approximate theory, based on Larsen et al. (2022), that determines how aggressively one can prune a trained solution (triangle in Fig. 1) in one iteration, in terms of the length of the pruning projection (distance to the associated × in Fig. 1) and the pruning ratio. This theory reveals that the Hessian eigenspectrum of the solution governs the maximal pruning ratio at that level; flatter error landscapes allow more aggressive pruning (a quadratic-approximation sketch follows after this list). Our theory quantitatively matches random pruning, while we show that magnitude pruning can slightly outperform it. (Section 3.3)
We explain the superior performance of magnitude pruning by showing that this method identifies flatter directions in the error landscape compared with random pruning, thereby allowing more aggressive pruning per iteration. Together, these results provide new quantitative connections between error landscape geometry and feasible IMP hyperparameters. They also answer our third question of why we need iterative pruning: one-shot pruning to high sparsities is prohibited by the sharpness of the error landscape. (Section 3.3)
We show that a fundamental role of retraining is to re-equilibrate the weights of the network, i.e. find networks with new small weights amenable to further pruning. Successful retraining strategies such as weight and learning rate (LR) rewinding (Renda et al., 2020) both do this, while finetuning (FT), an unsuccessful retraining strategy, does not. This answers our fourth question regarding why retraining allows further magnitude pruning. (Section 3.4)
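As referenced in the one-shot-pruning contribution above, the following sketch illustrates, under a standard second-order Taylor assumption (not the exact bound of Larsen et al. (2022)), how the Hessian at a trained solution controls the error increase caused by a pruning projection δ = m ⊙ w − w. The function name is ours, and `loss_fn` is assumed to map a flat parameter vector to the loss.

```python
import torch

def pruning_error_increase(loss_fn, w: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Quadratic estimate of E(m ⊙ w) − E(w) via a Hessian-vector product.

    E(w + δ) ≈ E(w) + g·δ + 0.5 δᵀHδ with δ = m ⊙ w − w. At a well-trained
    solution the gradient term is near zero, so the curvature term δᵀHδ
    dominates: flatter directions (smaller curvature) tolerate larger
    one-shot pruning projections.
    """
    delta = (mask * w - w).detach()                     # pruning projection
    w = w.detach().requires_grad_(True)
    loss = loss_fn(w)
    grad = torch.autograd.grad(loss, w, create_graph=True)[0]
    hvp = torch.autograd.grad(grad @ delta, w)[0]       # H δ without forming H explicitly
    return grad.detach() @ delta + 0.5 * (delta @ hvp)
```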
Overall, our results provide significant geometric insights into IMP, including when this method
breaks down and why. We hope this understanding will lead to new pruning strategies that match the
performance of IMP but with fewer resources. Indeed, a simple adaptive rule for choosing the pruning ratio at each level leads to nontrivial improvements (Appendix F). Furthermore, understanding the role of weight re-equilibration in retraining may inspire methods to achieve this property faster.
2 PROBLEM SETUP, NOTATION, AND DEFINITIONS
Let w ∈ R^D be the weights of a network and let E(w) be its test error on a classification task.
Definition 2.1 (Linear Connectivity). Two weights w, w′ are ε-linearly connected if, for all γ ∈ [0, 1],
E(γ·w + (1 − γ)·w′) ≤ γ·E(w) + (1 − γ)·E(w′) + ε.   (2.1)
We define the error barrier between w and w′ as the smallest ε for which this is true. We say two weights are linearly mode connected if the error barrier between them is small (i.e., less than the standard deviation across training runs).
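As a concrete illustration of Definition 2.1, the sketch below (ours; `error_fn` is an assumed callable mapping a flat weight vector to test error) evaluates the error along the segment between two weight vectors on a discrete grid of γ and returns the resulting error barrier.

```python
import torch

def error_barrier(error_fn, w: torch.Tensor, w_prime: torch.Tensor,
                  num_points: int = 21) -> float:
    """Nonnegative error barrier between w and w_prime on a grid of γ ∈ [0, 1].

    Returns the largest excess of the path error above the linear interpolation
    of the endpoint errors; the two weights are "linearly mode connected" if
    this value is below the run-to-run standard deviation of dense training.
    """
    e_w, e_wp = error_fn(w), error_fn(w_prime)
    barrier = 0.0
    for gamma in torch.linspace(0.0, 1.0, num_points):
        e_mid = error_fn(gamma * w + (1.0 - gamma) * w_prime)
        chord = gamma * e_w + (1.0 - gamma) * e_wp      # interpolated endpoint error
        barrier = max(barrier, float(e_mid - chord))    # excess error above the chord
    return barrier
```

For networks with batch normalization, running statistics are commonly recomputed at each interpolated point before evaluating the error, a practical detail this sketch omits.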
Sparse subnetworks. Given a dense network with weights w, a sparse subnetwork has weights m ⊙ w, where m ∈ {0, 1}^D is a binary mask and ⊙ is the element-wise product. The sparsity of a mask m is the fraction of zeros, (1 − η) ∈ [0, 1]. Such a mask also defines an ηD-dimensional axial subspace spanned by the coordinate axis vectors associated with weights not zeroed out by the mask.
Notation for training. For an iteration t, weights w, and number of steps t′, let A_t(w, t′) be the output of training the weights w for t′ steps, starting with the algorithm state (e.g. learning rate schedule) at time t. In this notation, if w_0 denotes the randomly initialized weights, then ordinary training for T steps produces the final weights w_T = A_0(w_0, T).
Iterative magnitude pruning (IMP). IMP with Weight Rewinding (IMP-WR) is described in Algorithm 1 (Frankle et al., 2020). Each pruning iteration is called a pruning level, and m(L) is the mask obtained after L levels of pruning. τ denotes the rewind step, w_τ the rewind point, and 1 − α denotes a fixed pruning ratio, i.e. the fraction of weights removed. The algorithm is depicted schematically in Fig. 1. The axial subspace associated with mask m(L) is a colored subspace, the pruned rewind point m(L) ⊙ w_τ is the circle in this subspace, the level L solution w(L) obtained from training is the triangle also in this subspace, and the level L projection obtained as m(L+1) ⊙ w(L) is the cross in the next, lower-dimensional axial subspace. Note w(L) = A_τ(m(L) ⊙ w_τ, T − τ).

Algorithm 1: Iterative Magnitude Pruning-Weight Rewinding (IMP-WR) (Frankle et al., 2020)
1: Initialize a dense network w_0 ∈ R^d and a pruning mask m(0) = 1^d.
2: Train w_0 for τ steps to w_τ. (Phase 1: Pre-Training)
3: for L ∈ {0, ..., L_max − 1} do (Phase 2: Mask Search)
4:   Train the pruned network m(L) ⊙ w_τ to obtain a level L solution w(L).
5:   Prune a fraction 1 − α of the smallest magnitude nonzero weights after training:
     let m(L+1)[i] = 0 if weight i is pruned, otherwise let m(L+1)[i] = m(L)[i].
6: Train the final network m(L_max) ⊙ w_τ. Measure its accuracy. (Phase 3: Sparse Training)
We also study two variants of IMP with different retraining strategies in Section 3.4: IMP with LR rewinding (IMP-LRR) and IMP with finetuning (IMP-FT) (Renda et al., 2020). In IMP-LRR, the level L − 1 solution, and not the rewind point, is used as the initialization for level L retraining, and the entire LR schedule is repeated: w(L) = A_0(m(L) ⊙ w(L−1), T). IMP-FT is similar to IMP-LRR, but instead of repeating the entire LR schedule, we continue training at the final low LR for the same number of steps: w(L) = A_T(m(L) ⊙ w(L−1), T). See Appendix A for a discussion.
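The three retraining strategies can be summarized in one sketch. This is our schematic rendering of Algorithm 1 and its variants, not the authors' code: `train(w, start_step, num_steps)` stands in for A_t(w, t′), `magnitude_prune_mask` for the pruning step in line 5 (e.g. the sketch given earlier), and the handling of level 0 for LRR/FT is simplified.

```python
import torch

def imp(train, magnitude_prune_mask, w0, tau, T, alpha, L_max, variant="WR"):
    """IMP with weight rewinding ("WR"), LR rewinding ("LRR"), or finetuning ("FT").

    train(w, start_step, num_steps) plays the role of A_t(w, t'): it resumes the
    training algorithm state (e.g. the LR schedule) at `start_step` and trains
    the masked weights for `num_steps` steps.
    """
    w_tau = train(w0, 0, tau)                          # Phase 1: pre-train to the rewind point w_tau
    mask = torch.ones_like(w0)                         # m(0): all-ones mask
    w_prev = None
    for L in range(L_max):                             # Phase 2: mask search
        if variant == "WR" or w_prev is None:
            w_L = train(mask * w_tau, tau, T - tau)    # rewind weights and LR schedule to step tau
        elif variant == "LRR":
            w_L = train(mask * w_prev, 0, T)           # keep weights, replay the full LR schedule
        else:  # "FT"
            w_L = train(mask * w_prev, T, T)           # keep weights, continue at the final low LR
        mask = magnitude_prune_mask(w_L, mask, prune_frac=1 - alpha)  # prune fraction 1 - alpha
        w_prev = w_L
    return train(mask * w_tau, tau, T - tau), mask     # Phase 3: train the final sparse network
```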
Definition 2.2. A sparse network m ⊙ w is ε-matching (in error) if it achieves accuracy within ε of that of a trained dense network w_T: E(m ⊙ w) ≤ E(w_T) + ε.
Since we always set ε as the standard deviation of the error of independently trained dense networks, we drop the ε from our notation and simply use matching.
Definition 2.3. A (sparse or dense) network w_τ is stable (at iteration τ) if, with high probability, training a pair of networks from initialization w_τ using the same algorithm but with different randomness (e.g., different minibatch order, data augmentation, random seed, etc.) produces final weights that are linearly mode connected. The first iteration τ at which a network is stable is called the onset of linear mode connectivity (LMC).
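A sketch of the stability test of Definition 2.3, reusing the `error_barrier` helper sketched above; the `seed` argument to `train` and the noise threshold are assumptions of this illustration.

```python
def is_stable(train, error_fn, w_tau, tau, T, run_to_run_std, seeds=(0, 1)) -> bool:
    """Train two copies from the same weights w_tau with different SGD randomness
    (minibatch order, augmentation, etc.) and check whether the resulting
    solutions are linearly mode connected."""
    w_a = train(w_tau, tau, T - tau, seed=seeds[0])
    w_b = train(w_tau, tau, T - tau, seed=seeds[1])
    return error_barrier(error_fn, w_a, w_b) < run_to_run_std
```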
Frankle et al. (2020) provide evidence that sparse subnetworks are matching if and only if they are
stable. Finally, playing a central role in our work is the concept of an LCS-set.
Definition 2.4. An ε-linearly connected sublevel set (LCS-set) of a network w is the set of all weights w′ that achieve the same error as w up to ε, i.e. E(w′) ≤ E(w) + ε, and are linearly mode connected to w, i.e. there are no error barriers on the line connecting w and w′ in weight space.
Note that an LCS-set is star convex; not all pairs of points in the set are linearly mode connected. As with matching, we drop ε from the notation for LCS-set.
3 RESULTS
In the next 4 subsections, we provide geometric insights into the 4 main questions raised above.
3.1 PRUNING MASKS IDENTIFY AXIAL SUBSPACES THAT INTERSECT MATCHING LCS-SETS.
First, we elucidate what useful information the mask found at the end of training at level L provides to the rewind point at level L + 1. We find that when an iteration of IMP from level L to L + 1 finds a matching subnetwork, the axial subspace m(L+1) obtained by pruning the level L solution, w(L), intersects the LCS-set of this solution. By the definition of LCS-set, all the points in this intersection are matching solutions in the sparser m(L+1) subspace and are linearly connected to w(L). We also find that the network w(L+1) found by SGD is in fact one of these solutions. Conversely, when IMP from level L to L + 1 does not find a matching subnetwork, the solution w(L+1) does not lie in the LCS-set of w(L), suggesting that the axial subspace m(L+1) does not intersect this set. Thus, we hypothesize that a round of IMP finds a matching subnetwork if and only if the sparse axial subspace found by pruning intersects the LCS-set of the current matching solution.
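The measurement behind Fig. 2 can be sketched as follows (our illustration, not the authors' code): given the sequence of IMP solutions w(0), ..., w(Lmax), compute the error barrier between every pair of levels; near-zero barriers between successive matching levels are the signature described above, while the (0, Lmax) entry need not be small.

```python
def barrier_matrix(error_fn, solutions):
    """Pairwise error barriers between IMP solutions at all pruning levels.

    `solutions` is the list [w(0), ..., w(Lmax)]; entry (i, j) is the error
    barrier on the segment between the level-i and level-j solutions, computed
    with the `error_barrier` helper sketched in Section 2.
    """
    n = len(solutions)
    return [[error_barrier(error_fn, solutions[i], solutions[j]) for j in range(n)]
            for i in range(n)]
```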
Figure 2: Error Connectivity of IMP. Left: The error along a piecewise linear path connecting each level L solution w(L) to the level L + 1 solution w(L+1), for 3 different rewind steps. Purple curves: rewind step 0; orange curves: early rewind steps (250 for CIFAR-10 and 1250 for ImageNet); green curves: rewind steps that produce matching subnetworks (2000 for CIFAR-10 and 5000 for ImageNet). Middle: Zoomed-in plot on the green curve shows that networks on the piecewise linear path between matching IMP solutions are also matching. Right column: The maximal error barrier between all pairs of solutions at different levels (from the matching rewind step). The dark blue regions indicate solutions in a matching linearly connected mode. See Fig. 12 for CIFAR-100/ResNet-18 results.

Figure 3: Two-dimensional slices of the error landscape spanned by 3 points at each level L: (1) the solution w(L) (grey triangle), (2) its level L projection m(L+1) ⊙ w(L) (orange ×) onto the axial subspace m(L+1) (purple dashed line), and (3) the level L + 1 solution w(L+1) (orange triangle) found by retraining with the new mask m(L+1). The axial subspace m(L+1) is obtained by 20% magnitude pruning on w(L). The dotted black contour outlines the LCS-set of w(L). Column 2 shows a low sparsity level where the projection remains within the LCS-set; column 3 shows a higher sparsity level where the projection is outside, but w(L+1) returns to the LCS-set. Column 4 shows a higher sparsity level, at which IMP fails to find a matching solution: both the projection and the retrained solution lie outside the LCS-set. See Figs. 13 and 14 for additional results.

Figs. 2 and 3 present evidence for this hypothesis. The left and center columns of Fig. 2 show that in a ResNet-50 (ResNet-20) trained on ImageNet (CIFAR-10), for rewind steps at initialization (blue curve) or early in training (red curve), successive IMP solutions w(L) and w(L+1) are neither matching nor linearly mode connected. However, at a later rewind point (green curve), successive matching solutions are linearly mode connected. Fig. 3 visualizes two-dimensional slices of the error landscape containing the level L solution, its pruning projection, and the level L + 1 solution. We find that at early pruning levels, the projected network, m(L+1) ⊙ w(L), remains in the LCS-set of w(L). Thus the m(L+1) axial subspace intersects this set. As L increases, the projections leave the LCS-set of w(L), which also shrinks in size. However, the axial subspace m(L+1) still intersects the LCS-set of w(L), since w(L+1) lies in this set. Conversely, at the sparsity level when matching breaks down, the axial subspace no longer intersects the LCS-set.

In summary, when IMP succeeds, i.e. w(L) and w(L+1) are both matching, the mask m(L+1) defines an axial subspace that intersects the LCS-set of w(L). When IMP fails, this intersection also