
the existence of strong lottery tickets (SLTs), which assumes that we have to approximate the target parameters by pruning the sparse random source network. To the best of our knowledge, we are the first to provide experimental and theoretical evidence for the feasibility of this case.
Background, Notation, and Proof-Setup Let $\mathbf{x} = (x_1, x_2, \dots, x_d) \in [a_1, b_1]^d$ be a bounded $d$-dimensional input vector, where $a_1, b_1 \in \mathbb{R}$ with $a_1 < b_1$. $f: [a_1, b_1]^d \to \mathbb{R}^{n_L}$ is a fully-connected feed-forward neural network with architecture $(n_0, n_1, \dots, n_L)$, i.e., depth $L$ and $n_l$ neurons in Layer $l$. Every layer $l \in \{1, 2, \dots, L\}$ computes neuron states $\mathbf{x}^{(l)} = \phi(\mathbf{h}^{(l)})$, $\mathbf{h}^{(l)} = W^{(l)} \mathbf{x}^{(l-1)} + \mathbf{b}^{(l)}$. $\mathbf{h}^{(l)}$ is called the pre-activation, $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix, and $\mathbf{b}^{(l)}$ is the bias vector. We also write $f(\mathbf{x}; \theta)$ to emphasize the dependence of the neural network on its parameters $\theta = (W^{(l)}, \mathbf{b}^{(l)})_{l=1}^{L}$. For simplicity, we restrict ourselves to the common ReLU activation function $\phi(x) = \max\{x, 0\}$, but most of our results can be easily extended to more general activation functions as in (Burkholz, 2022b;a). In addition to fully-connected layers, we also consider convolutional layers. For a convenient notation, without loss of generality, we flatten the weight tensors so that $W^{(l)}_T \in \mathbb{R}^{c_l \times c_{l-1} \times k_l}$, where $c_l$, $c_{l-1}$, and $k_l$ are the output channels, input channels, and filter dimension, respectively. For instance, a 2-dimensional convolution on image data would result in $k_l = k'_{1,l} k'_{2,l}$, where $k'_{1,l}, k'_{2,l}$ define the filter size.
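For concreteness, the layer computation above could be sketched as follows; this is a minimal illustration with function and variable names of our own choosing (not code from the paper), assuming NumPy arrays for the weights and biases.

```python
import numpy as np

def relu(h):
    # ReLU activation phi(x) = max{x, 0}
    return np.maximum(h, 0.0)

def forward(x, weights, biases):
    # weights[l-1]: W^(l) of shape (n_l, n_{l-1}); biases[l-1]: b^(l) of shape (n_l,)
    x_l = x
    for W, b in zip(weights, biases):
        h_l = W @ x_l + b   # pre-activation h^(l)
        x_l = relu(h_l)     # neuron states x^(l) = phi(h^(l))
    return x_l
```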
We distinguish three kinds of neural networks: a target network $f_T$, a source network $f_S$, and a subnetwork $f_P$ of $f_S$. $f_T$ is approximated or exactly represented by $f_P$, which is obtained by masking the parameters of the source $f_S$. $f_S$ is said to contain an SLT if this subnetwork does not require further training after obtaining the mask (by pruning). We assume that $f_T$ has depth $L$ and parameters $W^{(l)}_T$, $\mathbf{b}^{(l)}_T$, $n_{T,l}$, $m_{T,l}$, which denote the weight, bias, number of neurons, and number of nonzero parameters of the weight matrix in Layer $l \in \{1, 2, \dots, L\}$. Note that this implies $m_l \le n_l n_{l-1}$. Similarly, $f_S$ has depth $L+1$ with parameters $(W^{(l)}_S, \mathbf{b}^{(l)}_S, n_{S,l}, m_{S,l})_{l=0}^{L}$. Note that $l$ ranges from $0$ to $L$ for the source network, while it only ranges from $1$ to $L$ for the target network. The extra source network layer $l = 0$ accounts for an extra layer that we need in our construction to prove existence.
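Obtaining $f_P$ from $f_S$ amounts to an elementwise mask on the source parameters; a minimal sketch (hypothetical names, same NumPy setup as before; how the mask is constructed is precisely what the existence results address):

```python
import numpy as np

def masked_forward(x, weights, biases, masks):
    # f_P(x) = f_S(x; W * S): entries with mask value 0 are pruned and stay zero
    x_l = x
    for W, b, S in zip(weights, biases, masks):
        x_l = np.maximum((W * S) @ x_l + b, 0.0)  # elementwise mask, then ReLU
    return x_l
```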
ER Networks Even though common, the terminology 'random network' is imprecise with respect to the random distribution from which a graph is drawn. In line with general graph theory, we therefore use the term Erdős-Rényi (ER) (Erdős et al., 1960) network in the following. An ER neural network $f_{ER} \in ER(\mathbf{p})$ is characterized by layerwise sparsity ratios $p_l$. An ER source $f_{ER}$ is defined as a subnetwork of a complete source network using a binary mask $S^{(l)}_{ER} \in \{0,1\}^{n_l \times n_{l-1}}$ or $S^{(l)}_{ER} \in \{0,1\}^{n_l \times n_{l-1} \times k_l}$ for every layer. The mask entries are drawn from independent Bernoulli distributions with layerwise success probability $p_l > 0$, i.e., $s^{(l)}_{ij,ER} \sim \mathrm{Ber}(p_l)$. The random pruning is performed initially with negligible computational overhead and the mask stays fixed during training. Note that $p_l$ is also the expected density of that layer. The overall expected density of the network is given as $p = \frac{\sum_l m_l p_l}{\sum_k m_k} = 1 - \text{sparsity}$. In case of uniform sparsity, $p_l = p$, we also write $ER(p)$ instead of $ER(\mathbf{p})$. An ER network is defined as $f_{ER} = f_S(\mathbf{x}; W \cdot S_{ER})$. In contrast to conventional SLT existence proofs (Ramanujan et al., 2020), we refer to $f_{ER} \in ER(\mathbf{p})$ as the source network and show that the SLT is contained within this ER network. The SLT is then defined by the mask $S_P$, which is a subnetwork of $S_{ER}$, i.e., a zero entry $s_{ij,ER} = 0$ implies $s_{ij,P} = 0$, but the converse is not true. We skip the subscripts if the nature of the mask is clear from the context. In the following analysis of expressiveness in ER networks, we continue to use $S_{ER}$ and $S_P$ to denote a random ER source network and a sparse subnetwork within the ER network, respectively.
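To illustrate the definition, sampling layerwise Bernoulli masks and computing the resulting expected density could look as follows; a sketch with hypothetical function names, assuming NumPy and the flattened weight shapes introduced above.

```python
import numpy as np

def sample_er_masks(layer_shapes, p_layers, rng=None):
    # Draw S^(l)_ER with independent Bernoulli(p_l) entries for every layer
    rng = np.random.default_rng() if rng is None else rng
    return [(rng.random(shape) < p_l).astype(np.uint8)
            for shape, p_l in zip(layer_shapes, p_layers)]

def expected_density(layer_shapes, p_layers):
    # Overall expected density p = sum_l m_l p_l / sum_k m_k
    m = np.array([np.prod(shape) for shape in layer_shapes], dtype=float)
    return float(np.sum(m * np.asarray(p_layers)) / np.sum(m))
```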
Sparsity Ratios There are plenty of reasonable choices for the layerwise sparsity ratios and thus ER probabilities $p_l$. Our theory applies to all of them. The optimal choice for a given source network architecture depends on the target network and thus on the solution to a learning problem, which is usually unknown a priori in practice. To demonstrate that our theory holds for different approaches, we investigate the following layerwise sparsity ratios in experiments. The simplest baseline is a globally uniform choice $p_l = p$. Liu et al. (2021) have compared this choice in extensive experiments with their main proposal, ERK, which assigns $p_l \propto \frac{n_{in} + n_{out}}{n_{in} n_{out}}$ to a linear layer and $p_l \propto \frac{c_l + c_{l-1} + k_l}{c_l c_{l-1} k_l}$ (Mocanu et al., 2017) to a convolutional layer. In addition, we propose a pyramidal and a balanced approach, which are visualized in Appendix A.15.
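As an illustration of the ERK choice, the sketch below computes layerwise densities proportional to (sum of dimensions)/(product of dimensions) and rescales them to a global density $p$. The function name and the simplification of clipping at 1 without redistributing the excess are our own assumptions, not the reference implementation of Liu et al. (2021).

```python
import numpy as np

def erk_ratios(layer_shapes, p):
    # Raw ERK score per layer: (n_in + n_out) / (n_in * n_out) for linear shapes
    # (n_out, n_in), and (c_l + c_{l-1} + k_l) / (c_l * c_{l-1} * k_l) for
    # flattened convolutional shapes (c_l, c_{l-1}, k_l).
    scores = np.array([np.sum(shape) / np.prod(shape) for shape in layer_shapes])
    m = np.array([np.prod(shape) for shape in layer_shapes], dtype=float)
    # Common scale factor so that sum_l m_l p_l / sum_l m_l is approximately p
    eps = p * m.sum() / np.sum(m * scores)
    return np.minimum(1.0, eps * scores)
```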
Pyramidal: This method emulates a property of pruned networks that are obtained by IMP (Frankle & Carbin, 2019), i.e., the layer densities decay with increasing depth of the network. For a network of depth $L$, we use $p_l = (p_1)^l$ with $p_l \in (0,1)$ so that $\frac{\sum_{l=1}^{L} p_l m_l}{\sum_{l=1}^{L} m_l} = p$. Given the architecture, we use a polynomial equation solver (Harris et al., 2020) to obtain $p_1$ for the first layer such that $p_1 \in (0,1)$.
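This reduces to finding the root $p_1 \in (0,1)$ of the polynomial $\sum_{l=1}^{L} m_l p_1^l - p \sum_{l=1}^{L} m_l = 0$; a sketch of this step using NumPy's root finder (function names are ours, not the paper's):

```python
import numpy as np

def pyramidal_p1(m, p):
    # Root of m_L * x^L + ... + m_1 * x - p * sum(m) = 0 that lies in (0, 1);
    # m is the list of per-layer parameter counts m_1, ..., m_L.
    L = len(m)
    coeffs = [m[L - 1 - i] for i in range(L)] + [-p * sum(m)]
    roots = np.roots(coeffs)
    real = roots[np.isreal(roots)].real
    valid = real[(real > 0.0) & (real < 1.0)]
    if valid.size == 0:
        raise ValueError("no p1 in (0, 1) for this architecture and density")
    return float(valid[0])

def pyramidal_ratios(m, p):
    # Layerwise densities p_l = (p_1)^l for l = 1, ..., L
    p1 = pyramidal_p1(m, p)
    return [p1 ** l for l in range(1, len(m) + 1)]
```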
Balanced: The second layerwise sparsity method aims to maintain the same number of parameters in every layer for a given network sparsity $p$ and source network architecture. Each neuron has a similar in- and out-degree on average. Every layer has $x = \frac{p}{L} \sum_{l=1}^{L} m_l$ nonzero parameters. Such an ER network can be realized with $p_l = x / m_l$. In case $x \ge m_l$, we set $p_l = 1$.
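A sketch of the balanced assignment (again with hypothetical names; the cap at 1 follows the rule stated above, and the excess is not redistributed):

```python
import numpy as np

def balanced_ratios(m, p):
    # Every layer gets x = (p / L) * sum_l m_l nonzero parameters, so p_l = x / m_l,
    # capped at 1 whenever x >= m_l.
    m = np.asarray(m, dtype=float)
    x = p / len(m) * m.sum()
    return np.minimum(1.0, x / m)
```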