
with $v$). In the general case, we see, thanks to Eq. (8), that the same multiplicative structure of the noise is still present, but this time with respect to the Jacobian $\phi_\theta(X)$. Hence, this suggests that, similarly to the diagonal linear network case, the implicit bias of the noise can lead to a shrinkage effect applied to $\phi_\theta(X)$. This effect depends on the noise intensity $\delta$ and the step size of SGD. Indeed, an interesting property of Brownian motion is that, for $v \in \mathbb{R}^p$, $\langle v, B_t \rangle = \|v\|_2 W_t$, where the equality holds in law and $(W_t)_{t \ge 0}$ is a one-dimensional Brownian motion. Hence, the process in Eq. (8) is equivalent to a process whose $i$-th coordinate is driven by a noise proportional to $\|\phi_i\| \, dW_t^i$, where $\phi_i$ is the $i$-th column of $\phi_\theta(X)$ and $(W_t^i)_{t \ge 0}$ is a Brownian motion. This SDE structure, similar to a geometric Brownian motion, is expected to induce the shrinkage of each multiplicative factor (Oksendal, 2013, Section 5.1), i.e., in our case $(\|\nabla_\theta h_\theta(x_i)\|)_{i=1}^{n}$. Thus, we conjecture:
The noise part of Eq. (8) seeks to minimize the $\ell_2$-norm of the columns of $\phi_\theta(X)$.
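To build intuition for this shrinkage mechanism, the following minimal sketch (our illustration, not part of the paper; the noise intensity, horizon, and Euler-Maruyama discretization are arbitrary choices) simulates the purely multiplicative scalar SDE $dX_t = \delta X_t \, dW_t$ and shows that the typical realization of the multiplicative factor collapses toward zero, matching the closed-form median $\exp(-\delta^2 T/2)$, even though the mean stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)

delta, dt, T, n_paths = 1.0, 1e-3, 5.0, 1000    # noise intensity, time step, horizon, sample paths
n_steps = int(T / dt)

X = np.ones(n_paths)                             # every path starts at X_0 = 1
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    X += delta * X * dW                          # Euler-Maruyama step for dX_t = delta * X_t * dW_t

# Ito's formula gives log X_T = delta * W_T - delta^2 * T / 2, so the median of |X_T|
# is exp(-delta^2 * T / 2) and the typical path shrinks to zero, while E[X_T] = 1.
print("median |X_T|         :", np.median(np.abs(X)))
print("exp(-delta^2 * T / 2):", np.exp(-delta**2 * T / 2))
print("sample mean of X_T   :", X.mean())        # noisy estimate of the constant mean 1
```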
Note that the fitting part of the dynamics prevents the Jacobian from collapsing totally to zero, but columns that are not needed to fit the signal can be reduced to zero. Remarkably, from a stability perspective, Blanc et al. (2020) showed a similar bias: locally around a minimum, the SGD dynamics implicitly tries to minimize the Frobenius norm $\|\phi_\theta(X)\|_F^2 = \sum_{i=1}^{n} \|\nabla_\theta h_\theta(x_i)\|_2^2$. Resolving the above conjecture and characterizing the implicit bias along the trajectory of SGD remains an exciting avenue for future work. Now, we provide a specification of this implicit bias for different architectures:
• Diagonal linear networks: For $h_{u,v}(x) = \langle u \odot v, x \rangle$, we have $\nabla_{u,v} h_{u,v}(x) = [v \odot x,\, u \odot x]$. Thus, for a generic data matrix $X$, minimizing the norm of each column of $\phi_{u,v}(X)$ amounts to putting the maximal number of zero coordinates and hence to minimizing $\|u \odot v\|_0$.
• ReLU networks: We take the prototypical one-hidden-layer network to exhibit the sparsification effect. Let $h_{a,W}(x) = \langle a, \sigma(Wx) \rangle$; then $\nabla_a h_{a,W}(x) = \sigma(Wx)$ and $\nabla_{w_j} h_{a,W}(x) = a_j x \, \mathbb{1}_{\langle w_j, x \rangle > 0}$. Note that the $\ell_2$-norm of the column entries corresponding to a given neuron is reduced when that neuron is activated at a minimal number of training points, hence the implicit bias enables the learning of sparse data-active features. Finally, when some directions are needed to fit the data, similarly activated neurons align to fit, reducing the rank of $\phi_\theta(X)$ (both gradient formulas are checked numerically in the sketch below).
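The following short sketch (our illustrative code, not the paper's; network sizes and data are random) assembles $\phi_\theta(X)$ column by column with PyTorch autograd for the two architectures above and prints the per-column $\ell_2$-norms that the noise is conjectured to shrink.

```python
import torch

torch.manual_seed(0)
n, d, m = 8, 5, 6                          # training points, input dimension, hidden width
X = torch.randn(n, d)

def jacobian_columns(h, params):
    """Return phi_theta(X): the (p, n) matrix whose i-th column is grad_theta h(x_i)."""
    cols = []
    for i in range(n):
        grads = torch.autograd.grad(h(X[i]), params)
        cols.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(cols, dim=1)

# Diagonal linear network: h_{u,v}(x) = <u * v, x>, so grad_{u,v} h = [v * x, u * x].
u = torch.randn(d, requires_grad=True)
v = torch.randn(d, requires_grad=True)
phi_diag = jacobian_columns(lambda x: torch.dot(u * v, x), [u, v])

# One-hidden-layer ReLU network: h_{a,W}(x) = <a, relu(W x)>, so grad_a h = relu(W x)
# and grad_{w_j} h = a_j * x * 1[<w_j, x> > 0].
a = torch.randn(m, requires_grad=True)
W = torch.randn(m, d, requires_grad=True)
phi_relu = jacobian_columns(lambda x: torch.dot(a, torch.relu(W @ x)), [a, W])

# Column i has norm ||grad_theta h(x_i)||; these are the quantities the noise shrinks.
print("diagonal network column norms:", phi_diag.norm(dim=0))
print("ReLU network column norms    :", phi_relu.norm(dim=0))
```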
Feature sparsity. Our main insight is that the Jacobian could be significantly simplified during the loss stabilization phase. Indeed, while the gradient part tries to fit the data and align neurons (see, e.g., Fig. 10), the noise part of Eq. (8) intends to minimize the $\ell_2$-norm of the columns of $\phi_\theta(X)$. Hence, in combination, this motivates us to count the average number of distinct (i.e., counting a group of aligned neurons as one) non-zero activations over the training set. We refer to this as the feature sparsity coefficient (see the next section for a detailed description). Note that the aforementioned sparsity comes both in the number of distinct neurons and their activation.
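The precise definition follows in the next section; as a rough illustration of the idea, the hypothetical sketch below computes one plausible version of the coefficient for a one-hidden-layer ReLU network: neurons whose incoming weight vectors are nearly aligned are merged into one group (the cosine threshold of 0.99 is our assumption), and we then average, over the training set, the number of groups with a non-zero activation.

```python
import numpy as np

def feature_sparsity(W, X, align_thresh=0.99):
    """Average number of distinct, non-zero hidden activations per training point,
    counting a group of aligned neurons (cosine similarity > align_thresh) as one.

    W: (m, d) hidden-layer weights, X: (n, d) training inputs.
    """
    # Greedily group neurons by the direction of their incoming weight vector.
    dirs = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    group = np.full(len(W), -1)
    for j in range(len(W)):
        if group[j] == -1:
            aligned = (dirs @ dirs[j] > align_thresh) & (group == -1)
            group[aligned] = j

    # A group counts as active at x_i if at least one of its neurons fires there.
    active = (X @ W.T) > 0                               # (n, m) activation pattern
    counts = [len(set(group[active[i]])) for i in range(len(X))]
    return float(np.mean(counts))

# Purely illustrative numbers: a random network on random data.
rng = np.random.default_rng(0)
W, X = rng.normal(size=(32, 10)), rng.normal(size=(200, 10))
print("feature sparsity coefficient:", feature_sparsity(W, X))
```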
We show next that the conjectured sparsity is indeed observed empirically for a variety of models. Remark that both the feature sparsity coefficient and the rank of $\phi_\theta(X)$ can be used as good proxies to track the hidden progress during the loss stabilization phase.
3. Empirical Evidence of Sparse Feature Learning Driven by SGD
Here we present empirical results for neural networks of increasing complexity: from diagonal linear networks to deep DenseNets on CIFAR-10, CIFAR-100, and Tiny ImageNet.
We make the following common observations for all these
networks trained using SGD schedules with large step sizes:
(O1) Loss stabilization: the training loss stabilizes around a high level set until the step size is decayed,
(O2) Generalization benefit: longer loss stabilization leads to better generalization,
(O3) Sparse feature learning: longer loss stabilization leads to sparser features.
Importantly, we use no explicit regularization (in particular, no weight decay) in our experiments so that the training dynamics is driven purely by SGD and the step size schedule. Additionally, in some cases, we cannot find a single large step size that would lead to loss stabilization. In such cases, whenever explicitly mentioned, we use a warmup step size schedule, i.e., increasing step sizes according to some schedule, to make sure that the loss stabilizes around some level set. Warmup is commonly used in practice (He et al., 2016; Devlin et al., 2018) and is often motivated purely from the optimization perspective as a way to accelerate training (Agarwal et al., 2021), but we suggest that it is also a way to amplify the regularization effect of the SGD noise, which is proportional to the step size.
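For concreteness, a warmup schedule of the kind referred to here can be as simple as the following sketch (the linear shape, peak step size, and warmup length are illustrative choices on our part, not the paper's exact protocol):

```python
def warmup_step_size(t, peak_lr=0.1, warmup_steps=1000):
    """Increase the step size linearly from ~0 to peak_lr over warmup_steps iterations,
    then keep it constant so that the training loss can stabilize around a level set."""
    return peak_lr * min(1.0, (t + 1) / warmup_steps)

# Example: step sizes used at iterations 0, 500, 1000, and 5000.
print([warmup_step_size(t) for t in (0, 500, 1000, 5000)])
```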
Measuring sparse feature learning. We track the simplification of the Jacobian by measuring both the feature sparsity and the rank of $\phi_\theta(X)$. We compute the rank over iterations for each model (except deep networks, for which it is prohibitively expensive) by using a fixed threshold on the singular values of $\phi_\theta(X)$ normalized by the largest singular value. In this way, we ensure that the difference in the rank that we detect is not simply due to different scales of $\phi_\theta(X)$. Moreover, we always compute $\phi_\theta(X)$ on a number of fresh samples equal to the number of parameters $|\theta|$ to make sure that rank deficiency is not coming from $n \ll |\theta|$, which is the case in the overparametrized settings we consider.
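A minimal sketch of this thresholded rank computation (our code; the relative threshold value of $10^{-3}$ is an assumed placeholder, and $\phi$ denotes the $p \times n$ matrix of per-sample gradients $\nabla_\theta h_\theta(x_i)$ evaluated on the fresh samples):

```python
import numpy as np

def thresholded_rank(phi, rel_threshold=1e-3):
    """Number of singular values of phi_theta(X) above a fixed fraction of the largest one.

    Normalizing by the top singular value makes the rank estimate insensitive to the
    overall scale of the Jacobian, which can change across training iterations.
    """
    s = np.linalg.svd(phi, compute_uv=False)     # singular values in descending order
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_threshold * s[0]))
```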
To compute the feature sparsity coefficient,