Make Sharpness-Aware Minimization Stronger:
A Sparsified Perturbation Approach
Peng Mi1, Li Shen2, Tianhe Ren1, Yiyi Zhou1
Xiaoshuai Sun1, Rongrong Ji1, Dacheng Tao2,3
1Media Analytics and Computing Laboratory, Department of Artificial Intelligence,
School of Informatics, Xiamen University, China
2JD Explore Academy, Beijing, China 3The University of Sydney, Australia
mipeng@stu.xmu.edu.cn, mathshenli@gmail.com, rentianhe@stu.xmu.edu.cn
zhouyiyi@xmu.edu.cn, xssun@xmu.edu.cn
rrji@xmu.edu.cn, dacheng.tao@gmail.com
Abstract
Deep neural networks often suffer from poor generalization caused by complex and
non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Min-
imization (SAM), which smooths the loss landscape via minimizing the maximized
change of training loss when adding a perturbation to the weight. However, we find
the indiscriminate perturbation of SAM on all parameters is suboptimal, which also
results in excessive computation, i.e., double the overhead of common optimizers
like Stochastic Gradient Descent (SGD). In this paper, we propose an efficient
and effective training scheme coined as Sparse SAM (SSAM), which achieves
sparse perturbation by a binary mask. To obtain the sparse mask, we provide
two solutions, which are based on Fisher information and dynamic sparse training,
respectively. In addition, we theoretically prove that SSAM can converge at the
same rate as SAM, i.e., O(log T/√T). Sparse SAM not only has the potential
for training acceleration but also smooths the loss landscape effectively. Exten-
sive experimental results on CIFAR10, CIFAR100, and ImageNet-1K confirm the
superior efficiency of our method to SAM, and the performance is preserved or
even better with a perturbation of merely 50% sparsity. Code is available at
https://github.com/Mi-Peng/Sparse-Sharpness-Aware-Minimization.
1 Introduction
Over the past decade or so, the great success of deep learning has been due in large part to ever-larger
model parameter sizes [13, 57, 53, 10, 40, 5]. However, the excessive parameters also make the model
prone to poor generalization. To overcome this problem, numerous efforts have been devoted to
training algorithms [24, 42, 51], data augmentation [11, 58, 56], and network design [26, 28].
One important finding in recent research is the connection between the geometry of the loss landscape
and model generalization [31, 17, 27, 44, 54]. In general, the loss landscape of a model is complex
and non-convex, which makes the model tend to converge to sharp minima. Recent endeavors [31, 27, 44]
show that the flatter the minima of convergence, the better the model generalizes. This discovery
reveals the nature of previous approaches [24, 26, 11, 56, 58, 28] to improve generalization, i.e.,
smoothing the loss landscape.
This work was done during an internship at JD Explore Academy.
Li Shen is the corresponding author.
Based on this finding, Foret et al. [17] propose a novel approach to improve model generalization,
called sharpness-aware minimization (SAM), which simultaneously minimizes the loss value and the loss
sharpness. SAM quantifies the landscape sharpness as the maximized difference of loss when a
perturbation is added to the weights. When the model reaches a sharp region, the perturbed gradients in
SAM help the model jump out of the sharp minima. In practice, SAM requires two forward-backward
computations for each optimization step, where the first computation obtains the perturbation and
the second one updates the parameters. Despite the remarkable performance [17, 34, 14, 7], this
property makes SAM double the computational cost of conventional optimizers, e.g., SGD [3].
Since SAM calculates perturbations indiscriminately for all parameters, a natural question arises:
Do we need to calculate perturbations for all parameters?
Above all, we notice that in most deep neural networks, only about 5% of parameters are sharp and
rise steeply during optimization [31]. We then explore the effect of SAM along different dimensions to
answer the above question and find that (i) there is little difference between SGD and SAM gradients in
most dimensions (see Fig. 1); (ii) some dimensions are even flatter without SAM (see Fig. 4 and Fig. 5).
Inspired by the above discoveries, we propose a novel scheme to improve the efficiency of SAM via
sparse perturbation, termed Sparse SAM (SSAM). SSAM, which plays the role of regularization, achieves
better generalization, and its sparse operation also ensures the efficiency of optimization. Specifically,
the perturbation in SSAM is multiplied by a binary sparse mask to determine which parameters
should be perturbed. To obtain the sparse mask, we provide two implementations. The first solution
uses the Fisher information [16] of the parameters to formulate the binary mask, dubbed SSAM-F.
The other employs dynamic sparse training to jointly optimize the model parameters and the
sparse mask, dubbed SSAM-D. The first solution is relatively more stable but a bit more time-consuming,
while the latter is more efficient.
In addition to these solutions, we provide a theoretical convergence analysis of SAM and SSAM in the
non-convex stochastic setting, proving that our SSAM can converge at the same rate as SAM, i.e.,
O(log T/√T). Finally, we evaluate the performance and effectiveness of SSAM on CIFAR10 [33],
CIFAR100 [33], and ImageNet [8] with various models. The experiments confirm that SSAM
contributes to a flatter landscape than SAM, and that its performance is on par with or even better than
SAM with only about 50% perturbation. These results coincide with our motivations and findings.
To sum up, the contributions of this paper are three-fold:
• We rethink the role of perturbation in SAM and find that the indiscriminate perturbations
are suboptimal and computationally inefficient.
• We propose a sparsified perturbation approach called Sparse SAM (SSAM) with two
variants, i.e., Fisher SSAM (SSAM-F) and Dynamic SSAM (SSAM-D), both of which enjoy
better efficiency and effectiveness than SAM. We also theoretically prove that SSAM can
converge at the same rate as SAM, i.e., O(log T/√T).
• We evaluate SSAM with various models on CIFAR and ImageNet, showing that WideResNet
with SSAM at a high sparsity outperforms SAM on CIFAR, that SSAM achieves competitive
performance at a high sparsity, and that SSAM has a convergence rate comparable to SAM.
2 Related Work
In this section, we briefly review the studies on sharpness-aware minimization (SAM),
Fisher information in deep learning, and dynamic sparse training.
SAM and flat minima.
Hochreiter et al. [27] first reveal that there is a strong correlation between the
generalization of a model and flat minima. Since then, a growing amount of research has built on
this finding. Keskar et al. [31] conduct experiments with larger batch sizes and consequently
observe a degradation of model generalization capability. They [31] also attribute this
phenomenon to the model's tendency to converge to sharp minima. Keskar et al. [31] and
Dinh et al. [12] state that sharpness can be evaluated by the eigenvalues of the Hessian. However,
they fail to find the flat minima due to the notorious computational cost of the Hessian.
Inspired by this, Foret et al. [17] introduce sharpness-aware minimization (SAM) to find a flat
minimum for improving generalization capability, which is achieved by solving a min-max problem.
Zhang et al. [59] point out that SAM [17] is equivalent to adding a regularization on the gradient
norm by approximating the Hessian matrix. Kwon et al. [34] propose a scale-invariant SAM scheme
with an adaptive radius to improve training stability. Zhang et al. [60] redefine the landscape sharpness
from an intuitive and theoretical perspective based on SAM. To reduce the computational cost of
SAM, Du et al. [14] propose Efficient SAM (ESAM), which randomly calculates the perturbation. However,
ESAM randomly selects the samples at every step, which may lead to optimization bias. Instead of
perturbing all parameters as in SAM, we compute a sparse perturbation, i.e., SSAM, which
learns important but sparse dimensions for perturbation.
Fisher information (FI).
Fisher information [16] was proposed to measure the information that
an observable random variable carries about an unknown parameter of a distribution. In machine
learning, Fisher information is widely used to measure the importance of model parameters [32]
and to decide which parameters should be pruned [50, 52]. In the proposed SSAM-F, Fisher information
is used to determine whether a weight should be perturbed for flat minima.
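As a rough illustration of how Fisher information can drive such a choice, the sketch below scores each weight with an empirical-Fisher approximation (squared gradients) and keeps the highest-scoring fraction; the function name, the squared-gradient approximation, and the top-k selection are illustrative assumptions, not the exact SSAM-F procedure.

```python
import torch

def fisher_mask(params, grads, sparsity=0.5):
    """Illustrative sketch: build a binary perturbation mask from an
    empirical-Fisher score (squared gradients), keeping the top (1 - sparsity)
    fraction of weights. Not the paper's exact SSAM-F algorithm."""
    # Empirical Fisher of each weight, approximated by its squared gradient.
    scores = torch.cat([g.detach().pow(2).flatten() for g in grads])
    k = max(1, int((1.0 - sparsity) * scores.numel()))   # number of weights to perturb
    threshold = torch.topk(scores, k).values.min()        # smallest kept score
    flat_mask = (scores >= threshold).float()
    # Split the flat mask back into per-parameter masks.
    masks, offset = [], 0
    for p in params:
        n = p.numel()
        masks.append(flat_mask[offset:offset + n].view_as(p))
        offset += n
    return masks
```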
Dynamic sparse training.
Finding a sparse network by pruning unimportant weights is a popular
solution in network compression, which can be traced back decades [35]. The widely used
training scheme, i.e., pretraining-pruning-fine-tuning, was presented by Han et al. [23]. To avoid the
requirement of a pre-trained model, some recent research [15, 2, 9, 30, 43, 37, 38] attempts to
discover a sparse network directly during training. Dynamic Sparse Training (DST) finds
the sparse structure by dynamic parameter reallocation. The pruning criterion can be weight
magnitude [18], gradient [15], Hessian [35, 49], etc. Different from existing
DST methods that prune neurons, our target is to obtain a binary mask for sparse perturbation.
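To give a flavor of dynamic parameter reallocation, here is a minimal drop-and-grow mask update driven by the weight-magnitude criterion; the function name, the random regrowth rule, and the drop ratio are illustrative assumptions rather than the SSAM-D procedure.

```python
import torch

def update_mask(weight, mask, drop_ratio=0.1):
    """Illustrative drop-and-grow step in dynamic sparse training: drop the
    smallest-magnitude active entries, then regrow the same number at random
    inactive positions, keeping the overall sparsity fixed."""
    active = mask.bool()
    n_drop = int(drop_ratio * active.sum().item())
    if n_drop == 0:
        return mask
    # Drop: deactivate the active entries with the smallest |weight|.
    magnitude = weight.abs().masked_fill(~active, float("inf"))
    drop_idx = torch.topk(magnitude.flatten(), n_drop, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    # Grow: reactivate an equal number of randomly chosen inactive entries.
    inactive_idx = (new_mask == 0).nonzero(as_tuple=True)[0]
    grow_idx = inactive_idx[torch.randperm(inactive_idx.numel())[:n_drop]]
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)
```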
3 Rethinking the Perturbation in SAM
In this section, we first review how SAM converges to a flat minimum of the loss landscape. Then,
we rethink the role of perturbation in SAM.
3.1 Preliminary
In this paper, we consider the weights of a deep neural network as w = (w_1, w_2, ..., w_d) ∈ W ⊆ R^d
and denote a binary mask as m ∈ {0, 1}^d, which satisfies 1^T m = (1 − s) · d (where s denotes the
sparsity of the mask) to restrict the computational cost. Given a training dataset S ≜ {(x_i, y_i)}_{i=1}^n
drawn i.i.d. from the distribution D, the per-data-point loss function is defined as f(w, x_i, y_i). For the
classification task in this paper, we use cross-entropy as the loss function. The population loss is defined
as f_D ≜ E_{(x_i, y_i)∼D} f(w, x_i, y_i), while the empirical training loss is
f_S ≜ (1/n) Σ_{i=1}^n f(w, x_i, y_i).
Sharpness-aware minimization (SAM) [17] aims to simultaneously minimize the loss value and
smooth the loss landscape, which is achieved by solving the min-max problem:

    min_w max_{||ε||_2 ≤ ρ} f_S(w + ε).    (1)
SAM first obtains the perturbation ε in a neighborhood ball with radius ρ. The
optimization then minimizes the loss of the perturbed weight w + ε. Intuitively, the goal of SAM is
that small perturbations to the weights should not significantly raise the empirical loss, which indicates
that SAM tends to converge to a flat minimum. To solve the min-max optimization, SAM approximately
calculates the perturbation using a first-order Taylor expansion around w:
    ε* = arg max_{||ε||_2 ≤ ρ} f_S(w + ε) ≈ arg max_{||ε||_2 ≤ ρ} f_S(w) + ε^T ∇_w f_S(w) = ρ · ∇_w f_S(w) / ||∇_w f_S(w)||_2.    (2)

In this way, the objective function can be rewritten as min_w f_S(w + ρ · ∇_w f_S(w) / ||∇_w f_S(w)||_2),
which can be implemented as a two-step gradient descent framework in PyTorch or TensorFlow:

• In the first step, the gradient at w is used to calculate the perturbation ε by Eq. 2. The model
weights are then updated to w + ε.
• In the second step, the gradient at w + ε is used to solve min_w f_S(w + ε), i.e., the weight w is
updated by this gradient.
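To make the two steps concrete, below is a minimal PyTorch sketch of one SAM update; the function signature, the gradient-norm guard, and the use of a generic base optimizer are illustrative assumptions rather than the authors' released implementation.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    """Minimal sketch of one SAM update (two forward-backward passes).
    Function and argument names are illustrative, not the authors' code."""
    # First forward-backward: the gradient at w defines the perturbation (Eq. 2).
    loss_fn(model(inputs), targets).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    eps = [rho * g / (grad_norm + 1e-12) for g in grads]

    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                      # climb to w + eps
    model.zero_grad()

    # Second forward-backward: the gradient at w + eps is used to update w.
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                      # restore the original weights w
    base_optimizer.step()                  # descend with the gradient taken at w + eps
    model.zero_grad()
```

Here base_optimizer would be, e.g., torch.optim.SGD over model.parameters(), and sam_step replaces the usual single backward pass plus optimizer step, which is why SAM roughly doubles the per-step cost.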
3.2 Rethinking the Perturbation Step of SAM
How does SAM work in flat subspace?
SAM perturbs all parameters indiscriminately, but the
fact is that merely about 5% of the parameter space is sharp while the rest is flat [31]. We are curious
whether perturbing the parameters in those already-flat dimensions leads to instability of the
optimization and impairs the improvement of generalization. To answer this question, we quantitatively
and qualitatively analyze the loss landscapes under different training schemes in Section 5, as shown
in Fig. 4 and Fig. 5. The results confirm our conjecture that optimizing some dimensions without
perturbation can help the model generalize better.
What is the difference between the gradients of SGD and SAM?
We investigate various neural
networks optimized with SAM and SGD on CIFAR10/100 and ImageNet, whose statistics are given
in Fig. 1. We use the relative difference ratio r, defined as r = log |(g_SAM − g_SGD) / g_SGD|, to
measure the difference between the gradients of SAM and SGD. As shown in Fig. 1, the parameters
with r less than 0 account for the vast majority of all parameters, indicating that most SAM gradients
are not significantly different from the corresponding SGD gradients. These results show that most
parameters of the model require no perturbation to reach flat minima, which well confirms our motivation.
Figure 1: The distribution of the relative difference ratio r among various models and datasets. There is
little difference between SAM and SGD gradients for most parameters, i.e., the ratio r is less than 0.
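As a small illustration of how such a statistic could be computed per parameter tensor (a sketch only; the tensors g_sam and g_sgd, the log base, and the epsilon guard are assumptions):

```python
import torch

def relative_difference_ratio(g_sam, g_sgd, eps=1e-12):
    """Sketch of the per-weight ratio r = log |(g_SAM - g_SGD) / g_SGD|.
    Base-10 log and the eps guard against division by zero are illustrative choices."""
    return torch.log10((g_sam - g_sgd).abs() / (g_sgd.abs() + eps))

# Hypothetical usage: pool r over all parameter tensors and check how many fall below 0.
# r_all = torch.cat([relative_difference_ratio(gs, gd).flatten()
#                    for gs, gd in zip(sam_grads, sgd_grads)])
# print((r_all < 0).float().mean())
```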
Inspired by the above observation and the promising hardware acceleration for sparse operations on
modern GPUs, we further propose Sparse SAM, a novel sparse perturbation approach, serving as an
implicit regularization to improve the efficiency and effectiveness of SAM.
4 Methodology
In this section, we first define the proposed Sparse SAM (SSAM), which strengthens SAM via sparse
perturbation. Afterwards, we introduce the instantiations of the sparse mask used in SSAM via Fisher
information and dynamic sparse training, dubbed SSAM-F and SSAM-D, respectively.
4.1 Sparse SAM
Motivated by the findings discussed in the introduction, Sparse SAM (SSAM) employs a sparse
binary mask to decide which parameters should be perturbed, thereby improving the efficiency of
sharpness-aware minimization. Specifically, the perturbation ε is multiplied element-wise by a sparse
binary mask m, and the objective function is then rewritten as
min_w f_S(w + ρ · ∇_w f_S(w) / ||∇_w f_S(w)||_2 ⊙ m). To
stabilize the optimization, the sparse binary mask m is updated at regular intervals during training. We
provide two solutions to obtain the sparse mask m, namely Fisher information based Sparse SAM
(SSAM-F) and dynamic sparse training based Sparse SAM (SSAM-D). The overall algorithms of
SSAM and the sparse mask generation are shown in Algorithm 1 and Algorithm 2, respectively.
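Restricting attention to the perturbation step only, a minimal sketch of the masked perturbation might look as follows (building on the SAM sketch above; the helper name and mask representation are assumptions):

```python
import torch

def ssam_perturbation(grads, masks, rho=0.05):
    """Minimal sketch of SSAM's sparse perturbation: the SAM perturbation is
    multiplied element-wise by a binary mask m, so only the selected
    parameters are perturbed. Names are illustrative."""
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    # eps_i = rho * g_i / ||g||_2, zeroed out wherever the mask is 0.
    return [rho * g / (grad_norm + 1e-12) * m for g, m in zip(grads, masks)]
```

The remaining two-step update is identical to SAM's; only the perturbation itself is sparsified by m.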
According to previous work [55] and the Ampere architecture equipped with sparse tensor cores [47, 46, 45],
there currently exists technical support for matrix multiplication with 50% fine-grained
sparsity [55]³. Therefore, SSAM with a 50% sparse perturbation has great potential to achieve true
training acceleration via sparse back-propagation.
³For instance, 2:4 sparsity for the A100 GPU.