Make Sharpness-Aware Minimization Stronger:
A Sparsified Perturbation Approach
Peng Mi1, Li Shen2, Tianhe Ren1, Yiyi Zhou1
Xiaoshuai Sun1, Rongrong Ji1, Dacheng Tao2,3
1Media Analytics and Computing Laboratory, Department of Artificial Intelligence,
School of Informatics, Xiamen University, China
2JD Explore Academy, Beijing, China 3The University of Sydney, Australia
mipeng@stu.xmu.edu.cn, mathshenli@gmail.com, rentianhe@stu.xmu.edu.cn
zhouyiyi@xmu.edu.cn, xssun@xmu.edu.cn
rrji@xmu.edu.cn, dacheng.tao@gmail.com
Abstract
Deep neural networks often suffer from poor generalization caused by complex and
non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Min-
imization (SAM), which smooths the loss landscape via minimizing the maximized
change of training loss when adding a perturbation to the weight. However, we find
the indiscriminate perturbation of SAM on all parameters is suboptimal, which also
results in excessive computation, i.e., double the overhead of common optimizers
like Stochastic Gradient Descent (SGD). In this paper, we propose an efficient
and effective training scheme coined as Sparse SAM (SSAM), which achieves
sparse perturbation by a binary mask. To obtain the sparse mask, we provide
two solutions, which are based on Fisher information and dynamic sparse training,
respectively. In addition, we theoretically prove that SSAM can converge at the
same rate as SAM, i.e., O(log T/√T). Sparse SAM not only has the potential
for training acceleration but also smooths the loss landscape effectively. Exten-
sive experimental results on CIFAR10, CIFAR100, and ImageNet-1K confirm the
superior efficiency of our method to SAM, and the performance is preserved or
even better with a perturbation of merely 50% sparsity. Code is available at
https://github.com/Mi-Peng/Sparse-Sharpness-Aware-Minimization.
1 Introduction
Over the past decade or so, the great success of deep learning has been due in large part to ever-larger
model parameter sizes [13, 57, 53, 10, 40, 5]. However, the excessive parameters also make the model
prone to poor generalization. To overcome this problem, numerous efforts have been devoted to
training algorithms [24, 42, 51], data augmentation [11, 58, 56], and network design [26, 28].
One important finding in recent research is the connection between the geometry of the loss landscape
and model generalization [31, 17, 27, 44, 54]. In general, the loss landscape of a model is complex
and non-convex, which makes the model tend to converge to sharp minima. Recent endeavors [31, 27, 44]
show that the flatter the minima of convergence, the better the model generalizes. This discovery
reveals the nature of previous approaches [24, 26, 11, 56, 58, 28] to improve generalization, i.e.,
smoothing the loss landscape.
This work was done during an internship at JD Explore Academy.
Li Shen is the corresponding author.
Based on this finding, Foret et al. [17] propose a novel approach to improve model generalization,
called sharpness-aware minimization (SAM), which simultaneously minimizes the loss value and the loss
sharpness. SAM quantifies the landscape sharpness as the maximized difference of loss when a
perturbation is added to the weights. When the model reaches a sharp region, the perturbed gradients in
SAM help the model jump out of the sharp minima. In practice, SAM requires two forward-backward
computations for each optimization step, where the first computation obtains the perturbation and
the second one updates the parameters. Despite the remarkable performance [17, 34, 14, 7], this
property makes SAM double the computational cost of conventional optimizers, e.g., SGD [3].
Since SAM calculates perturbations indiscriminately for all parameters, a natural question arises:
Do we need to calculate perturbations for all parameters?
Above all, we notice that in most deep neural networks, only about 5% of parameters are sharp and
rise steeply during optimization [31]. We then explore the effect of SAM along different dimensions to
answer the above question and find that (i) there is little difference between SGD and SAM gradients in
most dimensions (see Fig. 1); (ii) some dimensions are even flatter without SAM (see Fig. 4 and Fig. 5).
Inspired by the above discoveries, we propose a novel scheme to improve the efficiency of SAM via
sparse perturbation, termed Sparse SAM (SSAM). SSAM, which plays the role of regularization, achieves
better generalization, and its sparse operation also ensures the efficiency of optimization. Specifically,
the perturbation in SSAM is multiplied by a binary sparse mask to determine which parameters
should be perturbed. To obtain the sparse mask, we provide two implementations. The first solution
uses the Fisher information [16] of the parameters to formulate the binary mask, dubbed SSAM-F.
The other employs dynamic sparse training to jointly optimize the model parameters and the
sparse mask, dubbed SSAM-D. The first solution is relatively more stable but a bit more time-consuming,
while the latter is more efficient.
In addition to these solutions, we provide a theoretical convergence analysis of SAM and SSAM in the
non-convex stochastic setting, proving that our SSAM can converge at the same rate as SAM, i.e.,
O(log T/√T). Finally, we evaluate the performance and effectiveness of SSAM on CIFAR10 [33],
CIFAR100 [33], and ImageNet [8] with various models. The experiments confirm that SSAM
contributes to a flatter landscape than SAM, and that its performance is on par with or even better than
SAM with only about 50% perturbation. These results coincide with our motivations and findings.
To sum up, the contributions of this paper are three-fold:
• We rethink the role of perturbation in SAM and find that the indiscriminate perturbations
are suboptimal and computationally inefficient.
• We propose a sparsified perturbation approach called Sparse SAM (SSAM) with two
variants, i.e., Fisher SSAM (SSAM-F) and Dynamic SSAM (SSAM-D), both of which enjoy
better efficiency and effectiveness than SAM. We also theoretically prove that SSAM can
converge at the same rate as SAM, i.e., O(log T/√T).
• We evaluate SSAM with various models on CIFAR and ImageNet, showing that WideResNet
with SSAM at a high sparsity outperforms SAM on CIFAR, that SSAM achieves competitive
performance at a high sparsity, and that SSAM has a convergence rate comparable to SAM.
2 Related Work
In this section, we briefly review the studies on sharpness-aware minimization (SAM),
Fisher information in deep learning, and dynamic sparse training.
SAM and flat minima.
Hochreiter et al. [27] first reveal that there is a strong correlation between the
generalization of a model and flat minima. Since then, a growing amount of research has built on
this finding. Keskar et al. [31] conduct experiments with larger batch sizes and consequently
observe a degradation of model generalization capability. They [31] also attribute this
phenomenon to the model's tendency to converge to sharp minima. Keskar et al. [31] and
Dinh et al. [12] state that sharpness can be evaluated by the eigenvalues of the Hessian. However,
they fail to find the flat minima due to the notorious computational cost of the Hessian.
Inspired by this, Foret et al. [17] introduce sharpness-aware minimization (SAM) to find a flat
minimum for improving generalization capability, which is achieved by solving a min-max problem.
Zhang et al. [59] point out that SAM [17] is equivalent to adding a regularization on the gradient
norm by approximating the Hessian matrix. Kwon et al. [34] propose a scale-invariant SAM scheme
with an adaptive radius to improve training stability. Zhang et al. [60] redefine the landscape sharpness
from an intuitive and theoretical perspective based on SAM. To reduce the computational cost of
SAM, Du et al. [14] propose Efficient SAM (ESAM), which randomly calculates the perturbation. However,
ESAM randomly selects the samples at every step, which may lead to optimization bias. Instead of
perturbing all parameters as in SAM, we compute a sparse perturbation, i.e., SSAM, which
learns important but sparse dimensions for perturbation.
Fisher information (FI).
Fisher information [16] was proposed to measure the information that
an observable random variable carries about an unknown parameter of a distribution. In machine
learning, Fisher information is widely used to measure the importance of model parameters [32]
and to decide which parameters should be pruned [50, 52]. In the proposed SSAM-F, Fisher information
is used to determine whether a weight should be perturbed for flat minima.
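As a rough illustration of how Fisher information can drive such a choice, the sketch below scores each weight with an empirical-Fisher approximation (squared gradients) and keeps the highest-scoring fraction; the function name, the squared-gradient approximation, and the top-k selection are illustrative assumptions, not the exact SSAM-F procedure.

```python
import torch

def fisher_mask(params, grads, sparsity=0.5):
    """Illustrative sketch: build a binary perturbation mask from an
    empirical-Fisher score (squared gradients), keeping the top (1 - sparsity)
    fraction of weights. Not the paper's exact SSAM-F algorithm."""
    # Empirical Fisher of each weight, approximated by its squared gradient.
    scores = torch.cat([g.detach().pow(2).flatten() for g in grads])
    k = max(1, int((1.0 - sparsity) * scores.numel()))   # number of weights to perturb
    threshold = torch.topk(scores, k).values.min()        # smallest kept score
    flat_mask = (scores >= threshold).float()
    # Split the flat mask back into per-parameter masks.
    masks, offset = [], 0
    for p in params:
        n = p.numel()
        masks.append(flat_mask[offset:offset + n].view_as(p))
        offset += n
    return masks
```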
Dynamic sparse training.
Finding a sparse network by pruning unimportant weights is a popular
solution in network compression, which can be traced back decades [35]. The widely used
training scheme, i.e., pretraining-pruning-fine-tuning, was presented by Han et al. [23]. To avoid the
requirement of a pre-trained model, some recent research [15, 2, 9, 30, 43, 37, 38] attempts to
discover a sparse network directly during training. Dynamic Sparse Training (DST) finds
the sparse structure by dynamic parameter reallocation. The pruning criterion can be weight
magnitude [18], gradient [15], Hessian [35, 49], etc. Different from existing
DST methods that prune neurons, our target is to obtain a binary mask for sparse perturbation.
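To give a flavor of dynamic parameter reallocation, here is a minimal drop-and-grow mask update driven by the weight-magnitude criterion; the function name, the random regrowth rule, and the drop ratio are illustrative assumptions rather than the SSAM-D procedure.

```python
import torch

def update_mask(weight, mask, drop_ratio=0.1):
    """Illustrative drop-and-grow step in dynamic sparse training: drop the
    smallest-magnitude active entries, then regrow the same number at random
    inactive positions, keeping the overall sparsity fixed."""
    active = mask.bool()
    n_drop = int(drop_ratio * active.sum().item())
    if n_drop == 0:
        return mask
    # Drop: deactivate the active entries with the smallest |weight|.
    magnitude = weight.abs().masked_fill(~active, float("inf"))
    drop_idx = torch.topk(magnitude.flatten(), n_drop, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    # Grow: reactivate an equal number of randomly chosen inactive entries.
    inactive_idx = (new_mask == 0).nonzero(as_tuple=True)[0]
    grow_idx = inactive_idx[torch.randperm(inactive_idx.numel())[:n_drop]]
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)
```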
3 Rethinking the Perturbation in SAM
In this section, we first review how SAM converges to a flat minimum of the loss landscape. Then,
we rethink the role of perturbation in SAM.
3.1 Preliminary
In this paper, we consider the weights of a deep neural network as w = (w_1, w_2, ..., w_d) ∈ W ⊆ R^d
and denote a binary mask as m ∈ {0, 1}^d, which satisfies 1^T m = (1 − s) · d (where s denotes the
sparsity of the mask) to restrict the computational cost. Given a training dataset S ≜ {(x_i, y_i)}_{i=1}^n
drawn i.i.d. from the distribution D, the per-data-point loss function is defined as f(w, x_i, y_i). For the
classification task in this paper, we use cross-entropy as the loss function. The population loss is defined
as f_D ≜ E_{(x_i, y_i)∼D} f(w, x_i, y_i), while the empirical training loss is
f_S ≜ (1/n) Σ_{i=1}^n f(w, x_i, y_i).
Sharpness-aware minimization (SAM) [17] aims to simultaneously minimize the loss value and
smooth the loss landscape, which is achieved by solving the min-max problem:

    min_w max_{||ε||_2 ≤ ρ} f_S(w + ε).    (1)
SAM first obtains the perturbation ε in a neighborhood ball with radius ρ. The
optimization then minimizes the loss of the perturbed weight w + ε. Intuitively, the goal of SAM is
that small perturbations to the weights should not significantly raise the empirical loss, which indicates
that SAM tends to converge to a flat minimum. To solve the min-max optimization, SAM approximately
calculates the perturbation using a first-order Taylor expansion around w:
    ε* = arg max_{||ε||_2 ≤ ρ} f_S(w + ε) ≈ arg max_{||ε||_2 ≤ ρ} f_S(w) + ε^T ∇_w f_S(w) = ρ · ∇_w f_S(w) / ||∇_w f_S(w)||_2.    (2)

In this way, the objective function can be rewritten as min_w f_S(w + ρ · ∇_w f_S(w) / ||∇_w f_S(w)||_2),
which can be implemented as a two-step gradient descent framework in PyTorch or TensorFlow:

• In the first step, the gradient at w is used to calculate the perturbation ε by Eq. 2. The model
weights are then updated to w + ε.
• In the second step, the gradient at w + ε is used to solve min_w f_S(w + ε), i.e., the weight w is
updated by this gradient.
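To make the two steps concrete, below is a minimal PyTorch sketch of one SAM update; the function signature, the gradient-norm guard, and the use of a generic base optimizer are illustrative assumptions rather than the authors' released implementation.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    """Minimal sketch of one SAM update (two forward-backward passes).
    Function and argument names are illustrative, not the authors' code."""
    # First forward-backward: the gradient at w defines the perturbation (Eq. 2).
    loss_fn(model(inputs), targets).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    eps = [rho * g / (grad_norm + 1e-12) for g in grads]

    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                      # climb to w + eps
    model.zero_grad()

    # Second forward-backward: the gradient at w + eps is used to update w.
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                      # restore the original weights w
    base_optimizer.step()                  # descend with the gradient taken at w + eps
    model.zero_grad()
```

Here base_optimizer would be, e.g., torch.optim.SGD over model.parameters(), and sam_step replaces the usual single backward pass plus optimizer step, which is why SAM roughly doubles the per-step cost.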
3.2 Rethinking the Perturbation Step of SAM
How does SAM work in flat subspace?
SAM perturbs all parameters indiscriminately, but the
fact is that merely about 5% of the parameter space is sharp while the rest is flat [31]. We are curious
whether perturbing the parameters in those already-flat dimensions leads to instability of the
optimization and impairs the improvement of generalization. To answer this question, we quantitatively
and qualitatively analyze the loss landscapes under different training schemes in Section 5, as shown
in Fig. 4 and Fig. 5. The results confirm our conjecture that optimizing some dimensions without
perturbation can help the model generalize better.
What is the difference between the gradients of SGD and SAM?
We investigate various neural
networks optimized with SAM and SGD on CIFAR10/100 and ImageNet, whose statistics are given
in Fig. 1. We use the relative difference ratio r, defined as r = log |(g_SAM − g_SGD) / g_SGD|, to
measure the difference between the gradients of SAM and SGD. As shown in Fig. 1, the parameters
with r less than 0 account for the vast majority of all parameters, indicating that most SAM gradients
are not significantly different from the corresponding SGD gradients. These results show that most
parameters of the model require no perturbation to reach flat minima, which well confirms our motivation.
Figure 1: The distribution of the relative difference ratio r among various models and datasets. There is
little difference between SAM and SGD gradients for most parameters, i.e., the ratio r is less than 0.
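As a small illustration of how such a statistic could be computed per parameter tensor (a sketch only; the tensors g_sam and g_sgd, the log base, and the epsilon guard are assumptions):

```python
import torch

def relative_difference_ratio(g_sam, g_sgd, eps=1e-12):
    """Sketch of the per-weight ratio r = log |(g_SAM - g_SGD) / g_SGD|.
    Base-10 log and the eps guard against division by zero are illustrative choices."""
    return torch.log10((g_sam - g_sgd).abs() / (g_sgd.abs() + eps))

# Hypothetical usage: pool r over all parameter tensors and check how many fall below 0.
# r_all = torch.cat([relative_difference_ratio(gs, gd).flatten()
#                    for gs, gd in zip(sam_grads, sgd_grads)])
# print((r_all < 0).float().mean())
```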
Inspired by the above observation and the promising hardware acceleration for sparse operations on
modern GPUs, we further propose Sparse SAM, a novel sparse perturbation approach, serving as an
implicit regularization to improve the efficiency and effectiveness of SAM.
4 Methodology
In this section, we first define the proposed Sparse SAM (SSAM), which strengthens SAM via sparse
perturbation. Afterwards, we introduce the instantiations of the sparse mask used in SSAM via Fisher
information and dynamic sparse training, dubbed SSAM-F and SSAM-D, respectively.
4.1 Sparse SAM
Motivated by the findings discussed in the introduction, Sparse SAM (SSAM) employs a sparse
binary mask to decide which parameters should be perturbed, thereby improving the efficiency of
sharpness-aware minimization. Specifically, the perturbation ε is multiplied element-wise by a sparse
binary mask m, and the objective function is then rewritten as
min_w f_S(w + ρ · ∇_w f_S(w) / ||∇_w f_S(w)||_2 ⊙ m). To
stabilize the optimization, the sparse binary mask m is updated at regular intervals during training. We
provide two solutions to obtain the sparse mask m, namely Fisher information based Sparse SAM
(SSAM-F) and dynamic sparse training based Sparse SAM (SSAM-D). The overall algorithms of
SSAM and the sparse mask generation are shown in Algorithm 1 and Algorithm 2, respectively.
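Restricting attention to the perturbation step only, a minimal sketch of the masked perturbation might look as follows (building on the SAM sketch above; the helper name and mask representation are assumptions):

```python
import torch

def ssam_perturbation(grads, masks, rho=0.05):
    """Minimal sketch of SSAM's sparse perturbation: the SAM perturbation is
    multiplied element-wise by a binary mask m, so only the selected
    parameters are perturbed. Names are illustrative."""
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    # eps_i = rho * g_i / ||g||_2, zeroed out wherever the mask is 0.
    return [rho * g / (grad_norm + 1e-12) * m for g, m in zip(grads, masks)]
```

The remaining two-step update is identical to SAM's; only the perturbation itself is sparsified by m.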
According to previous work [55] and the Ampere architecture equipped with sparse tensor cores [47, 46, 45],
there currently exists technical support for matrix multiplication with 50% fine-grained
sparsity [55]³. Therefore, SSAM with a 50% sparse perturbation has great potential to achieve true
training acceleration via sparse back-propagation.
³For instance, 2:4 sparsity for the A100 GPU.