
a question of whether we can calculate perturbations for only a subset of individual parameters, and thus make the optimizer focus on these important parameters.
To this end, we propose a novel optimization approach, Fisher SAM (FSAM), which introduces a Fisher mask to improve the efficiency and effectiveness of SAM. In short, FSAM first uses the Fisher information (Fisher, 1922) as the metric to identify the sharper parameters² and formulates a binary Fisher mask correspondingly. Then, the Fisher mask is multiplied with the perturbations to obtain sparse perturbations, which are lastly used to perform regularization in the parameter update. In this way, only part of the sharper parameters will be added into the perturbations, and the optimizer can thus focus more on these important parameters. Also, the sparse perturbations could enable training acceleration via sparse back-propagation³. Moreover, one may be concerned that the sparse Fisher mask would affect the convergence rate of FSAM (Lin et al., 2019). Hence, we theoretically provide a convergence analysis of FSAM, showing that the convergence of FSAM is independent of the Fisher mask.
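The masking idea described above can be sketched in a few lines. This is an illustrative reconstruction only, assuming the empirical Fisher information is approximated by squared per-parameter gradients and that a fixed fraction of parameters is kept; the function names (`fisher_mask`, `sparse_perturbation`) and the `keep_ratio` parameter are our own, not the paper's implementation.

```python
import numpy as np

def fisher_mask(grad, keep_ratio=0.1):
    """Binary mask selecting the parameters with the largest (empirical)
    Fisher information, approximated here by the squared gradient."""
    fisher = grad ** 2                       # empirical Fisher diagonal (assumption)
    k = max(1, int(keep_ratio * grad.size))  # number of parameters to keep
    thresh = np.sort(fisher.ravel())[-k]     # k-th largest Fisher value
    return (fisher >= thresh).astype(grad.dtype)

def sparse_perturbation(grad, rho=0.05, keep_ratio=0.1):
    """SAM-style perturbation eps = rho * g / ||g||, masked so that only
    the 'sharper' parameters are perturbed."""
    mask = fisher_mask(grad, keep_ratio)
    eps = rho * grad / (np.linalg.norm(grad) + 1e-12)
    return mask * eps                        # sparse perturbation

g = np.array([0.1, -2.0, 0.05, 1.5, -0.2])
eps = sparse_perturbation(g, rho=0.05, keep_ratio=0.4)
print(eps)  # only the two entries with the largest |gradient| are nonzero
```

With true sparse kernels, the masked entries of the perturbation (and the corresponding second backward pass) could be skipped entirely, which is the source of the potential acceleration noted above.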
We conduct a large-scale and systematic study to evaluate the performance and effectiveness of FSAM. Firstly, we apply SAM and FSAM to fine-tune various PLMs on parts of the GLUE and SuperGLUE benchmarks, where the results show that FSAM consistently outperforms the vanilla SAM by 0.67∼1.98 average score among these PLMs, and surpasses the Adam (Kingma and Ba, 2015) optimizer by 1.41∼1.91 points. Secondly, we conduct experiments on two popular generation tasks (i.e., XSUM and CoNLL2014) and show that FSAM delivers promising results against SAM. Lastly, quantitative analysis and in-depth discussion demonstrate the universality and effectiveness of FSAM in various complex scenarios, and show that FSAM indeed brings better model generalization. Specifically, we show that our Fisher mask strategy not only works well in SAM, but can also be applied to other SAM variants.
To summarize, our contributions are two-fold:
² We refer to these parameters as the important ones, because they will rise steeply during optimization and affect the model generalization significantly.
³ Since fine-grained sparse training is limited by current hardware, we do not achieve an actual sparse speedup in this work. Despite this, we still believe that FSAM has great potential to achieve true training acceleration in the future, with the development of hardware for fine-grained sparse operations.
(1) We propose a novel optimization approach (namely FSAM) with a theoretical convergence guarantee for PLMs. Specifically, FSAM improves the performance and efficiency of the recently proposed SAM via a Fisher mask strategy, which can also be applied to other SAM variants. (2) Extensive experiments show that FSAM consistently outperforms SAM by a large margin on both language understanding and generation tasks. The systematic study demonstrates the effectiveness and universality of FSAM in improving model generalization.
2 Related Work
SAM and its variants. Hochreiter and Schmidhuber (1994) first show the strong correlation between flat minima and the generalization of a model. Inspired by this, Foret et al. (2020) propose SAM to find a flat minimum and thus improve model generalization. While many existing works prove the effectiveness of SAM on various computer vision tasks (Wu et al., 2020; Chen et al., 2021; Zheng et al., 2021), the double forward-propagation process of SAM brings more computational cost. To this end, Du et al. (2021) propose Efficient SAM (ESAM) to reduce the computational cost of SAM. Additionally, there are also some efforts that focus on more efficient and effective SAM optimization (Zhuang et al., 2021; Kwon et al., 2021; Mi et al., 2022).
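The extra cost discussed above stems from SAM evaluating the gradient twice per update: once at the current weights to build the perturbation, and once at the perturbed weights to take the descent step. A minimal sketch of one vanilla SAM step on a toy quadratic loss (the loss and all names here are assumptions for illustration, not any paper's code):

```python
import numpy as np

def loss_grad(w):
    # Gradient of the toy loss L(w) = 0.5 * ||w||^2 (illustrative assumption).
    return w.copy()

def sam_step(w, lr=0.1, rho=0.05):
    g = loss_grad(w)                             # 1st gradient pass, at w
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction within the rho-ball
    g_adv = loss_grad(w + eps)                   # 2nd gradient pass, at w + eps
    return w - lr * g_adv                        # descend on the sharpness-aware gradient

w = np.array([1.0, -2.0])
w_new = sam_step(w)
```

Each call to `sam_step` costs two gradient evaluations where plain SGD costs one, which is precisely what efficiency-oriented variants such as ESAM try to reduce.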
Improving Generalization. Recently, we have witnessed numerous PLMs that have achieved tremendous success in the NLP community (Yang et al., 2019; Devlin et al., 2019; Brown et al., 2020; Lewis et al., 2020; Raffel et al., 2020; Joshi et al., 2020; He et al., 2020; Qi et al., 2021; Zhong et al., 2022). The current dominant fine-tuning approach needs to tune all pretrained parameters for each downstream task, which makes the PLM easily memorize the training data and thus leads to overfitting. To tackle this issue, some works attempt to introduce implicit and explicit regularization into the training of models, such as dropout (Srivastava et al., 2014), label smoothing (Müller et al., 2019), mixup (Zhang et al., 2018) and other data-augmentation methods (Sennrich et al., 2016; Wang et al., 2018b; Zhong et al., 2021; Wang et al., 2022; Ding et al.,
2022). On the other hand, motivated by the successful applications of SAM in the vision domain, Bahri et al. (2022) apply SAM to optimize the T5 (Raffel et al., 2020) model on multiple language tasks and show that SAM can improve