Exploring Mode Connectivity for Pre-trained Language Models
Yujia Qin1, Cheng Qian1, Jing Yi1, Weize Chen1, Yankai Lin2,3†, Xu Han1,
Zhiyuan Liu1,4,5†, Maosong Sun1,4,5†, Jie Zhou6
1NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing
2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing
3Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing
4International Innovation Center of Tsinghua University, Shanghai
5Quan Cheng Laboratory 6Pattern Recognition Center, WeChat AI, Tencent Inc.
{qyj20, qianc20, yi-j20}@mails.tsinghua.edu.cn
Indicates equal contribution. † Corresponding author.
Abstract
Recent years have witnessed the prevalent
application of pre-trained language models
(PLMs) in NLP. From the perspective of pa-
rameter space, PLMs provide generic initial-
ization, starting from which high-performance
minima could be found. Although plenty of
works have studied how to effectively and effi-
ciently adapt PLMs to high-performance min-
ima, little is known about the connection of
various minima reached under different adap-
tation configurations. In this paper, we in-
vestigate the geometric connections of differ-
ent minima through the lens of mode con-
nectivity, which measures whether two min-
ima can be connected with a low-loss path.
We conduct empirical analyses to investigate
three questions: (1) how could hyperparame-
ters, specific tuning methods, and training data
affect PLM’s mode connectivity? (2) How
does mode connectivity change during pre-
training? (3) How does the PLM’s task knowl-
edge change along the path connecting two
minima? In general, exploring the mode con-
nectivity of PLMs conduces to understanding
the geometric connection of different minima,
which may help us fathom the inner workings
of PLM downstream adaptation. The codes are
publicly available at https://github.com/thunlp/Mode-Connectivity-PLM.
1 Introduction
Recent years have witnessed the prevalent appli-
cation of pre-trained language models (PLMs)
in NLP (Han et al.,2021), with the state-of-the-
art across various NLP tasks consistently being
pushed (Devlin et al.,2019;Liu et al.,2019b;Raffel
et al.,2020). Through large-scale self-supervised
training, PLMs acquire versatile semantic (Liu
Indicates equal contribution.
Corresponding author.
et al.,2019a) and syntactic (Tenney et al.,2019)
knowledge, which could be utilized when conduct-
ing transfer learning on downstream tasks.
From the perspective of parameter space, PLMs
provide generic initialization for downstream adap-
tation. Starting from the initialization, many
high-performance minima can be found through
gradient-based optimization. Up to now, plenty
of works have studied how to effectively and ef-
ficiently adapt PLMs to high-performance min-
ima, including adjusting hyperparameters (Liu and
Wang,2021), conducting transfer learning using
auxiliary training data (Pruksachatkun et al.,2020),
tuning PLMs in a parameter-efficient way (Ding
et al.,2022), etc. Under different adaptation con-
figurations, PLMs may finally reach local minima
distributed in highly distinct regions. Although
these minima all correspond to excellent perfor-
mance (low loss), little has been known about their
geometric connection in the parameter space.
A straightforward way to explore such geomet-
ric connection is to look into the loss landscape
around different minima, which is inherently in-
tractable due to the high dimensionality brought by
the tremendous parameter size of PLMs. Instead
of probing the full landscape, we propose to inves-
tigate the relation of different minima through the
lens of mode connectivity (Garipov et al.,2018),
which measures whether two different minima can
be connected via a parametric path, along which
the loss of the downstream task remains low. Ex-
ploring the mode connectivity of PLMs contributes
to understanding the geometric connection among
different minima. Such connection reflects the in-
herent relation of various adaptation configurations
and may help us fathom the inner workings of PLM
downstream adaptation under different settings.
To the best of our knowledge, systematic studies
for the mode connectivity of PLMs are still lacking.
In this paper, we first investigate what factors may
affect PLM’s mode connectivity by answering the
following research questions:
(Q1) How could different adaptation config-
urations (hyperparameters, tuning methods,
and training data) affect PLM’s mode connec-
tivity?
(Q2) How does mode connectivity change
during pre-training?
We first consider the mode connectivity when
different minima are trained on the same dataset.
We investigate the effects of several hyperparame-
ters (e.g., training data order, initialization of the
tunable parameters, training steps, etc.) on PLM’s
mode connectivity, and find that among these fac-
tors, initialization has the greatest impact. In addi-
tion, we show that fine-tuning leads to better mode
connectivity than parameter-efficient delta tuning
(e.g., adapter (Houlsby et al.,2019)).
Then we extend the connectivity analysis to min-
ima trained on different datasets. We demonstrate
that: (1) the mode connectivity is good for two
minima trained on data belonging to the same dis-
tribution, but without overlap of specific instances.
This means instead of memorizing training data,
PLMs learn advanced task-level knowledge during
training, and mode connectivity originates from
the high overlap of task knowledge of two minima.
(2) Although two minima trained on different tasks
are inherently disconnected, pre-training gradually
pulls the optimal regions of different tasks closer
in an implicit way. This phenomenon may help
explain PLM’s excellent cross-task transferability.
Beyond exploring the effects that could affect
PLM’s mode connectivity, we also study the in-
trinsic properties of model solutions between two
minima, which leads to the third question:
(Q3) How does the PLM’s task knowledge
change along the path connecting two min-
ima?
In the experiments, we observe that for two min-
ima obtained independently on two tasks, when
traversing from the minimum trained on a source task
to that of a target task, a PLM suffers from catas-
trophic forgetting (McCloskey and Cohen,1989)
on the source task, and gradually absorbs the knowl-
edge from the target task. Besides, PLMs prioritize
forgetting elusive source knowledge and ac-
quiring easy-to-grasp target knowledge.
In general, to fathom the connection of minima
reached under different settings, we conduct empir-
ical studies on the mode connectivity of PLMs. We
also show that our findings may have potential sig-
nificance in broad research areas, such as designing
better ensemble methods for PLMs, understanding
the task-level transferability of PLMs and reveal-
ing the mechanism of PLM downstream adaptation.
We expect our evaluation setup and findings could
inspire more future works in this field.
2 Related Work
Adaptation Strategies of PLMs.
To effectively
and efficiently utilize the knowledge learned during
pre-training, many strategies have been developed
to better tune a PLM, including: (1) hyperparam-
eter search, which aims to find an optimal hyper-
parameter configuration through traditional grid
search or modern automated search (Liu and Wang,
2021); (2) pre-finetuning, which trains PLMs on
intermediate auxiliary tasks before fine-tuning on a
target task (Pruksachatkun et al.,2020;Aghajanyan
et al.,2021). In this way, PLMs achieve better
downstream performance by taking advantage of
cross-dataset knowledge transfer; (3) prompt learn-
ing, which casts downstream tasks into the form
of the pre-training objective by leveraging natural
language prompts (Brown et al.,2020;Schick and
Schütze,2021a,b). Prompt learning exhibits supe-
rior performance especially under the few-shot and
zero-shot scenarios; (4) delta tuning (also known
as parameter-efficient tuning). Optimizing all the
parameters of a PLM is computationally cumber-
some. As a lightweight alternative, delta tuning
optimizes only a few tunable parameters and keeps
other parameters frozen, achieving comparable per-
formance to fine-tuning (Ding et al.,2022).
Although plenty of adaptation strategies have
been developed to better tune a PLM, little is un-
derstood about the connection of local minima
reached under different training configurations. In
this work, we take the first step by utilizing mode
connectivity as the analysis tool.
Mode Connectivity for Neural Networks.
De-
spite being extremely high-dimensional, the loss
landscape of neural networks exhibits a simple
geometric pattern of mode connectivity (Garipov
et al.,2018;Freeman and Bruna,2017;Draxler
et al.,2018). It is shown that starting from dif-
ferent initialization, the local minima obtained by
gradient-based optimizations are often connected
by low-loss paths, along which high-performance
solutions could be easily found and ensembled to
achieve better performance (Garipov et al.,2018).
These paths are typically non-linear curves, which
require a process of curve finding with task super-
vision. Excellent mode connectivity indicates that
different minima are not isolated points in the pa-
rameter space, but essentially form a connected
manifold (Draxler et al.,2018).
Frankle et al. (2020) further contend that from
the same initialization, local minima obtained with
different training data order can be connected by a
linear low-loss path, reducing the burden of curve
finding. Such a phenomenon is dubbed as linear
mode connectivity, which is closely related to lot-
tery ticket hypothesis (Frankle and Carbin,2019),
and has direct implications for continual learning
(Mirzadeh et al.,2020). Compared with the non-
linear counterpart, linear mode connectivity is a
stronger constraint, requiring that the convex com-
bination of two minima stay in the same loss basin.
Previous works typically study mode connectiv-
ity using non-pretrained models in the field of com-
puter vision. Until recently, Neyshabur et al. (2020)
observe linear mode connectivity on pre-trained
vision models. Despite the great efforts spent, a
systematic understanding of the mode connectivity
of PLMs is still lacking. In this paper, we focus
on investigating the effects that would influence
PLM’s mode connectivity and analyze the knowl-
edge variation along the connecting path. Different
from existing works that study mode connectivity
for minima trained on the same dataset, we addi-
tionally extend the analysis to different datasets.
3 Mode Connectivity Evaluation
Preliminaries.
Consider adapting a PLM on a downstream task, let $C_1$ and $C_2$ be two distinct sets of training configurations that may differ in hyperparameters or data. We use $C_1$ and $C_2$ to train two copies of the PLM independently. The specific tuning method determines the tunable parameters $\theta_0$. After training, $\theta_0$ is adapted to $\theta_{C_1} \in \mathbb{R}^{|\theta_0|}$ and $\theta_{C_2} \in \mathbb{R}^{|\theta_0|}$, respectively, where $|\theta_0|$ denotes the total number of tunable parameters.
Connecting Path.
Based on Frankle et al.
(2020), the same initialization generally leads to a
linear low-loss path between two minima. Besides,
compared with the non-linear counterpart, linear-
ity is a more favorable property, which indicates a
closer connection between different minima. There-
fore, our first step is to investigate whether PLMs
have good linear mode connectivity. Specifically,
assume a continuous curve $\phi(\alpha): [0,1] \rightarrow \mathbb{R}^{|\theta_0|}$ connecting $\theta_{C_1}$ and $\theta_{C_2}$, satisfying $\phi(0) = \theta_{C_1}$ and $\phi(1) = \theta_{C_2}$; we consider a linear path as follows:

$$\phi(\alpha) = (1 - \alpha)\cdot\theta_{C_1} + \alpha\cdot\theta_{C_2}. \tag{1}$$
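To make this concrete, the following sketch (our illustration, not the authors' released code; it assumes PyTorch-style state_dicts that hold only the tunable parameters) builds the interpolation of Eq. (1) for a given α:

```python
def linear_interpolate(state_dict_1, state_dict_2, alpha):
    """Return phi(alpha) = (1 - alpha) * theta_C1 + alpha * theta_C2.

    Both arguments hold only the tunable parameters (all weights for
    fine-tuning, just the adapter/head weights for adapter tuning).
    """
    return {
        name: (1.0 - alpha) * state_dict_1[name] + alpha * state_dict_2[name]
        for name in state_dict_1
    }
```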
Connectivity Criterion.
After defining the
curve connecting both minima, we traverse along
the curve and evaluate the loss and performance
of the interpolations. We deem two minima $\theta_{C_1}$ and $\theta_{C_2}$ mode connected if there does not exist a significant loss barrier or performance drop along the defined curve between $\theta_{C_1}$ and $\theta_{C_2}$. In the experiments, we evaluate evenly distributed interpolations on $\phi(\alpha)$.
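A minimal sketch of how this criterion can be checked in practice (under the same assumptions as the interpolation sketch above; `evaluate_loss` and the barrier threshold `tol` are placeholders of ours, not values from the paper):

```python
def has_loss_barrier(model, sd1, sd2, evaluate_loss, num_points=26, tol=0.05):
    """Sweep evenly spaced alphas, load each interpolation into `model`,
    and report whether some interior point is worse than both endpoints
    by more than `tol` (a hypothetical threshold)."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    losses = []
    for alpha in alphas:
        # strict=False: for adapter tuning the state_dict covers only a subset.
        model.load_state_dict(linear_interpolate(sd1, sd2, alpha), strict=False)
        losses.append(evaluate_loss(model))
    barrier = max(losses) - max(losses[0], losses[-1])
    return barrier > tol, barrier
```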
4 Empirical Analysis
In this section, we conduct experiments to investi-
gate the aforementioned research questions.
Q1. (a) How could different hyperparame-
ters and the specific tuning method affect the
mode connectivity of PLMs?
We first investigate the effects of several hyperpa-
rameters that could affect PLM’s mode connectiv-
ity, including (1) training data order, initialization
of tunable parameters, training step (main paper),
(2) learning rate and batch size (appendix B.1). To
explore the effects of the specific tuning method,
we experiment with both fine-tuning and a represen-
tative delta tuning method, i.e., adapter (Houlsby
et al.,2019). Adapter inserts tunable modules
into a PLM and keeps other parameters fixed dur-
ing adaptation. Unless otherwise specified, we
mainly conduct the experiments using T5-BASE
(Raffel et al., 2020), and choose two representative
NLP tasks (MNLI (Williams et al., 2018) and
ReCoRD (Zhang et al., 2018)).
In each experiment, the training configurations
of the two endpoints only differ in one hyperparam-
eter, while other settings are kept the same for a fair
comparison. To explore the effects of training steps,
we evaluate the performance when both endpoints are trained for {10k, 30k, 50k} steps, respectively. We evaluate 24 evenly distributed interpolations and 2 endpoints along a linear path, i.e., we evaluate a series of $\phi(\alpha)$, where $\alpha \in \{\frac{0}{25}, \frac{1}{25}, \ldots, \frac{25}{25}\}$.
Figure 1: The performance of linear interpolations between two minima trained with different training data order.

Figure 2: The performance of linear interpolations between two minima trained with different initialization.

Since we find that the trends of loss and performance are generally highly correlated (i.e., a performance drop corresponds to a loss barrier), we report the performance in the main paper and leave the results of loss in appendix D. All experiments are conducted 3 times with random seeds and we report the average results on test sets. For more training details, please refer to appendix C.
Effects of Training Data Order.
PLM’s down-
stream adaptation generally involves mini-batch
gradient-based optimization, where training sam-
ples are learned in a random order. To explore
its effect, we adapt two copies of a PLM with
two different random data order. Then we visu-
alize the performance of linear interpolations in
Figure 1, from which we observe that for fine-
tuning, both endpoints are well connected by a
linear path; while for adapter tuning, there exists
a slight but negligible performance drop near the
midpoint. In general, we conclude that
local min-
ima are well connected under different random
training data order.
Effects of Initialization.
Before downstream
adaptation, additional parameters (e.g., extra mod-
ules defined by delta tuning, the classification head,
etc.) may be introduced; in addition, Wu et al.
(2022) recently show that adding noise to the pre-
trained weights improves the fine-tuning perfor-
mance on downstream tasks. Thus, both fine-
tuning and delta tuning require proper initializa-
tion for the tunable parameters. Since different
initialization could lead to distinct optimization tra-
jectories, we explore the effects of initialization on
PLM’s mode connectivity.
Specifically, for those newly introduced modules,
we randomly initialize them with a Gaussian distri-
bution; for those pre-trained weights that require
tuning, we add a random Gaussian noise. Two end-
points are initialized with the same configuration
(e.g., mean and standard deviation of the Gaussian
distribution), but different random seeds. The lin-
ear interpolation results are depicted in Figure 2,
from which we observe that the mode connectivity
of fine-tuning is generally good; while for adapter
tuning, there exists a significant performance drop
between two differently initialized minima. This
means starting from different initialization, PLMs
tend to reach non-connected local minima in the
parameter space, especially for delta tuning. In
short,
initialization of tunable parameters has a
great impact on mode connectivity.
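For illustration, the initialization scheme described above could look like the sketch below (a sketch under our own assumptions; the module names, standard deviations, and noise scale are hypothetical and not taken from the paper):

```python
import torch

def initialize_endpoint(model, new_module_names, seed, init_std=0.02, noise_std=1e-3):
    """Initialize one endpoint: newly introduced modules (e.g., adapters or a
    classification head) get a fresh Gaussian initialization, while tunable
    pre-trained weights receive small additive Gaussian noise. Two endpoints
    share the Gaussian configuration but use different values of `seed`."""
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue  # frozen pre-trained weights are left untouched
            is_new = any(m in name for m in new_module_names)
            std = init_std if is_new else noise_std
            noise = (torch.randn(param.shape, generator=gen) * std).to(param.device)
            if is_new:
                param.copy_(noise)  # random Gaussian initialization
            else:
                param.add_(noise)   # perturb pre-trained weights before tuning
    return model
```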
Effects of Training Step.
As mentioned before,
the experiments in Figures 1 and 2 are conducted when both minima are trained for {10k, 30k, 50k} steps. Comparing the results at different training
steps, we observe that (1) longer training leads to
poorer connectivity for adapter tuning under cer-
tain cases; while (2) the mode connectivity of fine-
tuning is good at different steps. In appendix B.2,
we further show that (1) the mode connectivity be-
comes poorer when one endpoint is trained with
more steps while the other is trained with fixed
steps, and (2) with the training step increasing, the
Euclidean distance between two minima also grows, which may partially explain the poorer mode connectivity.

Figure 3: The performance of interpolations along a non-linear path connecting two minima, which are trained with adapter tuning from different initialization.
Effects of Tuning Method.
Comparing the re-
sults of fine-tuning and adapter tuning in Figure 1
and 2, we observe that in general, the linear mode
connectivity of fine-tuning is better than adapter
tuning. In other words, when using fine-tuning,
PLMs are more likely to be optimized to linearly-
connected minima. A similar phenomenon also
occurs for minima trained with different learning
rates or batch sizes (see appendix B.1). Consider-
ing that adapter optimizes only 2.38% as many parameters as fine-tuning, we hypothesize that more tunable
parameters may yield better mode connectivity and
leave more explorations as future work.
Different Minima are Generally Connected by
a Non-linear Path.
Considering that linearity is
a strong constraint for mode connectivity, even if
a direct linear path connecting two minima incurs
a high loss, both minima may still be connected
by a low-loss non-linear path. To explore whether
this holds for PLMs, we follow the setting of tun-
ing adapters with different initialization, which has
been shown in Figure 2 to have poor linear mode connectivity. We try to use the supervision from the downstream task to find a low-loss parametric path connecting two endpoints $\theta_{C_1}$ and $\theta_{C_2}$. Following Garipov et al. (2018), we consider a quadratic Bezier curve defined as follows:

$$\phi_{\theta}(\alpha) = (1 - \alpha)^2\cdot\theta_{C_1} + 2\alpha(1 - \alpha)\cdot\theta + \alpha^2\cdot\theta_{C_2}, \tag{2}$$
where $\theta \in \mathbb{R}^{|\theta_0|}$ denotes the tunable parameters of the curve. During curve finding, $\theta_{C_1}$ and $\theta_{C_2}$ are kept frozen, and only $\theta$ is optimized. Denoting $\mathcal{L}$ as the loss function of the task, the training objective is to minimize the expected loss on the curve over a uniform distribution $U(0,1)$, i.e., $\mathbb{E}_{\alpha \sim U(0,1)}\mathcal{L}(\phi_{\theta}(\alpha))$. For more details, please refer to appendix A.
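The curve-finding procedure can be sketched as follows (a paraphrase of Garipov et al. (2018) under our own assumptions, not the authors' released implementation; it assumes a Hugging Face-style model whose forward pass returns a `.loss` when labels are in the batch, and the optimizer settings are hypothetical):

```python
import itertools
import torch

def bezier_point(sd1, sd_mid, sd2, alpha):
    """phi_theta(alpha) = (1-a)^2 * theta_C1 + 2a(1-a) * theta + a^2 * theta_C2 (Eq. 2)."""
    return {
        name: (1 - alpha) ** 2 * sd1[name]
        + 2 * alpha * (1 - alpha) * sd_mid[name]
        + alpha ** 2 * sd2[name]
        for name in sd_mid
    }

def find_curve(model, sd1, sd2, data_loader, steps=2000, lr=1e-4):
    """Optimize the control point theta so that E_{a~U(0,1)} L(phi_theta(a)) is low.
    The endpoints sd1 and sd2 stay frozen; only theta receives gradients."""
    theta = {n: ((sd1[n] + sd2[n]) / 2).clone().requires_grad_(True) for n in sd1}
    optimizer = torch.optim.Adam(list(theta.values()), lr=lr)
    for batch in itertools.islice(itertools.cycle(data_loader), steps):
        alpha = torch.rand(1).item()                  # a ~ U(0, 1)
        point = bezier_point(sd1, theta, sd2, alpha)  # weights on the curve
        # Forward pass with the interpolated weights; parameters not in `point`
        # (e.g., the frozen backbone under adapter tuning) come from `model` itself.
        loss = torch.func.functional_call(model, point, args=(), kwargs=batch).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return theta
```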
Figure 4: Linear mode connectivity analysis for two
minima trained with in-distribution MNLI data. The
results on ReCoRD are left in appendix B.4.
We visualize the performance of the interpola-
tion on the found Bezier curve in Figure 3. We
observe that the two minima are well-connected by
the found Bezier curve, without a significant perfor-
mance drop. In fact, such a low-loss Bezier curve
exists for minima reached under various different
settings (see more experiments in appendix B.3).
Given the above results, we conjecture that
there
may exist multiple loss basins which are con-
nected via a low-loss non-linear path, instead
of a linear path. For most of the minima within
the same loss basin, their convex combination
also lies in this basin
. In this sense, if two minima
are connected linearly, then both of them probably
lie in the same basin; otherwise in different basins
(e.g., the case of adapter tuning with different ini-
tialization).
Q1. (b) What are the effects of training
data?
In previous experiments, we focus on the connec-
tivity of two minima trained with the same dataset.
From now on, we extend the mode connectivity to
two minima trained on different datasets, focusing
on two facets: data overlap and data domain.
Effects of Data Overlap.
PLMs have been
demonstrated to be adept at memorizing the train-
ing data (Carlini et al.,2021,2022). To show that
the connectivity of both minima does not originate
from PLM’s memorization, we explore whether
such mode connectivity still exists when two min-
ima are obtained on data belonging to the same
distribution, but without overlap of specific train-
ing samples. Specifically, we partition the original
training data of MNLI into two equal splits. Then
we adapt two copies of T5-BASE on either split using
the same training configurations. The experiments
are conducted using both fine-tuning and adapter
tuning.
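For concreteness, a disjoint 50/50 split of this kind can be produced as below (a sketch using the Hugging Face `datasets` library; the shuffle seed is our own assumption):

```python
from datasets import load_dataset

# Shuffle once so that both halves follow the same distribution
# while sharing no training instances.
mnli_train = load_dataset("glue", "mnli", split="train").shuffle(seed=42)
half = len(mnli_train) // 2
split_a = mnli_train.select(range(half))                   # trains endpoint 1
split_b = mnli_train.select(range(half, len(mnli_train)))  # trains endpoint 2
```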