Exploring Mode Connectivity for Pre-trained Language Models
Yujia Qin1, Cheng Qian1, Jing Yi1, Weize Chen1, Yankai Lin2,3†, Xu Han1,
Zhiyuan Liu1,4,5†, Maosong Sun1,4,5†, Jie Zhou6
1NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing
2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing
3Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing
4International Innovation Center of Tsinghua University, Shanghai
5Quan Cheng Laboratory 6Pattern Recognition Center, WeChat AI, Tencent Inc.
{qyj20, qianc20, yi-j20}@mails.tsinghua.edu.cn
Indicates equal contribution. † Corresponding author.
Abstract
Recent years have witnessed the prevalent
application of pre-trained language models
(PLMs) in NLP. From the perspective of pa-
rameter space, PLMs provide generic initial-
ization, starting from which high-performance
minima could be found. Although plenty of
works have studied how to effectively and effi-
ciently adapt PLMs to high-performance min-
ima, little is known about the connection of
various minima reached under different adap-
tation configurations. In this paper, we in-
vestigate the geometric connections of differ-
ent minima through the lens of mode con-
nectivity, which measures whether two min-
ima can be connected with a low-loss path.
We conduct empirical analyses to investigate
three questions: (1) how could hyperparame-
ters, specific tuning methods, and training data
affect PLM’s mode connectivity? (2) How
does mode connectivity change during pre-
training? (3) How does the PLM’s task knowl-
edge change along the path connecting two
minima? In general, exploring the mode con-
nectivity of PLMs conduces to understanding
the geometric connection of different minima,
which may help us fathom the inner workings
of PLM downstream adaptation. The codes are
publicly available at https://github.com/thunlp/Mode-Connectivity-PLM.
1 Introduction
Recent years have witnessed the prevalent appli-
cation of pre-trained language models (PLMs)
in NLP (Han et al.,2021), with the state-of-the-
art across various NLP tasks consistently being
pushed (Devlin et al.,2019;Liu et al.,2019b;Raffel
et al.,2020). Through large-scale self-supervised
training, PLMs acquire versatile semantic (Liu
Indicates equal contribution.
Corresponding author.
et al.,2019a) and syntactic (Tenney et al.,2019)
knowledge, which could be utilized when conduct-
ing transfer learning on downstream tasks.
From the perspective of parameter space, PLMs
provide generic initialization for downstream adap-
tation. Starting from the initialization, many
high-performance minima can be found through
gradient-based optimization. Up to now, plenty
of works have studied how to effectively and ef-
ficiently adapt PLMs to high-performance min-
ima, including adjusting hyperparameters (Liu and
Wang,2021), conducting transfer learning using
auxiliary training data (Pruksachatkun et al.,2020),
tuning PLMs in a parameter-efficient way (Ding
et al.,2022), etc. Under different adaptation con-
figurations, PLMs may finally reach local minima
distributed in highly distinct regions. Although
these minima all correspond to excellent perfor-
mance (low loss), little has been known about their
geometric connection in the parameter space.
A straightforward way to explore such geomet-
ric connection is to look into the loss landscape
around different minima, which is inherently in-
tractable due to the high dimensionality brought by
the tremendous parameter size of PLMs. Instead
of probing the full landscape, we propose to inves-
tigate the relation of different minima through the
lens of mode connectivity (Garipov et al.,2018),
which measures whether two different minima can
be connected via a parametric path, along which
the loss of the downstream task remains low. Ex-
ploring the mode connectivity of PLMs contributes
to understanding the geometric connection among
different minima. Such connection reflects the in-
herent relation of various adaptation configurations
and may help us fathom the inner workings of PLM
downstream adaptation under different settings.
To the best of our knowledge, systematic studies
for the mode connectivity of PLMs are still lacking.
In this paper, we first investigate what factors may
affect PLM’s mode connectivity by answering the
following research questions:
(Q1) How could different adaptation config-
urations (hyperparameters, tuning methods,
and training data) affect PLM’s mode connec-
tivity?
(Q2) How does mode connectivity change
during pre-training?
We first consider the mode connectivity when
different minima are trained on the same dataset.
We investigate the effects of several hyperparame-
ters (e.g., training data order, initialization of the
tunable parameters, training steps, etc.) on PLM’s
mode connectivity, and find that among these fac-
tors, initialization has the greatest impact. In addi-
tion, we show that fine-tuning leads to better mode
connectivity than parameter-efficient delta tuning
(e.g., adapter (Houlsby et al.,2019)).
Then we extend the connectivity analysis to min-
ima trained on different datasets. We demonstrate
that: (1) the mode connectivity is good for two
minima trained on data belonging to the same dis-
tribution, but without overlap of specific instances.
This means instead of memorizing training data,
PLMs learn advanced task-level knowledge during
training, and mode connectivity originates from
the high overlap of task knowledge of two minima.
(2) Although two minima trained on different tasks
are inherently disconnected, pre-training gradually
pulls the optimal regions of different tasks closer
in an implicit way. This phenomenon may help
explain PLM’s excellent cross-task transferability.
Beyond exploring the effects that could affect
PLM’s mode connectivity, we also study the in-
trinsic properties of model solutions between two
minima, which leads to the third question:
(Q3) How does the PLM’s task knowledge
change along the path connecting two min-
ima?
In the experiments, we observe that for two min-
ima obtained independently on two tasks, when
traversing from the minimum trained on a source task
to that of a target task, a PLM suffers from catas-
trophic forgetting (McCloskey and Cohen,1989)
on the source task, and gradually absorbs the knowl-
edge from the target task. Besides, PLMs prioritize
forgetting elusive source knowledge and ac-
quiring easy-to-grasp target knowledge.
In general, to fathom the connection of minima
reached under different settings, we conduct empir-
ical studies on the mode connectivity of PLMs. We
also show that our findings may have potential sig-
nificance in broad research areas, such as designing
better ensemble methods for PLMs, understanding
the task-level transferability of PLMs and reveal-
ing the mechanism of PLM downstream adaptation.
We expect our evaluation setup and findings could
inspire more future works in this field.
2 Related Work
Adaptation Strategies of PLMs.
To effectively
and efficiently utilize the knowledge learned during
pre-training, many strategies have been developed
to better tune a PLM, including: (1) hyperparam-
eter search, which aims to find an optimal hyper-
parameter configuration through traditional grid
search or modern automated search (Liu and Wang,
2021); (2) pre-finetuning, which trains PLMs on
intermediate auxiliary tasks before fine-tuning on a
target task (Pruksachatkun et al.,2020;Aghajanyan
et al.,2021). In this way, PLMs achieve better
downstream performance by taking advantage of
cross-dataset knowledge transfer; (3) prompt learn-
ing, which casts downstream tasks into the form
of the pre-training objective by leveraging natural
language prompts (Brown et al.,2020;Schick and
Schütze,2021a,b). Prompt learning exhibits supe-
rior performance especially under the few-shot and
zero-shot scenarios; (4) delta tuning (also known
as parameter-efficient tuning). Optimizing all the
parameters of a PLM is computationally cumber-
some. As a lightweight alternative, delta tuning
optimizes only a few tunable parameters and keeps
other parameters frozen, achieving comparable per-
formance to fine-tuning (Ding et al.,2022).
Although plenty of adaptation strategies have
been developed to better tune a PLM, little is un-
derstood about the connection of local minima
reached under different training configurations. In
this work, we take the first step by utilizing mode
connectivity as the analysis tool.
Mode Connectivity for Neural Networks.
De-
spite being extremely high-dimensional, the loss
landscape of neural networks exhibits a simple
geometric pattern of mode connectivity (Garipov
et al.,2018;Freeman and Bruna,2017;Draxler
et al.,2018). It is shown that starting from dif-
ferent initialization, the local minima obtained by
gradient-based optimizations are often connected
by low-loss paths, along which high-performance
solutions could be easily found and ensembled to
achieve better performance (Garipov et al.,2018).
These paths are typically non-linear curves, which
require a process of curve finding with task super-
vision. Excellent mode connectivity indicates that
different minima are not isolated points in the pa-
rameter space, but essentially form a connected
manifold (Draxler et al.,2018).
Frankle et al. (2020) further contend that from
the same initialization, local minima obtained with
different training data order can be connected by a
linear low-loss path, reducing the burden of curve
finding. Such a phenomenon is dubbed as linear
mode connectivity, which is closely related to lot-
tery ticket hypothesis (Frankle and Carbin,2019),
and has direct implications for continual learning
(Mirzadeh et al.,2020). Compared with the non-
linear counterpart, linear mode connectivity is a
stronger constraint, requiring that the convex com-
bination of two minima stay in the same loss basin.
Previous works typically study mode connectiv-
ity using non-pretrained models in the field of com-
puter vision. Until recently, Neyshabur et al. (2020)
observe linear mode connectivity on pre-trained
vision models. Despite the great efforts spent, a
systematic understanding of the mode connectivity
of PLMs is still lacking. In this paper, we focus
on investigating the effects that would influence
PLM’s mode connectivity and analyze the knowl-
edge variation along the connecting path. Different
from existing works that study mode connectivity
for minima trained on the same dataset, we addi-
tionally extend the analysis to different datasets.
3 Mode Connectivity Evaluation
Preliminaries.
Consider adapting a PLM on a downstream task, let $C_1$ and $C_2$ be two distinct sets of training configurations that may differ in hyperparameters or data. We use $C_1$ and $C_2$ to train two copies of the PLM independently. The specific tuning method determines the tunable parameters $\theta_0$. After training, $\theta_0$ is adapted to $\theta_{C_1} \in \mathbb{R}^{|\theta_0|}$ and $\theta_{C_2} \in \mathbb{R}^{|\theta_0|}$, respectively, where $|\theta_0|$ denotes the total number of tunable parameters.
Connecting Path.
Based on Frankle et al.
(2020), the same initialization generally leads to a
linear low-loss path between two minima. Besides,
compared with the non-linear counterpart, linear-
ity is a more favorable property, which indicates a
closer connection between different minima. There-
fore, our first step is to investigate whether PLMs
have good linear mode connectivity. Specifically,
assume a continuous curve $\phi(\alpha): [0,1] \rightarrow \mathbb{R}^{|\theta_0|}$ connecting $\theta_{C_1}$ and $\theta_{C_2}$, satisfying $\phi(0) = \theta_{C_1}$ and $\phi(1) = \theta_{C_2}$; we consider a linear path as follows:

$$\phi(\alpha) = (1 - \alpha)\cdot\theta_{C_1} + \alpha\cdot\theta_{C_2}. \tag{1}$$
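To make this concrete, the following sketch (our illustration, not the authors' released code; it assumes PyTorch-style state_dicts that hold only the tunable parameters) builds the interpolation of Eq. (1) for a given α:

```python
def linear_interpolate(state_dict_1, state_dict_2, alpha):
    """Return phi(alpha) = (1 - alpha) * theta_C1 + alpha * theta_C2.

    Both arguments hold only the tunable parameters (all weights for
    fine-tuning, just the adapter/head weights for adapter tuning).
    """
    return {
        name: (1.0 - alpha) * state_dict_1[name] + alpha * state_dict_2[name]
        for name in state_dict_1
    }
```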
Connectivity Criterion.
After defining the
curve connecting both minima, we traverse along
the curve and evaluate the loss and performance
of the interpolations. We deem two minima $\theta_{C_1}$ and $\theta_{C_2}$ mode connected if there does not exist a significant loss barrier or performance drop along the defined curve between $\theta_{C_1}$ and $\theta_{C_2}$. In the experiments, we evaluate evenly distributed interpolations on $\phi(\alpha)$.
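A minimal sketch of how this criterion can be checked in practice (under the same assumptions as the interpolation sketch above; `evaluate_loss` and the barrier threshold `tol` are placeholders of ours, not values from the paper):

```python
def has_loss_barrier(model, sd1, sd2, evaluate_loss, num_points=26, tol=0.05):
    """Sweep evenly spaced alphas, load each interpolation into `model`,
    and report whether some interior point is worse than both endpoints
    by more than `tol` (a hypothetical threshold)."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    losses = []
    for alpha in alphas:
        # strict=False: for adapter tuning the state_dict covers only a subset.
        model.load_state_dict(linear_interpolate(sd1, sd2, alpha), strict=False)
        losses.append(evaluate_loss(model))
    barrier = max(losses) - max(losses[0], losses[-1])
    return barrier > tol, barrier
```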
4 Empirical Analysis
In this section, we conduct experiments to investi-
gate the aforementioned research questions.
Q1. (a) How could different hyperparame-
ters and the specific tuning method affect the
mode connectivity of PLMs?
We first investigate the effects of several hyperpa-
rameters that could affect PLM’s mode connectiv-
ity, including (1) training data order, initialization
of tunable parameters, training step (main paper),
(2) learning rate and batch size (appendix B.1). To
explore the effects of the specific tuning method,
we experiment with both fine-tuning and a represen-
tative delta tuning method, i.e., adapter (Houlsby
et al.,2019). Adapter inserts tunable modules
into a PLM and keeps other parameters fixed dur-
ing adaptation. Unless otherwise specified, we
mainly conduct the experiments using T5-BASE
(Raffel et al., 2020), and choose two representative
NLP tasks (MNLI (Williams et al., 2018) and
ReCoRD (Zhang et al., 2018)).
In each experiment, the training configurations
of the two endpoints only differ in one hyperparam-
eter, while other settings are kept the same for a fair
comparison. To explore the effects of training steps,
we evaluate the performance when both endpoints are trained for {10k, 30k, 50k} steps, respectively. We evaluate 24 evenly distributed interpolations and 2 endpoints along a linear path, i.e., we evaluate a series of $\phi(\alpha)$, where $\alpha \in \{\frac{0}{25}, \frac{1}{25}, \ldots, \frac{25}{25}\}$.
Figure 1: The performance of linear interpolations between two minima trained with different training data order.

Figure 2: The performance of linear interpolations between two minima trained with different initialization.

Since we find that the trends of loss and performance are generally highly correlated (i.e., a performance drop corresponds to a loss barrier), we report the performance in the main paper and leave the results of loss in appendix D. All experiments are conducted 3 times with random seeds and we report the average results on test sets. For more training details, please refer to appendix C.
Effects of Training Data Order.
PLM’s down-
stream adaptation generally involves mini-batch
gradient-based optimization, where training sam-
ples are learned in a random order. To explore
its effect, we adapt two copies of a PLM with
two different random data order. Then we visu-
alize the performance of linear interpolations in
Figure 1, from which we observe that for fine-
tuning, both endpoints are well connected by a
linear path; while for adapter tuning, there exists
a slight but negligible performance drop near the
midpoint. In general, we conclude that
local min-
ima are well connected under different random
training data order.
Effects of Initialization.
Before downstream
adaptation, additional parameters (e.g., extra mod-
ules defined by delta tuning, the classification head,
etc.) may be introduced; in addition, Wu et al.
(2022) recently show that adding noise to the pre-
trained weights improves the fine-tuning perfor-
mance on downstream tasks. Thus, both fine-
tuning and delta tuning require proper initializa-
tion for the tunable parameters. Since different
initialization could lead to distinct optimization tra-
jectories, we explore the effects of initialization on
PLM’s mode connectivity.
Specifically, for those newly introduced modules,
we randomly initialize them with a Gaussian distri-
bution; for those pre-trained weights that require
tuning, we add a random Gaussian noise. Two end-
points are initialized with the same configuration
(e.g., mean and standard deviation of the Gaussian
distribution), but different random seeds. The lin-
ear interpolation results are depicted in Figure 2,
from which we observe that the mode connectivity
of fine-tuning is generally good; while for adapter
tuning, there exists a significant performance drop
between two differently initialized minima. This
means starting from different initialization, PLMs
tend to reach non-connected local minima in the
parameter space, especially for delta tuning. In
short,
initialization of tunable parameters has a
great impact on mode connectivity.
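For illustration, the initialization scheme described above could look like the sketch below (a sketch under our own assumptions; the module names, standard deviations, and noise scale are hypothetical and not taken from the paper):

```python
import torch

def initialize_endpoint(model, new_module_names, seed, init_std=0.02, noise_std=1e-3):
    """Initialize one endpoint: newly introduced modules (e.g., adapters or a
    classification head) get a fresh Gaussian initialization, while tunable
    pre-trained weights receive small additive Gaussian noise. Two endpoints
    share the Gaussian configuration but use different values of `seed`."""
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue  # frozen pre-trained weights are left untouched
            is_new = any(m in name for m in new_module_names)
            std = init_std if is_new else noise_std
            noise = (torch.randn(param.shape, generator=gen) * std).to(param.device)
            if is_new:
                param.copy_(noise)  # random Gaussian initialization
            else:
                param.add_(noise)   # perturb pre-trained weights before tuning
    return model
```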
Effects of Training Step.
As mentioned before,
the experiments in Figures 1 and 2 are conducted when both minima are trained for {10k, 30k, 50k} steps. Comparing the results at different training
steps, we observe that (1) longer training leads to
poorer connectivity for adapter tuning under cer-
tain cases; while (2) the mode connectivity of fine-
tuning is good at different steps. In appendix B.2,
we further show that (1) the mode connectivity be-
comes poorer when one endpoint is trained with
more steps while the other is trained with fixed
steps, and (2) with the training step increasing, the
Euclidean distance between two minima also grows, which may partially explain the poorer mode connectivity.

Figure 3: The performance of interpolations along a non-linear path connecting two minima, which are trained with adapter tuning from different initialization.
Effects of Tuning Method.
Comparing the re-
sults of fine-tuning and adapter tuning in Figure 1
and 2, we observe that in general, the linear mode
connectivity of fine-tuning is better than adapter
tuning. In other words, when using fine-tuning,
PLMs are more likely to be optimized to linearly-
connected minima. A similar phenomenon also
occurs for minima trained with different learning
rates or batch sizes (see appendix B.1). Consider-
ing that adapter optimizes only 2.38% as many parameters as fine-tuning, we hypothesize that more tunable
parameters may yield better mode connectivity and
leave more explorations as future work.
Different Minima are Generally Connected by
a Non-linear Path.
Considering that linearity is
a strong constraint for mode connectivity, even if
a direct linear path connecting two minima incurs
a high loss, both minima may still be connected
by a low-loss non-linear path. To explore whether
this holds for PLMs, we follow the setting of tun-
ing adapters with different initialization, which has
been shown in Figure 2 to have poor linear mode connectivity. We try to use the supervision from the downstream task to find a low-loss parametric path connecting two endpoints $\theta_{C_1}$ and $\theta_{C_2}$. Following Garipov et al. (2018), we consider a quadratic Bezier curve defined as follows:

$$\phi_{\theta}(\alpha) = (1 - \alpha)^2\cdot\theta_{C_1} + 2\alpha(1 - \alpha)\cdot\theta + \alpha^2\cdot\theta_{C_2}, \tag{2}$$
where $\theta \in \mathbb{R}^{|\theta_0|}$ denotes the tunable parameters of the curve. During curve finding, $\theta_{C_1}$ and $\theta_{C_2}$ are kept frozen, and only $\theta$ is optimized. Denoting $\mathcal{L}$ as the loss function of the task, the training objective is to minimize the expected loss on the curve over a uniform distribution $U(0,1)$, i.e., $\mathbb{E}_{\alpha \sim U(0,1)}\mathcal{L}(\phi_{\theta}(\alpha))$. For more details, please refer to appendix A.
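The curve-finding procedure can be sketched as follows (a paraphrase of Garipov et al. (2018) under our own assumptions, not the authors' released implementation; it assumes a Hugging Face-style model whose forward pass returns a `.loss` when labels are in the batch, and the optimizer settings are hypothetical):

```python
import itertools
import torch

def bezier_point(sd1, sd_mid, sd2, alpha):
    """phi_theta(alpha) = (1-a)^2 * theta_C1 + 2a(1-a) * theta + a^2 * theta_C2 (Eq. 2)."""
    return {
        name: (1 - alpha) ** 2 * sd1[name]
        + 2 * alpha * (1 - alpha) * sd_mid[name]
        + alpha ** 2 * sd2[name]
        for name in sd_mid
    }

def find_curve(model, sd1, sd2, data_loader, steps=2000, lr=1e-4):
    """Optimize the control point theta so that E_{a~U(0,1)} L(phi_theta(a)) is low.
    The endpoints sd1 and sd2 stay frozen; only theta receives gradients."""
    theta = {n: ((sd1[n] + sd2[n]) / 2).clone().requires_grad_(True) for n in sd1}
    optimizer = torch.optim.Adam(list(theta.values()), lr=lr)
    for batch in itertools.islice(itertools.cycle(data_loader), steps):
        alpha = torch.rand(1).item()                  # a ~ U(0, 1)
        point = bezier_point(sd1, theta, sd2, alpha)  # weights on the curve
        # Forward pass with the interpolated weights; parameters not in `point`
        # (e.g., the frozen backbone under adapter tuning) come from `model` itself.
        loss = torch.func.functional_call(model, point, args=(), kwargs=batch).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return theta
```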
Figure 4: Linear mode connectivity analysis for two
minima trained with in-distribution MNLI data. The
results on ReCoRD are left in appendix B.4.
We visualize the performance of the interpola-
tion on the found Bezier curve in Figure 3. We
observe that the two minima are well-connected by
the found Bezier curve, without a significant perfor-
mance drop. In fact, such a low-loss Bezier curve
exists for minima reached under various different
settings (see more experiments in appendix B.3).
Given the above results, we conjecture that
there
may exist multiple loss basins which are con-
nected via a low-loss non-linear path, instead
of a linear path. For most of the minima within
the same loss basin, their convex combination
also lies in this basin
. In this sense, if two minima
are connected linearly, then both of them probably
lie in the same basin; otherwise in different basins
(e.g., the case of adapter tuning with different ini-
tialization).
Q1. (b) What are the effects of training
data?
In previous experiments, we focus on the connec-
tivity of two minima trained with the same dataset.
From now on, we extend the mode connectivity to
two minima trained on different datasets, focusing
on two facets: data overlap and data domain.
Effects of Data Overlap.
PLMs have been
demonstrated to be adept at memorizing the train-
ing data (Carlini et al.,2021,2022). To show that
the connectivity of both minima does not originate
from PLM’s memorization, we explore whether
such mode connectivity still exists when two min-
ima are obtained on data belonging to the same
distribution, but without overlap of specific train-
ing samples. Specifically, we partition the original
training data of MNLI into two equal splits. Then
we adapt two copies of T5-BASE on either split using
the same training configurations. The experiments
are conducted using both fine-tuning and adapter
tuning.
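For concreteness, a disjoint 50/50 split of this kind can be produced as below (a sketch using the Hugging Face `datasets` library; the shuffle seed is our own assumption):

```python
from datasets import load_dataset

# Shuffle once so that both halves follow the same distribution
# while sharing no training instances.
mnli_train = load_dataset("glue", "mnli", split="train").shuffle(seed=42)
half = len(mnli_train) // 2
split_a = mnli_train.select(range(half))                   # trains endpoint 1
split_b = mnli_train.select(range(half, len(mnli_train)))  # trains endpoint 2
```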