A Win-win Deal: Towards Sparse and Robust
Pre-trained Language Models
Yuanxin Liu1,2,3, Fandong Meng5, Zheng Lin1,4†, Jiangnan Li1,4, Peng Fu1, Yanan Cao1,4, Weiping Wang1, Jie Zhou5
1Institute of Information Engineering, Chinese Academy of Sciences
2MOE Key Laboratory of Computational Linguistics, Peking University
3School of Computer Science, Peking University
4School of Cyber Security, University of Chinese Academy of Sciences
5Pattern Recognition Center, WeChat AI, Tencent Inc, China
liuyuanxin@stu.pku.edu.cn, {fandongmeng,withtomzhou}@tencent.com
{linzheng,lijiangnan,fupeng,caoyanan,wangweiping}@iie.ac.cn
Abstract
Despite the remarkable success of pre-trained language models (PLMs), they still
face two challenges: First, large-scale PLMs are inefficient in terms of memory
footprint and computation. Second, on the downstream tasks, PLMs tend to rely
on the dataset bias and struggle to generalize to out-of-distribution (OOD) data.
In response to the efficiency problem, recent studies show that dense PLMs can
be replaced with sparse subnetworks without hurting the performance. Such
subnetworks can be found in three scenarios: 1) fine-tuned PLMs, 2) raw PLMs
that are then fine-tuned in isolation, and even 3) PLMs without any fine-tuning
of the pre-trained parameters. However, these results are only obtained in the
in-distribution (ID) setting. In this paper, we extend the study of PLM subnetworks
to the OOD setting, investigating whether sparsity and robustness to dataset bias
can be achieved simultaneously. To this end, we conduct extensive experiments
with the pre-trained BERT model on three natural language understanding (NLU)
tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can
consistently be found in BERT, across the aforementioned three scenarios, using
different training and compression methods. Furthermore, we explore the upper
bound of SRNets using the OOD information and show that there exist sparse and
almost unbiased BERT subnetworks. Finally, we present 1) an analytical study
that provides insights on how to promote the efficiency of the SRNet searching
process and 2) a solution to improve subnetworks' performance at high sparsity.
The code is available at https://github.com/llyx97/sparse-and-robust-PLM.
1 Introduction
Pre-trained language models (PLMs) have enjoyed impressive success in natural language processing
(NLP) tasks. However, they still face two major problems. On the one hand, the prohibitive model
size of PLMs leads to poor efficiency in terms of memory footprint and computational cost [12, 49].
On the other hand, despite being pre-trained on large-scale corpora, PLMs still tend to rely on dataset
bias [18, 37, 65, 46], i.e., spurious features of the input examples that strongly correlate with the
label during downstream fine-tuning.
Work was done when Yuanxin Liu was a graduate student of IIE, CAS.
Corresponding author: Zheng Lin.
Joint work with Pattern Recognition Center, WeChat AI, Tencent Inc, China.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
[Figure 1: schematic of the three pruning and fine-tuning paradigms]
Figure 1: Three kinds of PLM subnetworks obtained from different pruning and fine-tuning paradigms.
(a) Pruning a fine-tuned PLM. (b) Pruning the PLM and then fine-tuning the subnetwork. (c) Pruning
the PLM without fine-tuning model parameters. The obtained subnetworks are used for testing.
These two problems pose great challenges to the real-world
deployment of PLMs, and they have triggered two separate lines of work.
In terms of the efficiency problem, some recent studies resort to sparse subnetworks as alternatives to
the dense PLMs. [27, 38, 30] compress the fine-tuned PLMs in a post-hoc fashion. [4, 40, 32, 28]
extend the Lottery Ticket Hypothesis (LTH) [9] to search for PLM subnetworks that can be fine-tuned
in isolation. Taking one step further, [66] propose to learn task-specific subnetwork structures via
mask training [23, 35], without fine-tuning any pre-trained parameter. Fig. 1 illustrates these three
paradigms. Encouragingly, the empirical evidence suggests that PLMs can indeed be replaced with
sparse subnetworks without compromising the in-distribution (ID) performance.
To address the dataset bias problem, numerous debiasing methods have been proposed. A prevailing
category of debiasing methods [5, 54, 25, 20, 46, 13, 55] adjust the importance of training examples,
in terms of training loss, according to their bias degree, so as to reduce the impact of biased examples
(examples that can be correctly classified based on the spurious features). As a result, the model is
forced to rely less on the dataset bias during training and generalizes better to OOD situations.
Although progress has been made in both directions, most existing works tackle the two problems
independently. To facilitate real-world application of PLMs, the problems of robustness and efficiency
should be addressed simultaneously. Motivated by this, we extend the study of PLM subnetworks
to the OOD scenario, investigating the following question: do there exist PLM subnetworks that are
both sparse and robust against dataset bias? To answer this question, we conduct large-scale experiments
with the pre-trained BERT model [6] on three natural language understanding (NLU) tasks that
are widely studied in the context of dataset bias. We consider a variety of setups including the
three pruning and fine-tuning paradigms, standard and debiasing training objectives, different model
pruning methods, and different variants of PLMs from the BERT family. Our results show that BERT
does contain sparse and robust subnetworks (SRNets) within a certain sparsity constraint (e.g., less
than 70%), giving an affirmative answer to the above question. Compared with a standard fine-tuned
BERT, SRNets exhibit comparable ID performance and remarkable OOD improvement. When it
comes to a BERT model fine-tuned with a debiasing method, SRNets can preserve the full model's ID
and OOD performance with much fewer parameters. On this basis, we further explore the upper
bound of SRNets by making use of the OOD information, which reveals that there exist sparse and
almost unbiased subnetworks, even in a standard fine-tuned BERT that is biased.
Despite the intriguing properties of SRNets, we find, based on observations from the above experiments,
that the subnetwork searching process still has room for improvement. First, we study the timing to
start searching for SRNets during full BERT fine-tuning, and find that the overall training and searching
cost can be reduced from this perspective. Second, we refine the mask training method with gradual
sparsity increase, which is quite effective in identifying SRNets at high sparsity.
Our main contributions are summarized as follows:
• We extend the study of PLM subnetworks to the OOD scenario. To our knowledge, this paper presents the first systematic study on sparsity and dataset bias robustness for PLMs.
• We conduct extensive experiments to demonstrate the existence of sparse and robust BERT subnetworks, across different pruning and fine-tuning setups. By using the OOD information, we further reveal that there exist sparse and almost unbiased BERT subnetworks.
• We present analytical studies and solutions that can help further refine the SRNet searching process, in terms of efficiency and the performance of subnetworks at high sparsity.
2 Related Work
2.1 BERT Compression
Studies on BERT compression can be divided into two classes. The first one focuses on the design of
model compression techniques, which include pruning [15, 38, 11], knowledge distillation [44, 50, 24, 31],
parameter sharing [26], quantization [61, 64], and combining multiple techniques [51, 36, 30].
The second one, which is based on the lottery ticket hypothesis [9], investigates the compressibility
of BERT at different stages of the pre-training and fine-tuning paradigm. It has been shown that
BERT can be pruned to a sparse subnetwork after [11] and before fine-tuning [4, 40, 28, 32, 15],
without hurting the accuracy. Moreover, [66] show that directly learning subnetwork structures on the
pre-trained weights can match fine-tuning the full BERT. In this paper, we follow the second branch
of work, and extend the evaluation of BERT subnetworks to the OOD scenario.
2.2 Dataset Bias in NLP Tasks
To facilitate the development of NLP systems that truly learn the intended task solution, instead of
relying on dataset bias, many efforts have been made recently. On the one hand, challenging OOD
test sets are constructed [18, 37, 65, 46, 1] by eliminating the spurious correlations in the training
sets, in order to establish stricter evaluation. On the other hand, numerous debiasing methods
[5, 54, 25, 20, 46, 13, 55] are proposed to discourage the model from learning dataset bias during
training. However, little attention has been paid to the influence of pruning on the OOD generalization
ability of PLMs. This work presents a systematic study on this question.
2.3 Model Compression and Robustness
Some pioneering attempts have also been made to obtain models that are both compact and robust to
adversarial attacks [16, 60, 48, 10, 59] and spurious correlations [62, 8]. In particular, [59, 8] study the
compression and robustness question on PLMs. Different from [59], which is based on adversarial
robustness, we focus on spurious correlations, which are more common than worst-case
adversarial attacks. Compared with [8], which focuses on post-hoc pruning of the standard fine-tuned
BERT, we thoroughly investigate different fine-tuning methods (standard and debiasing) and
subnetworks obtained from the three pruning and fine-tuning paradigms. A more detailed discussion
of the relation and difference between our work and previous studies on model compression and
robustness is provided in Appendix D.
3 Preliminaries
3.1 BERT Architecture and Subnetworks
BERT is composed of an embedding layer, a stack of Transformer layers [56] and a task-specific
classifier. Each Transformer layer has a multi-head self-attention (MHAtt) module and a feed-forward
network (FFN). MHAtt has four kinds of weight matrices, i.e., the query, key and value matrices
$\mathbf{W}_{Q,K,V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{model}}}$, and the output matrix $\mathbf{W}_{AO} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{model}}}$. FFN consists of two linear
layers $\mathbf{W}_{\mathrm{in}} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{FFN}}}$ and $\mathbf{W}_{\mathrm{out}} \in \mathbb{R}^{d_{\mathrm{FFN}} \times d_{\mathrm{model}}}$, where $d_{\mathrm{FFN}}$ is the hidden dimension of the FFN.
To obtain a subnetwork of a model $f(\theta)$ parameterized by $\theta$, we apply a binary pruning mask
$\mathbf{m} \in \{0,1\}^{|\theta|}$ to its weight matrices, which produces $f(\mathbf{m} \odot \theta)$, where $\odot$ is the Hadamard product.
For BERT, we focus on the $L$ Transformer layers and the classifier. The parameters to be pruned are
$\theta_{\mathrm{pr}} = \{\mathbf{W}_{\mathrm{cls}}\} \cup \{\mathbf{W}^l_Q, \mathbf{W}^l_K, \mathbf{W}^l_V, \mathbf{W}^l_{AO}, \mathbf{W}^l_{\mathrm{in}}, \mathbf{W}^l_{\mathrm{out}}\}_{l=1}^{L}$, where $\mathbf{W}_{\mathrm{cls}}$ denotes the classifier weights.
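For illustration, a minimal PyTorch sketch of constructing $f(\mathbf{m} \odot \theta)$ is given below. This is our own illustration rather than the released code; the HuggingFace-style parameter names and the `masks` dictionary are assumptions.

```python
import torch

# Substrings identifying the prunable matrices theta_pr in a HuggingFace-style BERT
# (query/key/value, attention output, FFN in/out, and the classifier); naming is assumed.
PRUNABLE = ("attention.self.query.weight", "attention.self.key.weight",
            "attention.self.value.weight", "attention.output.dense.weight",
            "intermediate.dense.weight", "output.dense.weight", "classifier.weight")

def init_masks(model):
    """One all-ones binary mask per prunable weight matrix (i.e., the unpruned model)."""
    return {name: torch.ones_like(p) for name, p in model.named_parameters()
            if name.endswith(PRUNABLE)}

def apply_masks(model, masks):
    """Produce f(m ⊙ θ): zero out the pruned entries of each prunable matrix in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])   # Hadamard product m ⊙ W

def sparsity(masks):
    """Fraction of pruned (zero) entries over all masked matrices."""
    total = sum(m.numel() for m in masks.values())
    return 1.0 - sum(m.sum().item() for m in masks.values()) / total
```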
3.2 Pruning Methods
3.2.1 Magnitude-based Pruning
Magnitude-based pruning [19, 9] zeros out parameters with low absolute values. It is usually realized
in an iterative manner, namely, iterative magnitude pruning (IMP). IMP alternates between pruning
and training and gradually increases the sparsity of subnetworks. Specifically, a typical IMP algorithm
consists of four steps: (i) Training the full model to convergence. (ii) Pruning a fraction of the parameters
with the smallest magnitude. (iii) Re-training the pruned subnetwork. (iv) Repeating (ii)-(iii) until
reaching the target sparsity. To obtain subnetworks from the pre-trained BERT, i.e., (b) and (c) in
Fig. 1, the subnetwork parameters are rewound to the pre-trained values after (iii), and step (i) can be
skipped. More details about our IMP implementations can be found in Appendix A.1.1.
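The following is a minimal sketch of one possible IMP loop, assuming global magnitude-based pruning over the prunable matrices and reusing `init_masks`/`apply_masks` from above; `retrain` is a placeholder for a training loop, and the linear sparsity schedule is illustrative, not the paper's exact implementation.

```python
import torch

def prune_by_magnitude(model, masks, sparsity_target):
    """Set masks so that the `sparsity_target` fraction of prunable weights with the
    smallest absolute values (already-pruned entries count as zero) are removed."""
    scores = torch.cat([(p.abs() * masks[n]).flatten()
                        for n, p in model.named_parameters() if n in masks])
    k = max(1, int(sparsity_target * scores.numel()))
    threshold = torch.kthvalue(scores, k).values          # k-th smallest magnitude
    for n, p in model.named_parameters():
        if n in masks:
            masks[n] = (p.abs() > threshold).float()
    return masks

def imp(model, masks, retrain, target_sparsity=0.7, rounds=10, rewind_state=None):
    """IMP: alternately prune and (re-)train, gradually raising the sparsity.
    Passing `rewind_state` (the pre-trained weights) gives the rewinding variant
    used to obtain subnetworks (b)/(c) in Fig. 1."""
    for r in range(1, rounds + 1):
        masks = prune_by_magnitude(model, masks, target_sparsity * r / rounds)
        if rewind_state is not None:
            model.load_state_dict(rewind_state, strict=False)  # rewind kept weights
        apply_masks(model, masks)                              # zero the pruned entries
        retrain(model, masks)                                  # step (iii): re-train subnetwork
    return masks
```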
3.2.2 Mask Training
Mask training treats the pruning mask $\mathbf{m}$ as trainable parameters. Following [35, 66, 42, 32], we
achieve this through binarization in the forward pass and gradient estimation in the backward pass.
Each weight matrix $\mathbf{W} \in \mathbb{R}^{d_1 \times d_2}$, which is frozen during mask training, is associated with a binary
mask $\mathbf{m} \in \{0,1\}^{d_1 \times d_2}$ and a real-valued mask $\hat{\mathbf{m}} \in \mathbb{R}^{d_1 \times d_2}$. In the forward pass, $\mathbf{W}$ is replaced
with $\mathbf{m} \odot \mathbf{W}$, where $\mathbf{m}$ is derived from $\hat{\mathbf{m}}$ through binarization:
$$
\mathbf{m}_{i,j} =
\begin{cases}
1 & \text{if } \hat{\mathbf{m}}_{i,j} \geq \phi \\
0 & \text{otherwise}
\end{cases}
\quad (1)
$$
where $\phi$ is the threshold. In the backward pass, since the binarization operation is not differentiable,
we use the straight-through estimator [3] to compute the gradients for $\hat{\mathbf{m}}$ using the gradients of $\mathbf{m}$,
i.e., $\frac{\partial \mathcal{L}}{\partial \mathbf{m}}$, where $\mathcal{L}$ is the loss. Then, $\hat{\mathbf{m}}$ is updated as $\hat{\mathbf{m}} \leftarrow \hat{\mathbf{m}} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{m}}$, where $\eta$ is the learning rate.
Following [42, 32], we initialize the real-valued masks according to the magnitude of the original
weights. The complete mask training algorithm is summarized in Appendix A.1.2.
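A minimal PyTorch sketch of this scheme is given below (our illustration, not the released code): the binarization of Eq. (1) runs in the forward pass, and the straight-through estimator copies the gradient of $\mathbf{m}$ onto $\hat{\mathbf{m}}$. The threshold value, the simple magnitude-based initialization of $\hat{\mathbf{m}}$, and the module names are assumptions.

```python
import torch
import torch.nn.functional as F

class Binarize(torch.autograd.Function):
    """Forward: m = 1[m_hat >= phi].  Backward: straight-through estimator,
    i.e., the gradient w.r.t. m_hat is taken to be the gradient w.r.t. m."""
    @staticmethod
    def forward(ctx, m_hat, phi):
        return (m_hat >= phi).to(m_hat.dtype)

    @staticmethod
    def backward(ctx, grad_m):
        return grad_m, None  # pass the gradient straight through; no gradient for phi

class MaskedLinear(torch.nn.Module):
    """A linear layer whose frozen weight W is replaced by m ⊙ W during mask training."""
    def __init__(self, linear: torch.nn.Linear, phi: float = 0.01):
        super().__init__()
        self.register_buffer("weight", linear.weight.detach().clone())  # frozen W
        self.register_buffer("bias", linear.bias.detach().clone()
                             if linear.bias is not None else None)
        self.phi = phi
        # Initialize the real-valued mask from the weight magnitudes (Sec. 3.2.2).
        self.m_hat = torch.nn.Parameter(self.weight.abs().clone())

    def forward(self, x):
        m = Binarize.apply(self.m_hat, self.phi)        # Eq. (1)
        return F.linear(x, m * self.weight, self.bias)  # (m ⊙ W) x + b
```

During mask training, only the real-valued masks (and, where applicable, the classifier) receive gradient updates, while the pre-trained weights stay frozen as buffers.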
3.3 Debiasing Methods
As described in the Introduction, the debiasing methods measure the bias degree of training examples.
This is achieved by training a bias model. The inputs to the bias model are hand-crafted spurious
features based on our prior knowledge of the dataset bias (Section 4.1.3 describes the details). In this
way, the bias model mainly relies on the spurious features to make predictions, which can then serve as
a measurement of the bias degree. Specifically, given the bias model prediction $\mathbf{p}_b = (p^1_b, \cdots, p^K_b)$
over the $K$ classes, the bias degree is $\beta = p^c_b$, i.e., the probability of the ground-truth class $c$.
Then, $\beta$ can be used to adjust the training loss in several ways, including product-of-experts (PoE)
[5, 20, 25], example reweighting [46, 13] and confidence regularization [54]. Here we describe the
standard cross-entropy and PoE, and the other two methods are introduced in Appendix A.2.
Standard Cross-Entropy computes the cross-entropy between the predicted distribution $\mathbf{p}_m$ and
the ground-truth one-hot distribution $\mathbf{y}$ as $\mathcal{L}_{\mathrm{std}} = -\mathbf{y} \cdot \log \mathbf{p}_m$.
Product-of-Experts combines the predictions of the main model and the bias model, i.e., $\mathbf{p}_m$ and $\mathbf{p}_b$, and
then computes the training loss as $\mathcal{L}_{\mathrm{poe}} = -\mathbf{y} \cdot \log \mathrm{softmax}(\log \mathbf{p}_m + \log \mathbf{p}_b)$.
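As a concrete illustration, a minimal PyTorch implementation of the PoE loss and the bias degree could look as follows (our sketch; the function names and the assumption that the bias model outputs class probabilities are ours):

```python
import torch
import torch.nn.functional as F

def poe_loss(main_logits, bias_probs, labels):
    """L_poe = -y · log softmax(log p_m + log p_b), averaged over the batch.

    main_logits: [batch, K] raw scores of the main model
    bias_probs:  [batch, K] class probabilities p_b from the (frozen) bias model
    labels:      [batch] ground-truth class indices
    """
    log_pm = F.log_softmax(main_logits, dim=-1)      # log p_m
    log_pb = torch.log(bias_probs.clamp_min(1e-12))  # log p_b
    # cross_entropy applies log-softmax internally, which realizes the outer softmax.
    return F.cross_entropy(log_pm + log_pb, labels)

def bias_degree(bias_probs, labels):
    """beta = p_b^c: the bias model's probability of the ground-truth class."""
    return bias_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
```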
3.4 Notations
Here we define some notations, which will be used in the following sections.
• $\mathcal{A}^t_{\mathcal{L}}(f(\theta))$: Training $f(\theta)$ with loss $\mathcal{L}$ for $t$ steps, where $t$ can be omitted for simplicity.
• $\mathcal{P}^p_{\mathcal{L}}(f(\theta))$: Pruning $f(\theta)$ using pruning method $p$ and training loss $\mathcal{L}$.
• $\mathcal{M}(f(\mathbf{m} \odot \theta))$: Extracting the pruning mask of $f(\mathbf{m} \odot \theta)$, i.e., $\mathcal{M}(f(\mathbf{m} \odot \theta)) = \mathbf{m}$.
• $\mathcal{L} \in \{\mathcal{L}_{\mathrm{std}}, \mathcal{L}_{\mathrm{poe}}, \mathcal{L}_{\mathrm{reweight}}, \mathcal{L}_{\mathrm{confreg}}\}$ and $p \in \{\text{imp}, \text{imp-rw}, \text{mask}\}$, where "imp" and "imp-rw" denote the standard IMP and IMP with weight rewinding, as described in Section 3.2.1, and "mask" stands for mask training.
• $\mathcal{E}_d(f(\theta))$: Evaluating $f(\theta)$ on the test data with distribution $d \in \{\mathrm{ID}, \mathrm{OOD}\}$.
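As a worked illustration of this notation (our own example, not one taken from the paper): obtaining a subnetwork from a standardly fine-tuned BERT (paradigm (a) in Fig. 1) by mask training with the PoE objective, and then evaluating it on OOD data, can be written as
$$
\mathcal{E}_{\mathrm{OOD}}\Big(\mathcal{P}^{\mathrm{mask}}_{\mathcal{L}_{\mathrm{poe}}}\big(\mathcal{A}_{\mathcal{L}_{\mathrm{std}}}(f(\theta_{\mathrm{pt}}))\big)\Big),
$$
where $\theta_{\mathrm{pt}}$ denotes the pre-trained parameters.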
4 Sparse and Robust BERT Subnetworks
4.1 Experimental Setups
4.1.1 Datasets and Evaluation
Natural Language Inference
We use MNLI [57] as the ID dataset for NLI. MNLI comprises
premise-hypothesis pairs, whose relationship may be entailment, contradiction, or neutral. In MNLI,
the word overlap between premise and hypothesis is strongly correlated with the entailment class. To
address this problem, the OOD HANS dataset [37] is built so that such correlation does not hold.
Paraphrase Identification
The ID dataset for paraphrase identification is QQP$^4$, which contains
question pairs that are labelled as either duplicate or non-duplicate. In QQP, high lexical overlap is
also strongly associated with the duplicate class. The OOD datasets PAWS-qqp and PAWS-wiki [65]
are built from sentences in Quora and Wikipedia, respectively. In PAWS, sentence pairs with high
word overlap have a balanced distribution over duplicate and non-duplicate.
Fact Verification
FEVER$^5$ [52] is adopted as the ID dataset for fact verification, where the task is
to assess whether a given piece of evidence supports or refutes the claim, or whether there is not-enough-info
to reach a conclusion. The OOD dataset Fever-Symmetric (v1 and v2) [46] is proposed to evaluate
the influence of the claim-only bias (the label can be predicted correctly without the evidence).
For NLI and fact verification, we use Accuracy as the evaluation metric. For paraphrase identification,
we evaluate using the F1 score. More details of datasets and evaluation are shown in Appendix B.1.
4.1.2 PLM Backbone
We mainly experiment with the BERT-base-uncased model [6]. It has roughly 110M parameters in
total, and 84M parameters in the Transformer layers. As described in Section 3.1, we derive the
subnetworks from the Transformer layers and report sparsity levels relative to the 84M parameters.
To generalize our conclusions to other PLMs, we also consider two variants of the BERT family,
namely RoBERTa-base and BERT-large, the results of which can be found in Appendix C.5.
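For example (our arithmetic, following the setup above), a 70% sparse subnetwork keeps roughly $0.3 \times 84\mathrm{M} \approx 25\mathrm{M}$ Transformer parameters, while the embedding layer is not pruned and remains intact.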
4.1.3 Training Details
Following [5], we use a simple linear classifier as the bias model. For HANS and PAWS, the spurious
features are based on the word overlap information between the two input text sequences.
For Fever-Symmetric, the spurious features are max-pooled word embeddings of the claim sentence.
More details about the bias model and the spurious features are presented in Appendix B.3.1.
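For illustration, overlap-based spurious features of the kind fed to such a bias model might be computed as follows; this is an illustrative sketch, and the exact feature set used in our experiments is the one specified in Appendix B.3.1.

```python
def overlap_features(premise_tokens, hypothesis_tokens):
    """Illustrative hand-crafted overlap features for a premise/hypothesis (or question) pair."""
    p, h = set(premise_tokens), set(hypothesis_tokens)
    shared = p & h
    return [
        len(shared) / max(len(h), 1),  # fraction of hypothesis words found in the premise
        len(shared) / max(len(p), 1),  # fraction of premise words found in the hypothesis
        float(h.issubset(p)),          # whether the hypothesis is fully contained in the premise
    ]

# Example: a high-overlap pair that an overlap-based bias model would tend to
# classify as "entailment" (NLI) or "duplicate" (paraphrase identification).
feats = overlap_features("the doctor visited the lawyer".split(),
                         "the lawyer visited the doctor".split())
```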
Mask training and IMP basically use the same hyper-parameters (adopted from [55]) as full BERT fine-tuning.
One exception is that we train longer, because we find that good subnetworks at high sparsity levels require
more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID
dev performance, without using OOD information. All the reported results are averaged over 4 runs.
We defer training details about each dataset, and each training and pruning setup, to Appendix B.3.
4.2 Subnetworks from Fine-tuned BERT
4.2.1 Problem Formulation and Experimental Setups
Given the fine-tuned full BERT $f(\theta_{\mathrm{ft}}) = \mathcal{A}_{\mathcal{L}_1}(f(\theta_{\mathrm{pt}}))$, where $\theta_{\mathrm{pt}}$ and $\theta_{\mathrm{ft}}$ are the pre-trained and
fine-tuned parameters respectively, the goal is to find a subnetwork $f(\mathbf{m} \odot \theta'_{\mathrm{ft}}) = \mathcal{P}^p_{\mathcal{L}_2}(f(\theta_{\mathrm{ft}}))$ that
4https://www.kaggle.com/c/quora-question-pairs
5See the licence information at https://fever.ai/download/fever/license.html