A Win-win Deal: Towards Sparse and Robust
Pre-trained Language Models
Yuanxin Liu1,2,3, Fandong Meng5, Zheng Lin1,4†, Jiangnan Li1,4, Peng Fu1, Yanan Cao1,4, Weiping Wang1, Jie Zhou5
1Institute of Information Engineering, Chinese Academy of Sciences
2MOE Key Laboratory of Computational Linguistics, Peking University
3School of Computer Science, Peking University
4School of Cyber Security, University of Chinese Academy of Sciences
5Pattern Recognition Center, WeChat AI, Tencent Inc, China
liuyuanxin@stu.pku.edu.cn, {fandongmeng,withtomzhou}@tencent.com
{linzheng,lijiangnan,fupeng,caoyanan,wangweiping}@iie.ac.cn
Abstract
Despite the remarkable success of pre-trained language models (PLMs), they still
face two challenges: First, large-scale PLMs are inefficient in terms of memory
footprint and computation. Second, on the downstream tasks, PLMs tend to rely
on the dataset bias and struggle to generalize to out-of-distribution (OOD) data.
In response to the efficiency problem, recent studies show that dense PLMs can
be replaced with sparse subnetworks without hurting the performance. Such
subnetworks can be found in three scenarios: 1) fine-tuned PLMs, 2) raw PLMs
that are then fine-tuned in isolation, and even 3) PLMs without any fine-tuning
of the pre-trained parameters. However, these results are only obtained in the
in-distribution (ID) setting. In this paper, we extend the study of PLM subnetworks
to the OOD setting, investigating whether sparsity and robustness to dataset bias
can be achieved simultaneously. To this end, we conduct extensive experiments
with the pre-trained BERT model on three natural language understanding (NLU)
tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can
consistently be found in BERT, across the aforementioned three scenarios, using
different training and compression methods. Furthermore, we explore the upper
bound of SRNets using the OOD information and show that there exist sparse and
almost unbiased BERT subnetworks. Finally, we present 1) an analytical study
that provides insights on how to promote the efficiency of the SRNet searching
process and 2) a solution to improve subnetworks' performance at high sparsity.
The code is available at https://github.com/llyx97/sparse-and-robust-PLM.
1 Introduction
Pre-trained language models (PLMs) have enjoyed impressive success in natural language processing
(NLP) tasks. However, they still face two major problems. On the one hand, the prohibitive model
size of PLMs leads to poor efficiency in terms of memory footprint and computational cost [12, 49].
On the other hand, despite being pre-trained on large-scale corpora, PLMs still tend to rely on dataset
bias [18, 37, 65, 46], i.e., spurious features of the input examples that strongly correlate with the
label during downstream fine-tuning.
Work was done when Yuanxin Liu was a graduate student of IIE, CAS.
Corresponding author: Zheng Lin.
Joint work with Pattern Recognition Center, WeChat AI, Tencent Inc, China.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
[Figure 1: schematic of the three pruning and fine-tuning paradigms]
Figure 1: Three kinds of PLM subnetworks obtained from different pruning and fine-tuning paradigms.
(a) Pruning a fine-tuned PLM. (b) Pruning the PLM and then fine-tuning the subnetwork. (c) Pruning
the PLM without fine-tuning model parameters. The obtained subnetworks are used for testing.
These two problems pose great challenges to the real-world
deployment of PLMs, and they have triggered two separate lines of work.
In terms of the efficiency problem, some recent studies resort to sparse subnetworks as alternatives to
the dense PLMs. [27, 38, 30] compress the fine-tuned PLMs in a post-hoc fashion. [4, 40, 32, 28]
extend the Lottery Ticket Hypothesis (LTH) [9] to search for PLM subnetworks that can be fine-tuned
in isolation. Taking one step further, [66] propose to learn task-specific subnetwork structures via
mask training [23, 35], without fine-tuning any pre-trained parameter. Fig. 1 illustrates these three
paradigms. Encouragingly, the empirical evidence suggests that PLMs can indeed be replaced with
sparse subnetworks without compromising the in-distribution (ID) performance.
To address the dataset bias problem, numerous debiasing methods have been proposed. A prevailing
category of debiasing methods [5, 54, 25, 20, 46, 13, 55] adjust the importance of training examples,
in terms of training loss, according to their bias degree, so as to reduce the impact of biased examples
(examples that can be correctly classified based on the spurious features). As a result, the model is
forced to rely less on the dataset bias during training and generalizes better to OOD situations.
Although progress has been made in both directions, most existing works tackle the two problems
independently. To facilitate real-world application of PLMs, the problems of robustness and efficiency
should be addressed simultaneously. Motivated by this, we extend the study of PLM subnetworks
to the OOD scenario, investigating the following question: do there exist PLM subnetworks that are
both sparse and robust against dataset bias? To answer this question, we conduct large-scale experiments
with the pre-trained BERT model [6] on three natural language understanding (NLU) tasks that
are widely studied in the context of dataset bias. We consider a variety of setups including the
three pruning and fine-tuning paradigms, standard and debiasing training objectives, different model
pruning methods, and different variants of PLMs from the BERT family. Our results show that BERT
does contain sparse and robust subnetworks (SRNets) within a certain sparsity constraint (e.g., less
than 70%), giving an affirmative answer to the above question. Compared with a standard fine-tuned
BERT, SRNets exhibit comparable ID performance and remarkable OOD improvement. When it
comes to a BERT model fine-tuned with a debiasing method, SRNets can preserve the full model's ID
and OOD performance with much fewer parameters. On this basis, we further explore the upper
bound of SRNets by making use of the OOD information, which reveals that there exist sparse and
almost unbiased subnetworks, even in a standard fine-tuned BERT that is biased.
Despite the intriguing properties of SRNets, we find, based on observations from the above experiments,
that the subnetwork searching process still has room for improvement. First, we study the timing to
start searching for SRNets during full BERT fine-tuning, and find that the overall training and searching
cost can be reduced from this perspective. Second, we refine the mask training method with gradual
sparsity increase, which is quite effective in identifying SRNets at high sparsity.
Our main contributions are summarized as follows:
• We extend the study of PLM subnetworks to the OOD scenario. To our knowledge, this paper presents the first systematic study on sparsity and dataset bias robustness for PLMs.
• We conduct extensive experiments to demonstrate the existence of sparse and robust BERT subnetworks, across different pruning and fine-tuning setups. By using the OOD information, we further reveal that there exist sparse and almost unbiased BERT subnetworks.
• We present analytical studies and solutions that can help further refine the SRNet searching process, in terms of efficiency and the performance of subnetworks at high sparsity.
2 Related Work
2.1 BERT Compression
Studies on BERT compression can be divided into two classes. The first one focuses on the design of
model compression techniques, which include pruning [15, 38, 11], knowledge distillation [44, 50, 24, 31],
parameter sharing [26], quantization [61, 64], and combining multiple techniques [51, 36, 30].
The second one, which is based on the lottery ticket hypothesis [9], investigates the compressibility
of BERT at different stages of the pre-training and fine-tuning paradigm. It has been shown that
BERT can be pruned to a sparse subnetwork after [11] and before fine-tuning [4, 40, 28, 32, 15],
without hurting the accuracy. Moreover, [66] show that directly learning subnetwork structures on the
pre-trained weights can match fine-tuning the full BERT. In this paper, we follow the second branch
of work, and extend the evaluation of BERT subnetworks to the OOD scenario.
2.2 Dataset Bias in NLP Tasks
To facilitate the development of NLP systems that truly learn the intended task solution, instead of
relying on dataset bias, many efforts have been made recently. On the one hand, challenging OOD
test sets are constructed [18, 37, 65, 46, 1] by eliminating the spurious correlations in the training
sets, in order to establish stricter evaluation. On the other hand, numerous debiasing methods
[5, 54, 25, 20, 46, 13, 55] are proposed to discourage the model from learning dataset bias during
training. However, little attention has been paid to the influence of pruning on the OOD generalization
ability of PLMs. This work presents a systematic study on this question.
2.3 Model Compression and Robustness
Some pioneering attempts have also been made to obtain models that are both compact and robust to
adversarial attacks [16, 60, 48, 10, 59] and spurious correlations [62, 8]. In particular, [59, 8] study the
compression and robustness question on PLMs. Different from [59], which is based on adversarial
robustness, we focus on spurious correlations, which are more common than worst-case
adversarial attacks. Compared with [8], which focuses on post-hoc pruning of the standard fine-tuned
BERT, we thoroughly investigate different fine-tuning methods (standard and debiasing) and
subnetworks obtained from the three pruning and fine-tuning paradigms. A more detailed discussion
of the relation and difference between our work and previous studies on model compression and
robustness is provided in Appendix D.
3 Preliminaries
3.1 BERT Architecture and Subnetworks
BERT is composed of an embedding layer, a stack of Transformer layers [56] and a task-specific
classifier. Each Transformer layer has a multi-head self-attention (MHAtt) module and a feed-forward
network (FFN). MHAtt has four kinds of weight matrices, i.e., the query, key and value matrices
$\mathbf{W}_{Q,K,V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{model}}}$, and the output matrix $\mathbf{W}_{AO} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{model}}}$. FFN consists of two linear
layers $\mathbf{W}_{\mathrm{in}} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{FFN}}}$ and $\mathbf{W}_{\mathrm{out}} \in \mathbb{R}^{d_{\mathrm{FFN}} \times d_{\mathrm{model}}}$, where $d_{\mathrm{FFN}}$ is the hidden dimension of the FFN.
To obtain a subnetwork of a model $f(\theta)$ parameterized by $\theta$, we apply a binary pruning mask
$\mathbf{m} \in \{0,1\}^{|\theta|}$ to its weight matrices, which produces $f(\mathbf{m} \odot \theta)$, where $\odot$ is the Hadamard product.
For BERT, we focus on the $L$ Transformer layers and the classifier. The parameters to be pruned are
$\theta_{\mathrm{pr}} = \{\mathbf{W}_{\mathrm{cls}}\} \cup \{\mathbf{W}^l_Q, \mathbf{W}^l_K, \mathbf{W}^l_V, \mathbf{W}^l_{AO}, \mathbf{W}^l_{\mathrm{in}}, \mathbf{W}^l_{\mathrm{out}}\}_{l=1}^{L}$, where $\mathbf{W}_{\mathrm{cls}}$ denotes the classifier weights.
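For illustration, a minimal PyTorch sketch of constructing $f(\mathbf{m} \odot \theta)$ is given below. This is our own illustration rather than the released code; the HuggingFace-style parameter names and the `masks` dictionary are assumptions.

```python
import torch

# Substrings identifying the prunable matrices theta_pr in a HuggingFace-style BERT
# (query/key/value, attention output, FFN in/out, and the classifier); naming is assumed.
PRUNABLE = ("attention.self.query.weight", "attention.self.key.weight",
            "attention.self.value.weight", "attention.output.dense.weight",
            "intermediate.dense.weight", "output.dense.weight", "classifier.weight")

def init_masks(model):
    """One all-ones binary mask per prunable weight matrix (i.e., the unpruned model)."""
    return {name: torch.ones_like(p) for name, p in model.named_parameters()
            if name.endswith(PRUNABLE)}

def apply_masks(model, masks):
    """Produce f(m ⊙ θ): zero out the pruned entries of each prunable matrix in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])   # Hadamard product m ⊙ W

def sparsity(masks):
    """Fraction of pruned (zero) entries over all masked matrices."""
    total = sum(m.numel() for m in masks.values())
    return 1.0 - sum(m.sum().item() for m in masks.values()) / total
```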
3.2 Pruning Methods
3.2.1 Magnitude-based Pruning
Magnitude-based pruning [19, 9] zeros out parameters with low absolute values. It is usually realized
in an iterative manner, namely, iterative magnitude pruning (IMP). IMP alternates between pruning
and training and gradually increases the sparsity of subnetworks. Specifically, a typical IMP algorithm
consists of four steps: (i) Training the full model to convergence. (ii) Pruning a fraction of the parameters
with the smallest magnitude. (iii) Re-training the pruned subnetwork. (iv) Repeating (ii)-(iii) until
reaching the target sparsity. To obtain subnetworks from the pre-trained BERT, i.e., (b) and (c) in
Fig. 1, the subnetwork parameters are rewound to the pre-trained values after (iii), and step (i) can be
skipped. More details about our IMP implementations can be found in Appendix A.1.1.
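The following is a minimal sketch of one possible IMP loop, assuming global magnitude-based pruning over the prunable matrices and reusing `init_masks`/`apply_masks` from above; `retrain` is a placeholder for a training loop, and the linear sparsity schedule is illustrative, not the paper's exact implementation.

```python
import torch

def prune_by_magnitude(model, masks, sparsity_target):
    """Set masks so that the `sparsity_target` fraction of prunable weights with the
    smallest absolute values (already-pruned entries count as zero) are removed."""
    scores = torch.cat([(p.abs() * masks[n]).flatten()
                        for n, p in model.named_parameters() if n in masks])
    k = max(1, int(sparsity_target * scores.numel()))
    threshold = torch.kthvalue(scores, k).values          # k-th smallest magnitude
    for n, p in model.named_parameters():
        if n in masks:
            masks[n] = (p.abs() > threshold).float()
    return masks

def imp(model, masks, retrain, target_sparsity=0.7, rounds=10, rewind_state=None):
    """IMP: alternately prune and (re-)train, gradually raising the sparsity.
    Passing `rewind_state` (the pre-trained weights) gives the rewinding variant
    used to obtain subnetworks (b)/(c) in Fig. 1."""
    for r in range(1, rounds + 1):
        masks = prune_by_magnitude(model, masks, target_sparsity * r / rounds)
        if rewind_state is not None:
            model.load_state_dict(rewind_state, strict=False)  # rewind kept weights
        apply_masks(model, masks)                              # zero the pruned entries
        retrain(model, masks)                                  # step (iii): re-train subnetwork
    return masks
```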
3.2.2 Mask Training
Mask training treats the pruning mask $\mathbf{m}$ as trainable parameters. Following [35, 66, 42, 32], we
achieve this through binarization in the forward pass and gradient estimation in the backward pass.
Each weight matrix $\mathbf{W} \in \mathbb{R}^{d_1 \times d_2}$, which is frozen during mask training, is associated with a binary
mask $\mathbf{m} \in \{0,1\}^{d_1 \times d_2}$ and a real-valued mask $\hat{\mathbf{m}} \in \mathbb{R}^{d_1 \times d_2}$. In the forward pass, $\mathbf{W}$ is replaced
with $\mathbf{m} \odot \mathbf{W}$, where $\mathbf{m}$ is derived from $\hat{\mathbf{m}}$ through binarization:
$$
\mathbf{m}_{i,j} =
\begin{cases}
1 & \text{if } \hat{\mathbf{m}}_{i,j} \geq \phi \\
0 & \text{otherwise}
\end{cases}
\quad (1)
$$
where $\phi$ is the threshold. In the backward pass, since the binarization operation is not differentiable,
we use the straight-through estimator [3] to compute the gradients for $\hat{\mathbf{m}}$ using the gradients of $\mathbf{m}$,
i.e., $\frac{\partial \mathcal{L}}{\partial \mathbf{m}}$, where $\mathcal{L}$ is the loss. Then, $\hat{\mathbf{m}}$ is updated as $\hat{\mathbf{m}} \leftarrow \hat{\mathbf{m}} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{m}}$, where $\eta$ is the learning rate.
Following [42, 32], we initialize the real-valued masks according to the magnitude of the original
weights. The complete mask training algorithm is summarized in Appendix A.1.2.
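A minimal PyTorch sketch of this scheme is given below (our illustration, not the released code): the binarization of Eq. (1) runs in the forward pass, and the straight-through estimator copies the gradient of $\mathbf{m}$ onto $\hat{\mathbf{m}}$. The threshold value, the simple magnitude-based initialization of $\hat{\mathbf{m}}$, and the module names are assumptions.

```python
import torch
import torch.nn.functional as F

class Binarize(torch.autograd.Function):
    """Forward: m = 1[m_hat >= phi].  Backward: straight-through estimator,
    i.e., the gradient w.r.t. m_hat is taken to be the gradient w.r.t. m."""
    @staticmethod
    def forward(ctx, m_hat, phi):
        return (m_hat >= phi).to(m_hat.dtype)

    @staticmethod
    def backward(ctx, grad_m):
        return grad_m, None  # pass the gradient straight through; no gradient for phi

class MaskedLinear(torch.nn.Module):
    """A linear layer whose frozen weight W is replaced by m ⊙ W during mask training."""
    def __init__(self, linear: torch.nn.Linear, phi: float = 0.01):
        super().__init__()
        self.register_buffer("weight", linear.weight.detach().clone())  # frozen W
        self.register_buffer("bias", linear.bias.detach().clone()
                             if linear.bias is not None else None)
        self.phi = phi
        # Initialize the real-valued mask from the weight magnitudes (Sec. 3.2.2).
        self.m_hat = torch.nn.Parameter(self.weight.abs().clone())

    def forward(self, x):
        m = Binarize.apply(self.m_hat, self.phi)        # Eq. (1)
        return F.linear(x, m * self.weight, self.bias)  # (m ⊙ W) x + b
```

During mask training, only the real-valued masks (and, where applicable, the classifier) receive gradient updates, while the pre-trained weights stay frozen as buffers.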
3.3 Debiasing Methods
As described in the Introduction, the debiasing methods measure the bias degree of training examples.
This is achieved by training a bias model. The inputs to the bias model are hand-crafted spurious
features based on our prior knowledge of the dataset bias (Section 4.1.3 describes the details). In this
way, the bias model mainly relies on the spurious features to make predictions, which can then serve as
a measurement of the bias degree. Specifically, given the bias model prediction $\mathbf{p}_b = (p^1_b, \cdots, p^K_b)$
over the $K$ classes, the bias degree is $\beta = p^c_b$, i.e., the probability of the ground-truth class $c$.
Then, $\beta$ can be used to adjust the training loss in several ways, including product-of-experts (PoE)
[5, 20, 25], example reweighting [46, 13] and confidence regularization [54]. Here we describe the
standard cross-entropy and PoE, and the other two methods are introduced in Appendix A.2.
Standard Cross-Entropy computes the cross-entropy between the predicted distribution $\mathbf{p}_m$ and
the ground-truth one-hot distribution $\mathbf{y}$ as $\mathcal{L}_{\mathrm{std}} = -\mathbf{y} \cdot \log \mathbf{p}_m$.
Product-of-Experts combines the predictions of the main model and the bias model, i.e., $\mathbf{p}_m$ and $\mathbf{p}_b$, and
then computes the training loss as $\mathcal{L}_{\mathrm{poe}} = -\mathbf{y} \cdot \log \mathrm{softmax}(\log \mathbf{p}_m + \log \mathbf{p}_b)$.
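As a concrete illustration, a minimal PyTorch implementation of the PoE loss and the bias degree could look as follows (our sketch; the function names and the assumption that the bias model outputs class probabilities are ours):

```python
import torch
import torch.nn.functional as F

def poe_loss(main_logits, bias_probs, labels):
    """L_poe = -y · log softmax(log p_m + log p_b), averaged over the batch.

    main_logits: [batch, K] raw scores of the main model
    bias_probs:  [batch, K] class probabilities p_b from the (frozen) bias model
    labels:      [batch] ground-truth class indices
    """
    log_pm = F.log_softmax(main_logits, dim=-1)      # log p_m
    log_pb = torch.log(bias_probs.clamp_min(1e-12))  # log p_b
    # cross_entropy applies log-softmax internally, which realizes the outer softmax.
    return F.cross_entropy(log_pm + log_pb, labels)

def bias_degree(bias_probs, labels):
    """beta = p_b^c: the bias model's probability of the ground-truth class."""
    return bias_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
```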
3.4 Notations
Here we define some notations, which will be used in the following sections.
• $\mathcal{A}^t_{\mathcal{L}}(f(\theta))$: Training $f(\theta)$ with loss $\mathcal{L}$ for $t$ steps, where $t$ can be omitted for simplicity.
• $\mathcal{P}^p_{\mathcal{L}}(f(\theta))$: Pruning $f(\theta)$ using pruning method $p$ and training loss $\mathcal{L}$.
• $\mathcal{M}(f(\mathbf{m} \odot \theta))$: Extracting the pruning mask of $f(\mathbf{m} \odot \theta)$, i.e., $\mathcal{M}(f(\mathbf{m} \odot \theta)) = \mathbf{m}$.
• $\mathcal{L} \in \{\mathcal{L}_{\mathrm{std}}, \mathcal{L}_{\mathrm{poe}}, \mathcal{L}_{\mathrm{reweight}}, \mathcal{L}_{\mathrm{confreg}}\}$ and $p \in \{\text{imp}, \text{imp-rw}, \text{mask}\}$, where "imp" and "imp-rw" denote the standard IMP and IMP with weight rewinding, as described in Section 3.2.1, and "mask" stands for mask training.
• $\mathcal{E}_d(f(\theta))$: Evaluating $f(\theta)$ on the test data with distribution $d \in \{\mathrm{ID}, \mathrm{OOD}\}$.
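As a worked illustration of this notation (our own example, not one taken from the paper): obtaining a subnetwork from a standardly fine-tuned BERT (paradigm (a) in Fig. 1) by mask training with the PoE objective, and then evaluating it on OOD data, can be written as
$$
\mathcal{E}_{\mathrm{OOD}}\Big(\mathcal{P}^{\mathrm{mask}}_{\mathcal{L}_{\mathrm{poe}}}\big(\mathcal{A}_{\mathcal{L}_{\mathrm{std}}}(f(\theta_{\mathrm{pt}}))\big)\Big),
$$
where $\theta_{\mathrm{pt}}$ denotes the pre-trained parameters.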
4 Sparse and Robust BERT Subnetworks
4.1 Experimental Setups
4.1.1 Datasets and Evaluation
Natural Language Inference
We use MNLI [57] as the ID dataset for NLI. MNLI comprises
premise-hypothesis pairs, whose relationship may be entailment, contradiction, or neutral. In MNLI,
the word overlap between premise and hypothesis is strongly correlated with the entailment class. To
address this problem, the OOD HANS dataset [37] is built so that such correlation does not hold.
Paraphrase Identification
The ID dataset for paraphrase identification is QQP$^4$, which contains
question pairs that are labelled as either duplicate or non-duplicate. In QQP, high lexical overlap is
also strongly associated with the duplicate class. The OOD datasets PAWS-qqp and PAWS-wiki [65]
are built from sentences in Quora and Wikipedia, respectively. In PAWS, sentence pairs with high
word overlap have a balanced distribution over duplicate and non-duplicate.
Fact Verification
FEVER$^5$ [52] is adopted as the ID dataset for fact verification, where the task is
to assess whether a given piece of evidence supports or refutes the claim, or whether there is not-enough-info
to reach a conclusion. The OOD dataset Fever-Symmetric (v1 and v2) [46] is proposed to evaluate
the influence of the claim-only bias (the label can be predicted correctly without the evidence).
For NLI and fact verification, we use Accuracy as the evaluation metric. For paraphrase identification,
we evaluate using the F1 score. More details of datasets and evaluation are shown in Appendix B.1.
4.1.2 PLM Backbone
We mainly experiment with the BERT-base-uncased model [6]. It has roughly 110M parameters in
total, and 84M parameters in the Transformer layers. As described in Section 3.1, we derive the
subnetworks from the Transformer layers and report sparsity levels relative to the 84M parameters.
To generalize our conclusions to other PLMs, we also consider two variants of the BERT family,
namely RoBERTa-base and BERT-large, the results of which can be found in Appendix C.5.
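For example (our arithmetic, following the setup above), a 70% sparse subnetwork keeps roughly $0.3 \times 84\mathrm{M} \approx 25\mathrm{M}$ Transformer parameters, while the embedding layer is not pruned and remains intact.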
4.1.3 Training Details
Following [5], we use a simple linear classifier as the bias model. For HANS and PAWS, the spurious
features are based on the word overlap information between the two input text sequences.
For Fever-Symmetric, the spurious features are max-pooled word embeddings of the claim sentence.
More details about the bias model and the spurious features are presented in Appendix B.3.1.
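For illustration, overlap-based spurious features of the kind fed to such a bias model might be computed as follows; this is an illustrative sketch, and the exact feature set used in our experiments is the one specified in Appendix B.3.1.

```python
def overlap_features(premise_tokens, hypothesis_tokens):
    """Illustrative hand-crafted overlap features for a premise/hypothesis (or question) pair."""
    p, h = set(premise_tokens), set(hypothesis_tokens)
    shared = p & h
    return [
        len(shared) / max(len(h), 1),  # fraction of hypothesis words found in the premise
        len(shared) / max(len(p), 1),  # fraction of premise words found in the hypothesis
        float(h.issubset(p)),          # whether the hypothesis is fully contained in the premise
    ]

# Example: a high-overlap pair that an overlap-based bias model would tend to
# classify as "entailment" (NLI) or "duplicate" (paraphrase identification).
feats = overlap_features("the doctor visited the lawyer".split(),
                         "the lawyer visited the doctor".split())
```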
Mask training and IMP basically use the same hyper-parameters (adopted from [55]) as full BERT fine-tuning.
One exception is that we train longer, because we find that good subnetworks at high sparsity levels require
more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID
dev performance, without using OOD information. All the reported results are averaged over 4 runs.
We defer training details about each dataset, and each training and pruning setup, to Appendix B.3.
4.2 Subnetworks from Fine-tuned BERT
4.2.1 Problem Formulation and Experimental Setups
Given the fine-tuned full BERT $f(\theta_{\mathrm{ft}}) = \mathcal{A}_{\mathcal{L}_1}(f(\theta_{\mathrm{pt}}))$, where $\theta_{\mathrm{pt}}$ and $\theta_{\mathrm{ft}}$ are the pre-trained and
fine-tuned parameters respectively, the goal is to find a subnetwork $f(\mathbf{m} \odot \theta'_{\mathrm{ft}}) = \mathcal{P}^p_{\mathcal{L}_2}(f(\theta_{\mathrm{ft}}))$ that
4https://www.kaggle.com/c/quora-question-pairs
5See the licence information at https://fever.ai/download/fever/license.html