Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
Haw-Shiuan Chang†  Ruei-Yao Sun†
Amazon
USA
chawshiu@amazon.com
rueiyas@amazon.com
Kathryn Ricci  Andrew McCallum
CICS, UMass, Amherst
140 Governors Dr., Amherst, MA, USA
kathryn.d.ricci@gmail.com
mccallum@cs.umass.edu
Abstract
Ensembling BERT models often significantly improves accuracy, but at the cost of significantly more computation and memory footprint. In this work, we propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction tasks that is almost as efficient as a single BERT model. Multi-CLS BERT uses multiple CLS tokens with a parameterization and objective that encourage their diversity. Thus, instead of fine-tuning each BERT model in an ensemble (and running them all at test time), we need only fine-tune our single Multi-CLS BERT model (and run the one model at test time, ensembling just the multiple final CLS embeddings). To test its effectiveness, we build Multi-CLS BERT on top of a state-of-the-art pretraining method for BERT (Aroca-Ouellette and Rudzicz, 2020). In experiments on GLUE and SuperGLUE, we show that our Multi-CLS BERT reliably improves both overall accuracy and confidence estimation. When only 100 training samples are available in GLUE, the Multi-CLS BERT-Base model can even outperform the corresponding BERT-Large model. We analyze the behavior of our Multi-CLS BERT, showing that it has many of the same characteristics and behaviors as a typical 5-way BERT ensemble, but with nearly 4 times less computation and memory.
1 Introduction
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) is one of the most widely used language model (LM) architectures for natural language understanding (NLU) tasks. We often fine-tune the pretrained BERT or its variants, such as RoBERTa (Liu et al., 2019), so that the LM learns to aggregate all the contextualized word embeddings into a single CLS embedding for a downstream text classification task.
† indicates equal contribution. This work was done while the authors were at UMass.
Figure 1: Comparison of Multi-CLS BERT and the classic 5-BERT ensemble. Multi-CLS BERT ensembles only the multiple CLS embeddings within a single BERT encoder (fine-tuning and inference run once), rather than ensembling multiple BERT encoders with different parameter weights (fine-tuning and inference run 5 times).
During fine-tuning, different initializations and different training data orders significantly affect BERT's generalization performance, especially with a small training dataset (Dodge et al., 2020; Zhang et al., 2021a; Mosbach et al., 2021). One simple and popular solution to this issue is to fine-tune the BERT model multiple times using different random seeds and ensemble their predictions, which improves both accuracy and confidence estimation. Although very effective, the memory and computational cost of ensembling a large LM is often prohibitive (Xu et al., 2020; Liang et al., 2022). Naturally, we would like to ask, "Is it possible to ensemble BERT models at no extra cost?"
To answer this question, we propose Multi-CLS BERT, which enjoys the benefits of ensembling without sacrificing efficiency. Specifically, we input multiple CLS tokens to BERT and encourage the different CLS embeddings to aggregate information from different aspects of the input text. As shown in Figure 1, the proposed Multi-CLS BERT shares all the hidden states of the input text and only ensembles different ways of aggregating those hidden states. Since the input text is usually much longer than the number of inputted CLS embeddings, Multi-CLS BERT is almost as efficient as the original BERT.
Allen-Zhu and Li (2020) discovered that the key to an effective ensembling model is the diversity of the individual models, and that models trained using different random seeds yield more diverse predictions than simply using dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016) or averaging model weights during training (Fort et al., 2019). To ensure the diversity of the CLS embeddings without fine-tuning Multi-CLS BERT using multiple seeds, we propose several novel diversification techniques. For example, we insert different linear layers into the transformer encoder for different CLS tokens. Furthermore, we propose a novel re-parameterization trick that prevents the linear layers from learning the same weights during fine-tuning.
We test the effectiveness of these techniques by modifying the multi-task pretraining method proposed by Aroca-Ouellette and Rudzicz (2020), which combines four self-supervised losses. In our experiments, we demonstrate that the resulting Multi-CLS BERT significantly improves accuracy on GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), especially when the training sets are small. Similar to a BERT ensemble model, we further show that multiple CLS embeddings significantly reduce the expected calibration error, which measures the quality of prediction confidence, on the GLUE benchmark.
1.1 Main Contributions
- We propose an efficient ensemble BERT model that incurs no extra computational cost beyond inserting a few CLS tokens and linear layers into the BERT encoder. Furthermore, we develop several diversification techniques for pretraining and fine-tuning the proposed Multi-CLS BERT model. (Our code is released at https://github.com/iesl/multicls/.)
- We improve the GLUE performance reported in Aroca-Ouellette and Rudzicz (2020) using a better and more stable fine-tuning protocol, and we verify the effectiveness of its multi-task pretraining method on GLUE and SuperGLUE with different training sizes.
- Building on the above state-of-the-art pretraining and fine-tuning for BERT, our experiments and analyses show that Multi-CLS BERT significantly outperforms BERT due to its similarity to a BERT ensemble model. Comprehensive ablation studies confirm the effectiveness of our diversification techniques.
2 Method
In Sections 2.1 and 2.2, we first review the state-of-the-art pretraining method of Aroca-Ouellette and Rudzicz (2020). In Section 2.3, we modify one of its losses, quick thoughts (QT), to pretrain our multiple-embedding representation. In Section 2.4, we encourage the CLS embeddings to capture the fine-grained semantic meaning of the input sequence by adding hard negatives during pretraining. To diversify the CLS embeddings, we modify the transformer encoder in Section 2.5 and propose a new re-parameterization method for fine-tuning in Section 2.6.
2.1 Multi-task Pretraining
After testing many self-supervised losses, Aroca-Ouellette and Rudzicz (2020) find that combining the masked language modeling (MLM) loss, the TFIDF loss, the sentence ordering (SO) loss (Sun et al., 2020), and the quick thoughts (QT) loss (Logeswaran and Lee, 2018) leads to the best performance. The MLM loss predicts the masked words, and the TFIDF loss predicts the importance of the words in the document. Each input text sequence consists of multiple sentences. For the SO loss, they swap the sentence order in some input sequences and use the CLS embedding to predict whether the order was swapped. Finally, the QT loss encourages the CLS embeddings of consecutive sequences to be similar.
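For concreteness, the following sketch shows how the four losses could be combined (the function and tensor names are our illustrative assumptions; the equal weighting follows the simple-summation choice discussed for the MTL baseline in Section 3.1):

```python
import torch.nn.functional as F

def multi_task_loss(mlm_loss, tfidf_loss, so_logits, so_labels, qt_loss):
    """Combine the four self-supervised losses by simple summation
    (equal weights assumed; see the MTL setup in Section 3.1).

    so_logits: [batch, 2] scores predicting whether the sentence order
               in each input sequence was swapped.
    so_labels: [batch] 0/1 labels for the sentence ordering (SO) task.
    """
    so_loss = F.cross_entropy(so_logits, so_labels)
    return mlm_loss + tfidf_loss + so_loss + qt_loss
```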
To improve on this state-of-the-art pretraining method, we modify the multi-task pretraining by using multiple CLS embeddings to represent the input sequence and by using non-immediately consecutive sentences as hard negatives. Our training method is illustrated in Figure 2.
2.2 Quick Thoughts Loss
Two similar sentences tend to have the same label in a downstream application, so pretraining should pull the CLS embeddings of these similar sentences closer. The QT loss achieves this goal by assuming that consecutive text sequences are similar and encouraging their CLS embeddings to be similar.

Figure 2: Our MCQT, SO, MLM, and TFIDF losses, which are a modification of the multi-task pretraining proposed in Aroca-Ouellette and Rudzicz (2020). The multi-CLS quick thoughts (MCQT) loss maximizes the CLS similarities between a sequence (sentences 1 and 2) and the next sequence (sentences 3 and 4) while minimizing the CLS similarities to other random sequences and to the sequence after the next one (sentences 5 and 6). Notice that sentence 4 is inputted before sentence 3 because the sentence order is swapped for the SO loss.
Aroca-Ouellette and Rudzicz (2020) propose an efficient way of computing the QT loss within a batch by evenly splitting each batch of size $B$ into two parts. The first part contains $B/2$ text sequences randomly sampled from the pretraining corpus, and the second part contains the $B/2$ sequences that immediately follow those in the first part. Then, for each sequence in the first part, the consecutive sequence in the second part is used as the positive example and the other $B/2 - 1$ sequences as negative examples. We can write the QT loss for the sequences containing sentences 1, 2, 3, and 4 as

$$\mathcal{L}_{QT}(s^{12}, s^{34}) = -\log\left(\frac{\exp\left(\mathrm{Logit}^{QT}_{s^{12},s^{34}}\right)}{\sum_{s'}\exp\left(\mathrm{Logit}^{QT}_{s^{12},s'}\right)}\right), \qquad (1)$$

where $s'$ ranges over the sequences in the second part of the batch, $\mathrm{Logit}^{QT}_{s^{12},s^{34}} = \left(\frac{c^{12}}{\|c^{12}\|}\right)^{T}\frac{c^{34}}{\|c^{34}\|}$ is the score for classifying sequence $s^{34}$ as the positive example, and $\frac{c^{12}}{\|c^{12}\|}$ is the L2-normalized CLS embedding for sentences 1 and 2. The normalization is intended to stabilize the pretraining by limiting the gradients' magnitudes.
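For concreteness, a minimal PyTorch sketch of this in-batch QT loss is given below (tensor names and shapes are illustrative assumptions; the released implementation at https://github.com/iesl/multicls/ may differ):

```python
import torch
import torch.nn.functional as F

def qt_loss(c_first, c_second):
    """In-batch quick thoughts (QT) loss, a sketch of Eq. (1).

    c_first:  [B/2, D] CLS embeddings of the first-part sequences.
    c_second: [B/2, D] CLS embeddings of their immediate next sequences.
    """
    z1 = F.normalize(c_first, dim=-1)      # c / ||c||
    z2 = F.normalize(c_second, dim=-1)
    logits = z1 @ z2.t()                   # [B/2, B/2] cosine similarities
    # The diagonal entries are the positive pairs; all other second-part
    # sequences act as in-batch negatives.
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)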
2.3 Multiple CLS Embeddings
A text sequence could have multiple facets; two sequences could be similar in some facets but dissimilar in others, especially when the text sequences are long. The QT loss squeezes all facets of a sequence into a single embedding and encourages all facets of two consecutive sequences to be similar, potentially causing information loss.

Some facets might better align with the goal of a downstream application. For example, facets that contain more sentiment information would be more useful for sentiment analysis. To preserve the diverse facet information during pretraining, we propose the multi-CLS quick thoughts (MCQT) loss. The loss integrates two ways of computing the similarity of two sequences: the first computes the cosine similarity between the most similar facets, and the second computes the cosine similarity between the summations of all facets. We linearly combine the two as the logit of the two input sequences:
$$\mathrm{Logit}^{MC}_{s^{12},s^{34}} = \lambda \max_{i,j}\left(\frac{c^{12}_i}{\|c^{12}_i\|}\right)^{T}\frac{c^{34}_j}{\|c^{34}_j\|} + (1-\lambda)\left(\frac{\sum_i c^{12}_i}{\|\sum_i c^{12}_i\|}\right)^{T}\frac{\sum_j c^{34}_j}{\|\sum_j c^{34}_j\|}, \qquad (2)$$

where $\lambda$ is a constant hyperparameter, and $c^{12}_k$ and $c^{34}_k$ are the CLS embeddings of sentences 1-2 and sentences 3-4, respectively.
The first term only considers the most similar facets, allowing some facets to be dissimilar. Furthermore, this term implicitly diversifies the CLS embeddings by considering each CLS embedding independently. In contrast, the second term encourages the CLS embeddings to work collaboratively, as in a typical ensemble model, and also lets every CLS embedding receive gradients more evenly. Notice that we sum the CLS embeddings before the normalization so that the encoder can predict the magnitude of each CLS embedding as its weight in the summation.
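The logit in Eq. (2) can be computed for a whole batch as in the following sketch (shapes and names are our assumptions):

```python
import torch
import torch.nn.functional as F

def mcqt_logits(c_a, c_b, lam=0.1):
    """Multi-CLS QT logits between two sets of sequences, a sketch of Eq. (2).

    c_a: [N, K, D] K CLS (facet) embeddings for each of N anchor sequences.
    c_b: [M, K, D] facet embeddings of the M candidate sequences.
    Returns an [N, M] matrix of logits.
    """
    # Term 1: cosine similarity between the most similar pair of facets.
    za = F.normalize(c_a, dim=-1)                         # [N, K, D]
    zb = F.normalize(c_b, dim=-1)                         # [M, K, D]
    pairwise = torch.einsum('nkd,mjd->nmkj', za, zb)      # all facet pairs
    max_facet = pairwise.flatten(2).max(dim=-1).values    # [N, M]

    # Term 2: cosine similarity between the summations of all facets;
    # normalizing after the sum lets the encoder weight facets by magnitude.
    sa = F.normalize(c_a.sum(dim=1), dim=-1)              # [N, D]
    sb = F.normalize(c_b.sum(dim=1), dim=-1)              # [M, D]
    sum_sim = sa @ sb.t()                                 # [N, M]

    return lam * max_facet + (1.0 - lam) * sum_sim
```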
To show that the proposed method can improve the state-of-the-art pretraining method, we keep the MLM loss and the TFIDF loss unchanged. For the sentence ordering (SO) loss, we project the $K$ hidden states $h^c_k$ into an embedding $h^{SO}$ with hidden state size $D$ for predicting the sentence order: $h^{SO} = L^{SO}(\oplus_k h^c_k)$, where $\oplus_k h^c_k$ is the concatenation of the $K$ hidden states, with size $K \times D$.
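A minimal sketch of this concatenation-based SO head (the dimension values for BERT-Base are assumptions):

```python
import torch
import torch.nn as nn

K, D = 5, 768                          # number of CLS tokens, hidden size
so_head = nn.Linear(K * D, D)          # L^SO in the text

def so_embedding(h_cls):
    """h_cls: [batch, K, D] hidden states of the K CLS tokens."""
    return so_head(h_cls.flatten(1))   # concatenate the K states, project to size D
```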
2.4 Hard Negative
For a large transformer-based LM, distinguishing the next sequence from random sequences could be easy. The LM can achieve a low QT loss by outputting nearly identical CLS embeddings for sentences on the same topic while ignoring the fine-grained semantic information (Papyan et al., 2020). In this case, the multiple CLS embeddings might become underutilized.

Hard negatives are a common way of adjusting the difficulty of contrastive learning (Baldini Soares et al., 2019; Cohan et al., 2020). Our way of collecting hard negatives is illustrated in the bottom-left block of Figure 2. To efficiently add hard negatives during pretraining, we split the batch into three parts. For each sequence in the first part, we use its immediate next sequence in the second part as the positive example, the sequence after the next one in the third part as the hard negative, and all the other sequences in the second and third parts as easy negatives. We select the sequence after the next one as our hard negative because it usually shares the same topic with the input sequence but is more likely to have different fine-grained semantic facets than the immediate next sequence.

After adding the hard negatives, the modified QT loss of the three consecutive sequences becomes
$$\mathcal{L}_{MCQT}(s^{12}, s^{34}, s^{56}) = -\log\frac{\exp\left(\mathrm{Logit}^{MC}_{s^{12},s^{34}}\right)}{\sum_{s'\in\{s^{34},\ldots,s^{56},\ldots\}}\exp\left(\mathrm{Logit}^{MC}_{s^{12},s'}\right)} - \log\frac{\exp\left(\mathrm{Logit}^{MC}_{s^{56},s^{34}}\right)}{\sum_{s'\in\{s^{34},\ldots,s^{12},\ldots\}}\exp\left(\mathrm{Logit}^{MC}_{s^{56},s'}\right)}, \qquad (3)$$

where MCQT refers to multi-CLS quick thoughts, $\{s^{34},\ldots,s^{56},\ldots\}$ are all the sequences in the second and third parts, and $\{s^{34},\ldots,s^{12},\ldots\}$ are all the sequences in the first and second parts.

Figure 3: The architecture of the Multi-CLS BERT encoder built on the BERT-Base model. Different linear layers are applied to the hidden states corresponding to different CLS tokens to increase the diversity of the resulting CLS embeddings.
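Reusing the mcqt_logits sketch from Section 2.3, the two terms of Eq. (3) can be computed as follows (the three-part batch layout and the tensor names are our assumptions):

```python
import torch
import torch.nn.functional as F

def mcqt_loss(c_p1, c_p2, c_p3, lam=0.1):
    """Multi-CLS QT loss with hard negatives, a sketch of Eq. (3).

    c_p1, c_p2, c_p3: [B/3, K, D] facet embeddings of the three batch parts.
    Part 2 holds the sequence right after part 1, and part 3 holds the
    sequence after that (the hard negative of part 1).
    """
    n = c_p1.size(0)
    targets = torch.arange(n, device=c_p1.device)

    # First term: anchor = part 1, positive = its next sequence in part 2,
    # negatives = every other sequence in parts 2 and 3 (incl. the hard one).
    logits_1 = mcqt_logits(c_p1, torch.cat([c_p2, c_p3], dim=0), lam)
    loss_1 = F.cross_entropy(logits_1, targets)   # positive of row i is column i

    # Second term: anchor = part 3, positive = the preceding sequence in
    # part 2, negatives = every other sequence in parts 1 and 2.
    logits_2 = mcqt_logits(c_p3, torch.cat([c_p2, c_p1], dim=0), lam)
    loss_2 = F.cross_entropy(logits_2, targets)

    return loss_1 + loss_2
```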
2.5 Architecture-based Diversification
Initially, we simply input multiple special CLS tokens ([C1], ..., [CK]) after the original CLS token, [CLS0], and take the corresponding hidden states as the CLS embeddings, but we found that the CLS embeddings quickly become almost identical during pretraining.
Subsequently, instead of using the same final transformation head $H^{QT}$ for all CLS hidden states, we use a different linear layer $L_{O,k}$ in the final head $H^{MC}_k$ to transform the hidden state $h^c_k$ for the $k$-th CLS. We set the bias term in $L_{O,k}$ to the constant $0$ because we want the differences between the CLS embeddings to be dynamic and context-dependent.

Nevertheless, even though we differentiate the resulting CLS embeddings $c_k = H^{MC}_k(h^c_k)$, the hidden states $h^c_k$ before the transformation usually still collapse into almost identical embeddings.
To solve the collapsing problem, we insert multiple linear layers $L_{l,k}$ into the transformer encoder. In Figure 3, we illustrate our encoder architecture built on the BERT-Base model. After the 4th transformer layer, we insert the layers $L_{4,k}$ to transform the hidden states before inputting them to the 5th layer. Similarly, we insert $L_{8,k}$ between the 8th and 9th transformer layers. For BERT-Large, we insert $L_{l,k}(\cdot)$ after layer 8 and layer 16. Notice that although the architecture looks similar to an adapter (Houlsby et al., 2019) or prefix tuning (Li and Liang, 2021), our purpose is to diversify the CLS embeddings rather than to freeze parameters to save computational time.
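The inserted layers can be sketched as a small module applied to the hidden states after the chosen transformer layers (the CLS token positions and other integration details are our assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class MultiCLSDiversifier(nn.Module):
    """Applies a separate linear layer L_{l,k} to the hidden state of each of
    the K extra CLS tokens after a given transformer layer (a sketch; wiring
    it into the actual BERT encoder is omitted)."""

    def __init__(self, num_cls=5, hidden_size=768):
        super().__init__()
        self.linears = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_cls)
        )

    def forward(self, hidden_states):
        """hidden_states: [batch, seq_len, hidden_size]. The K CLS tokens are
        assumed to occupy positions 1..K, right after [CLS_0] at position 0."""
        out = hidden_states.clone()
        for k, linear in enumerate(self.linears):
            out[:, 1 + k] = linear(hidden_states[:, 1 + k])
        return out
```

For BERT-Base, two such modules would be applied, one between layers 4 and 5 and one between layers 8 and 9 (layers 8/9 and 16/17 for BERT-Large), as described above.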
2.6 Fine-Tuning
As shown in Figure 3, we input multiple CLS tokens into the BERT encoder during fine-tuning and pool the corresponding CLS hidden states into a single CLS embedding for each downstream task, in order to avoid overfitting and extra computational overhead. As a result, we can use the same classifier architecture on top of Multi-CLS BERT and BERT, which also simplifies their comparison.

We discover that simply summing all the CLS hidden states still usually makes the hidden states and the inserted linear layers (e.g., $L_{O,k}$) almost identical after fine-tuning. To avoid this collapse, we aggregate the CLS hidden states using a novel re-parameterization trick:
$$c^{MCFT} = \sum_{k} L^{FT}_{O,k}(h^c_k), \qquad (4)$$

where $L^{FT}_{O,k}(h^c_k) = \left(W_{O,k} - \frac{1}{K}\sum_{k'} W_{O,k'}\right) h^c_k$, and $W_{O,k}$ is the linear weight of $L_{O,k}$. Then, if all the $L^{FT}_{O,k}$ become identical (i.e., $\forall k,\; W_{O,k} = \frac{1}{K}\sum_{k'} W_{O,k'}$), we have $L^{FT}_{O,k}(h^c_k) = \mathbf{0} = c^{MCFT}$. However, gradient descent would not allow the model to constantly output the zero vector, so the $L^{FT}_{O,k}$ remain different during fine-tuning.
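A minimal sketch of the fine-tuning aggregation in Eq. (4) (the initialization below is only a placeholder; in practice the weights $W_{O,k}$ come from the pretrained $L_{O,k}$):

```python
import torch
import torch.nn as nn

class ReparamMultiCLSPooler(nn.Module):
    """Pools the K CLS hidden states into one embedding with the
    re-parameterization of Eq. (4), which discourages the K heads from
    collapsing to identical weights during fine-tuning (a sketch)."""

    def __init__(self, num_cls=5, hidden_size=768):
        super().__init__()
        # W_{O,k}: bias-free linear weights, one per CLS token. In practice
        # they are loaded from the pretrained L_{O,k}; the noisy identity
        # initialization here is only a placeholder.
        init = torch.stack([
            torch.eye(hidden_size) + 0.01 * torch.randn(hidden_size, hidden_size)
            for _ in range(num_cls)
        ])
        self.weights = nn.Parameter(init)                 # [K, D, D]

    def forward(self, h_cls):
        """h_cls: [batch, K, D] hidden states of the K CLS tokens."""
        mean_w = self.weights.mean(dim=0, keepdim=True)   # (1/K) sum_k' W_{O,k'}
        centered = self.weights - mean_w                  # W_{O,k} minus the mean
        # c^MCFT = sum_k (W_{O,k} - mean_W) h^c_k
        return torch.einsum('bkd,ked->bke', h_cls, centered).sum(dim=1)
```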
3 Experiments
The parameters of neural networks become more restricted as more training samples are available (MacKay, 1995), and the improvement of deep ensemble models comes from the diversity of the individual models (Fort et al., 2019), so the benefits of ensembling are usually more obvious when the training set is smaller. Therefore, in addition to using the full training datasets, we also test settings where the models are trained on 1k samples (Zhang et al., 2021a) or 100 samples from each task in GLUE (Wang et al., 2019b) or SuperGLUE (Wang et al., 2019a). Another benefit of the 1k- and 100-sample settings is that the average scores are significantly influenced by most of the datasets rather than by only a subset of relatively small datasets (Card et al., 2020).
3.1 Experiment Setup
To accelerate the pretraining experiments, we initialize the weights using the pretrained BERT models (Devlin et al., 2019) and continue pretraining with different loss functions on Wikipedia 2021 and BookCorpus (Zhu et al., 2015).

All of the methods are based on uncased BERT, as in Aroca-Ouellette and Rudzicz (2020). We compare the following methods:
- Pretrained: The pretrained BERT model released by Devlin et al. (2019).
- MTL: Pretraining using the four losses selected in Aroca-Ouellette and Rudzicz (2020): MLM, QT, SO, and TFIDF. We remove the continual-learning procedure used in ERNIE (Sun et al., 2020) because we find that simply summing all the losses leads to better performance (see our ablation study in Section 3.3).
- Ours (K=5, λ): The proposed Multi-CLS BERT method using 5 CLS tokens. We show results for $\lambda = \{0, 0.1, 0.5, 1\}$ in Equation 2. We reduce the maximal sentence length by 5 to accommodate the 5 extra CLS tokens.
- Ours (K=1): We set $K = 1$ in our method to verify the effectiveness of using multiple embeddings. During fine-tuning, the CLS embedding is a linear transformation of the single facet: $CLS = L_{O,1}(h^f_1)$.
The GLUE and SuperGLUE scores are significantly influenced by the pretraining random seeds (Sellam et al., 2021) and the fine-tuning random seeds (Dodge et al., 2020; Zhang et al., 2021a; Mosbach et al., 2021). To stably evaluate the performance of the different pretraining methods, we pretrain models using four random seeds, fine-tune each pretrained model using four random seeds, and report the average performance on the development set across all 16 random seeds. To further stabilize the fine-tuning process and reach better performance, we follow the fine-tuning suggestions from Zhang et al. (2021a) and Mosbach et al. (2021), including training longer, limiting the gradient norm, and using Adam (Kingma and Ba, 2015) with bias correction and warmup.
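As an illustration of this protocol, a fine-tuning loop might look like the sketch below (all hyperparameter values, and the assumption that the model returns a scalar loss, are ours rather than the paper's):

```python
import torch

def finetune(model, train_loader, num_epochs=20, lr=2e-5, warmup_frac=0.1):
    """A sketch of the stabilized fine-tuning recipe: train longer, clip the
    gradient norm, and use Adam with bias correction plus linear warmup."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # bias correction is on by default
    total_steps = num_epochs * len(train_loader)
    warmup_steps = max(1, int(warmup_frac * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                        # linear warmup
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for _ in range(num_epochs):
        for batch in train_loader:
            loss = model(**batch)                             # assumed to return a scalar loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # limit the gradient norm
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```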