Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
Haw-Shiuan Chang†  Ruei-Yao Sun†
Amazon
USA
chawshiu@amazon.com
rueiyas@amazon.com
Kathryn Ricci  Andrew McCallum
CICS, UMass, Amherst
140 Governors Dr., Amherst, MA, USA
kathryn.d.ricci@gmail.com
mccallum@cs.umass.edu
Abstract
Ensembling BERT models often significantly improves accuracy, but at the cost of significantly more computation and memory footprint. In this work, we propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction tasks that is almost as efficient as a single BERT model. Multi-CLS BERT uses multiple CLS tokens with a parameterization and objective that encourage their diversity. Thus, instead of fine-tuning each BERT model in an ensemble (and running them all at test time), we need only fine-tune our single Multi-CLS BERT model (and run the one model at test time, ensembling just the multiple final CLS embeddings). To test its effectiveness, we build Multi-CLS BERT on top of a state-of-the-art pretraining method for BERT (Aroca-Ouellette and Rudzicz, 2020). In experiments on GLUE and SuperGLUE, we show that our Multi-CLS BERT reliably improves both overall accuracy and confidence estimation. When only 100 training samples are available in GLUE, the Multi-CLS BERT-Base model can even outperform the corresponding BERT-Large model. We analyze the behavior of our Multi-CLS BERT, showing that it has many of the same characteristics and behaviors as a typical 5-way BERT ensemble, but with nearly 4 times less computation and memory.
1 Introduction
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) is one of the most widely used language model (LM) architectures for natural language understanding (NLU) tasks. We often fine-tune the pretrained BERT or its variants, such as RoBERTa (Liu et al., 2019), so that the LM learns to aggregate all the contextualized word embeddings into a single CLS embedding for a downstream text classification task.
† indicates equal contribution. This work was done while the authors were at UMass.
Figure 1: Comparison of Multi-CLS BERT and the classic 5-BERT ensemble. Multi-CLS BERT ensembles only the multiple CLS embeddings within a single BERT encoder (fine-tuning and inference run once), rather than ensembling multiple BERT encoders with different parameter weights (fine-tuning and inference run 5 times).
During fine-tuning, different initializations and different training data orders significantly affect BERT's generalization performance, especially with a small training dataset (Dodge et al., 2020; Zhang et al., 2021a; Mosbach et al., 2021). One simple and popular solution to this issue is to fine-tune the BERT model multiple times using different random seeds and ensemble their predictions, which improves both accuracy and confidence estimation. Although very effective, the memory and computational cost of ensembling a large LM is often prohibitive (Xu et al., 2020; Liang et al., 2022). Naturally, we would like to ask, "Is it possible to ensemble BERT models at no extra cost?"
To answer this question, we propose Multi-CLS BERT, which enjoys the benefits of ensembling without sacrificing efficiency. Specifically, we input multiple CLS tokens to BERT and encourage the different CLS embeddings to aggregate information from different aspects of the input text. As shown in Figure 1, the proposed Multi-CLS BERT shares all the hidden states of the input text and only ensembles different ways of aggregating those hidden states. Since the input text is usually much longer than the number of inputted CLS embeddings, Multi-CLS BERT is almost as efficient as the original BERT.
Allen-Zhu and Li (2020) discovered that the key to an effective ensembling model is the diversity of the individual models, and that models trained using different random seeds yield more diverse predictions than simply using dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016) or averaging model weights during training (Fort et al., 2019). To ensure the diversity of the CLS embeddings without fine-tuning Multi-CLS BERT using multiple seeds, we propose several novel diversification techniques. For example, we insert different linear layers into the transformer encoder for different CLS tokens. Furthermore, we propose a novel re-parameterization trick that prevents the linear layers from learning the same weights during fine-tuning.
We test the effectiveness of these techniques by modifying the multi-task pretraining method proposed by Aroca-Ouellette and Rudzicz (2020), which combines four self-supervised losses. In our experiments, we demonstrate that the resulting Multi-CLS BERT significantly improves accuracy on GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), especially when the training sets are small. Similar to a BERT ensemble model, we further show that multiple CLS embeddings significantly reduce the expected calibration error, which measures the quality of prediction confidence, on the GLUE benchmark.
1.1 Main Contributions
- We propose an efficient ensemble BERT model that incurs no extra computational cost beyond inserting a few CLS tokens and linear layers into the BERT encoder. Furthermore, we develop several diversification techniques for pretraining and fine-tuning the proposed Multi-CLS BERT model. (Our code is released at https://github.com/iesl/multicls/.)
- We improve the GLUE performance reported in Aroca-Ouellette and Rudzicz (2020) using a better and more stable fine-tuning protocol, and we verify the effectiveness of its multi-task pretraining method on GLUE and SuperGLUE with different training sizes.
- Building on the above state-of-the-art pretraining and fine-tuning for BERT, our experiments and analyses show that Multi-CLS BERT significantly outperforms BERT due to its similarity to a BERT ensemble model. Comprehensive ablation studies confirm the effectiveness of our diversification techniques.
2 Method
In Sections 2.1 and 2.2, we first review the state-of-the-art pretraining method of Aroca-Ouellette and Rudzicz (2020). In Section 2.3, we modify one of its losses, quick thoughts (QT), to pretrain our multiple-embedding representation. In Section 2.4, we encourage the CLS embeddings to capture the fine-grained semantic meaning of the input sequence by adding hard negatives during pretraining. To diversify the CLS embeddings, we modify the transformer encoder in Section 2.5 and propose a new re-parameterization method for fine-tuning in Section 2.6.
2.1 Multi-task Pretraining
After testing many self-supervised losses, Aroca-Ouellette and Rudzicz (2020) find that combining the masked language modeling (MLM) loss, the TFIDF loss, the sentence ordering (SO) loss (Sun et al., 2020), and the quick thoughts (QT) loss (Logeswaran and Lee, 2018) leads to the best performance. The MLM loss predicts the masked words, and the TFIDF loss predicts the importance of the words in the document. Each input text sequence consists of multiple sentences. For the SO loss, they swap the sentence order in some input sequences and use the CLS embedding to predict whether the order was swapped. Finally, the QT loss encourages the CLS embeddings of consecutive sequences to be similar.
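For concreteness, the following sketch shows how the four losses could be combined (the function and tensor names are our illustrative assumptions; the equal weighting follows the simple-summation choice discussed for the MTL baseline in Section 3.1):

```python
import torch.nn.functional as F

def multi_task_loss(mlm_loss, tfidf_loss, so_logits, so_labels, qt_loss):
    """Combine the four self-supervised losses by simple summation
    (equal weights assumed; see the MTL setup in Section 3.1).

    so_logits: [batch, 2] scores predicting whether the sentence order
               in each input sequence was swapped.
    so_labels: [batch] 0/1 labels for the sentence ordering (SO) task.
    """
    so_loss = F.cross_entropy(so_logits, so_labels)
    return mlm_loss + tfidf_loss + so_loss + qt_loss
```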
To improve on this state-of-the-art pretraining method, we modify the multi-task pretraining by using multiple CLS embeddings to represent the input sequence and by using non-immediately consecutive sentences as hard negatives. Our training method is illustrated in Figure 2.
2.2 Quick Thoughts Loss
Two similar sentences tend to have the same label in a downstream application, so pretraining should pull the CLS embeddings of these similar sentences closer. The QT loss achieves this goal by assuming that consecutive text sequences are similar and encouraging their CLS embeddings to be similar.

Figure 2: Our MCQT, SO, MLM, and TFIDF losses, which are a modification of the multi-task pretraining proposed in Aroca-Ouellette and Rudzicz (2020). The multi-CLS quick thoughts (MCQT) loss maximizes the CLS similarities between a sequence (sentences 1 and 2) and the next sequence (sentences 3 and 4) while minimizing the CLS similarities to other random sequences and to the sequence after the next one (sentences 5 and 6). Notice that sentence 4 is inputted before sentence 3 because the sentence order is swapped for the SO loss.
Aroca-Ouellette and Rudzicz (2020) propose an efficient way of computing the QT loss within a batch by evenly splitting each batch of size $B$ into two parts. The first part contains $B/2$ text sequences randomly sampled from the pretraining corpus, and the second part contains the $B/2$ sequences that immediately follow those in the first part. Then, for each sequence in the first part, the consecutive sequence in the second part is used as the positive example and the other $B/2 - 1$ sequences as negative examples. We can write the QT loss for the sequences containing sentences 1, 2, 3, and 4 as

$$\mathcal{L}_{QT}(s^{12}, s^{34}) = -\log\left(\frac{\exp\left(\mathrm{Logit}^{QT}_{s^{12},s^{34}}\right)}{\sum_{s'}\exp\left(\mathrm{Logit}^{QT}_{s^{12},s'}\right)}\right), \qquad (1)$$

where $s'$ ranges over the sequences in the second part of the batch, $\mathrm{Logit}^{QT}_{s^{12},s^{34}} = \left(\frac{c^{12}}{\|c^{12}\|}\right)^{T}\frac{c^{34}}{\|c^{34}\|}$ is the score for classifying sequence $s^{34}$ as the positive example, and $\frac{c^{12}}{\|c^{12}\|}$ is the L2-normalized CLS embedding for sentences 1 and 2. The normalization is intended to stabilize the pretraining by limiting the gradients' magnitudes.
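For concreteness, a minimal PyTorch sketch of this in-batch QT loss is given below (tensor names and shapes are illustrative assumptions; the released implementation at https://github.com/iesl/multicls/ may differ):

```python
import torch
import torch.nn.functional as F

def qt_loss(c_first, c_second):
    """In-batch quick thoughts (QT) loss, a sketch of Eq. (1).

    c_first:  [B/2, D] CLS embeddings of the first-part sequences.
    c_second: [B/2, D] CLS embeddings of their immediate next sequences.
    """
    z1 = F.normalize(c_first, dim=-1)      # c / ||c||
    z2 = F.normalize(c_second, dim=-1)
    logits = z1 @ z2.t()                   # [B/2, B/2] cosine similarities
    # The diagonal entries are the positive pairs; all other second-part
    # sequences act as in-batch negatives.
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)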
2.3 Multiple CLS Embeddings
A text sequence could have multiple facets; two sequences could be similar in some facets but dissimilar in others, especially when the text sequences are long. The QT loss squeezes all facets of a sequence into a single embedding and encourages all facets of two consecutive sequences to be similar, potentially causing information loss.

Some facets might better align with the goal of a downstream application. For example, facets that contain more sentiment information would be more useful for sentiment analysis. To preserve the diverse facet information during pretraining, we propose the multi-CLS quick thoughts (MCQT) loss. The loss integrates two ways of computing the similarity of two sequences: the first computes the cosine similarity between the most similar facets, and the second computes the cosine similarity between the summations of all facets. We linearly combine the two as the logit of the two input sequences:
$$\mathrm{Logit}^{MC}_{s^{12},s^{34}} = \lambda \max_{i,j}\left(\frac{c^{12}_i}{\|c^{12}_i\|}\right)^{T}\frac{c^{34}_j}{\|c^{34}_j\|} + (1-\lambda)\left(\frac{\sum_i c^{12}_i}{\|\sum_i c^{12}_i\|}\right)^{T}\frac{\sum_j c^{34}_j}{\|\sum_j c^{34}_j\|}, \qquad (2)$$

where $\lambda$ is a constant hyperparameter, and $c^{12}_k$ and $c^{34}_k$ are the CLS embeddings of sentences 1-2 and sentences 3-4, respectively.
The first term only considers the most similar facets, allowing some facets to be dissimilar. Furthermore, this term implicitly diversifies the CLS embeddings by considering each CLS embedding independently. In contrast, the second term encourages the CLS embeddings to work collaboratively, as in a typical ensemble model, and also lets every CLS embedding receive gradients more evenly. Notice that we sum the CLS embeddings before the normalization so that the encoder can predict the magnitude of each CLS embedding as its weight in the summation.
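The logit in Eq. (2) can be computed for a whole batch as in the following sketch (shapes and names are our assumptions):

```python
import torch
import torch.nn.functional as F

def mcqt_logits(c_a, c_b, lam=0.1):
    """Multi-CLS QT logits between two sets of sequences, a sketch of Eq. (2).

    c_a: [N, K, D] K CLS (facet) embeddings for each of N anchor sequences.
    c_b: [M, K, D] facet embeddings of the M candidate sequences.
    Returns an [N, M] matrix of logits.
    """
    # Term 1: cosine similarity between the most similar pair of facets.
    za = F.normalize(c_a, dim=-1)                         # [N, K, D]
    zb = F.normalize(c_b, dim=-1)                         # [M, K, D]
    pairwise = torch.einsum('nkd,mjd->nmkj', za, zb)      # all facet pairs
    max_facet = pairwise.flatten(2).max(dim=-1).values    # [N, M]

    # Term 2: cosine similarity between the summations of all facets;
    # normalizing after the sum lets the encoder weight facets by magnitude.
    sa = F.normalize(c_a.sum(dim=1), dim=-1)              # [N, D]
    sb = F.normalize(c_b.sum(dim=1), dim=-1)              # [M, D]
    sum_sim = sa @ sb.t()                                 # [N, M]

    return lam * max_facet + (1.0 - lam) * sum_sim
```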
To show that the proposed method can improve the state-of-the-art pretraining method, we keep the MLM loss and the TFIDF loss unchanged. For the sentence ordering (SO) loss, we project the $K$ hidden states $h^c_k$ into an embedding $h^{SO}$ with hidden state size $D$ for predicting the sentence order: $h^{SO} = L^{SO}(\oplus_k h^c_k)$, where $\oplus_k h^c_k$ is the concatenation of the $K$ hidden states, with size $K \times D$.
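A minimal sketch of this concatenation-based SO head (the dimension values for BERT-Base are assumptions):

```python
import torch
import torch.nn as nn

K, D = 5, 768                          # number of CLS tokens, hidden size
so_head = nn.Linear(K * D, D)          # L^SO in the text

def so_embedding(h_cls):
    """h_cls: [batch, K, D] hidden states of the K CLS tokens."""
    return so_head(h_cls.flatten(1))   # concatenate the K states, project to size D
```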
2.4 Hard Negative
For a large transformer-based LM, distinguishing the next sequence from random sequences could be easy. The LM can achieve a low QT loss by outputting nearly identical CLS embeddings for sentences on the same topic while ignoring the fine-grained semantic information (Papyan et al., 2020). In this case, the multiple CLS embeddings might become underutilized.

Hard negatives are a common way of adjusting the difficulty of contrastive learning (Baldini Soares et al., 2019; Cohan et al., 2020). Our way of collecting hard negatives is illustrated in the bottom-left block of Figure 2. To efficiently add hard negatives during pretraining, we split the batch into three parts. For each sequence in the first part, we use its immediate next sequence in the second part as the positive example, the sequence after the next one in the third part as the hard negative, and all the other sequences in the second and third parts as easy negatives. We select the sequence after the next one as our hard negative because it usually shares the same topic with the input sequence but is more likely to have different fine-grained semantic facets than the immediate next sequence.

After adding the hard negatives, the modified QT loss of the three consecutive sequences becomes
$$\mathcal{L}_{MCQT}(s^{12}, s^{34}, s^{56}) = -\log\frac{\exp\left(\mathrm{Logit}^{MC}_{s^{12},s^{34}}\right)}{\sum_{s'\in\{s^{34},\ldots,s^{56},\ldots\}}\exp\left(\mathrm{Logit}^{MC}_{s^{12},s'}\right)} - \log\frac{\exp\left(\mathrm{Logit}^{MC}_{s^{56},s^{34}}\right)}{\sum_{s'\in\{s^{34},\ldots,s^{12},\ldots\}}\exp\left(\mathrm{Logit}^{MC}_{s^{56},s'}\right)}, \qquad (3)$$

where MCQT refers to multi-CLS quick thoughts, $\{s^{34},\ldots,s^{56},\ldots\}$ are all the sequences in the second and third parts, and $\{s^{34},\ldots,s^{12},\ldots\}$ are all the sequences in the first and second parts.

Figure 3: The architecture of the Multi-CLS BERT encoder built on the BERT-Base model. Different linear layers are applied to the hidden states corresponding to different CLS tokens to increase the diversity of the resulting CLS embeddings.
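Reusing the mcqt_logits sketch from Section 2.3, the two terms of Eq. (3) can be computed as follows (the three-part batch layout and the tensor names are our assumptions):

```python
import torch
import torch.nn.functional as F

def mcqt_loss(c_p1, c_p2, c_p3, lam=0.1):
    """Multi-CLS QT loss with hard negatives, a sketch of Eq. (3).

    c_p1, c_p2, c_p3: [B/3, K, D] facet embeddings of the three batch parts.
    Part 2 holds the sequence right after part 1, and part 3 holds the
    sequence after that (the hard negative of part 1).
    """
    n = c_p1.size(0)
    targets = torch.arange(n, device=c_p1.device)

    # First term: anchor = part 1, positive = its next sequence in part 2,
    # negatives = every other sequence in parts 2 and 3 (incl. the hard one).
    logits_1 = mcqt_logits(c_p1, torch.cat([c_p2, c_p3], dim=0), lam)
    loss_1 = F.cross_entropy(logits_1, targets)   # positive of row i is column i

    # Second term: anchor = part 3, positive = the preceding sequence in
    # part 2, negatives = every other sequence in parts 1 and 2.
    logits_2 = mcqt_logits(c_p3, torch.cat([c_p2, c_p1], dim=0), lam)
    loss_2 = F.cross_entropy(logits_2, targets)

    return loss_1 + loss_2
```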
2.5 Architecture-based Diversification
Initially, we simply input multiple special CLS tokens ([C1], ..., [CK]) after the original CLS token, [CLS0], and take the corresponding hidden states as the CLS embeddings, but we found that the CLS embeddings quickly become almost identical during pretraining.
Subsequently, instead of using the same final transformation head $H^{QT}$ for all CLS hidden states, we use a different linear layer $L_{O,k}$ in the final head $H^{MC}_k$ to transform the hidden state $h^c_k$ for the $k$-th CLS. We set the bias term in $L_{O,k}$ to the constant $0$ because we want the differences between the CLS embeddings to be dynamic and context-dependent.

Nevertheless, even though we differentiate the resulting CLS embeddings $c_k = H^{MC}_k(h^c_k)$, the hidden states $h^c_k$ before the transformation usually still collapse into almost identical embeddings.
To solve the collapsing problem, we insert multiple linear layers $L_{l,k}$ into the transformer encoder. In Figure 3, we illustrate our encoder architecture built on the BERT-Base model. After the 4th transformer layer, we insert the layers $L_{4,k}$ to transform the hidden states before inputting them to the 5th layer. Similarly, we insert $L_{8,k}$ between the 8th and 9th transformer layers. For BERT-Large, we insert $L_{l,k}(\cdot)$ after layer 8 and layer 16. Notice that although the architecture looks similar to an adapter (Houlsby et al., 2019) or prefix tuning (Li and Liang, 2021), our purpose is to diversify the CLS embeddings rather than to freeze parameters to save computational time.
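The inserted layers can be sketched as a small module applied to the hidden states after the chosen transformer layers (the CLS token positions and other integration details are our assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class MultiCLSDiversifier(nn.Module):
    """Applies a separate linear layer L_{l,k} to the hidden state of each of
    the K extra CLS tokens after a given transformer layer (a sketch; wiring
    it into the actual BERT encoder is omitted)."""

    def __init__(self, num_cls=5, hidden_size=768):
        super().__init__()
        self.linears = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_cls)
        )

    def forward(self, hidden_states):
        """hidden_states: [batch, seq_len, hidden_size]. The K CLS tokens are
        assumed to occupy positions 1..K, right after [CLS_0] at position 0."""
        out = hidden_states.clone()
        for k, linear in enumerate(self.linears):
            out[:, 1 + k] = linear(hidden_states[:, 1 + k])
        return out
```

For BERT-Base, two such modules would be applied, one between layers 4 and 5 and one between layers 8 and 9 (layers 8/9 and 16/17 for BERT-Large), as described above.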
2.6 Fine-Tuning
As shown in Figure 3, we input multiple CLS tokens into the BERT encoder during fine-tuning and pool the corresponding CLS hidden states into a single CLS embedding for each downstream task, in order to avoid overfitting and extra computational overhead. As a result, we can use the same classifier architecture on top of Multi-CLS BERT and BERT, which also simplifies their comparison.

We discover that simply summing all the CLS hidden states still usually makes the hidden states and the inserted linear layers (e.g., $L_{O,k}$) almost identical after fine-tuning. To avoid this collapse, we aggregate the CLS hidden states using a novel re-parameterization trick:
$$c^{MCFT} = \sum_{k} L^{FT}_{O,k}(h^c_k), \qquad (4)$$

where $L^{FT}_{O,k}(h^c_k) = \left(W_{O,k} - \frac{1}{K}\sum_{k'} W_{O,k'}\right) h^c_k$, and $W_{O,k}$ is the linear weight of $L_{O,k}$. Then, if all the $L^{FT}_{O,k}$ become identical (i.e., $\forall k,\; W_{O,k} = \frac{1}{K}\sum_{k'} W_{O,k'}$), we have $L^{FT}_{O,k}(h^c_k) = \mathbf{0} = c^{MCFT}$. However, gradient descent would not allow the model to constantly output the zero vector, so the $L^{FT}_{O,k}$ remain different during fine-tuning.
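A minimal sketch of the fine-tuning aggregation in Eq. (4) (the initialization below is only a placeholder; in practice the weights $W_{O,k}$ come from the pretrained $L_{O,k}$):

```python
import torch
import torch.nn as nn

class ReparamMultiCLSPooler(nn.Module):
    """Pools the K CLS hidden states into one embedding with the
    re-parameterization of Eq. (4), which discourages the K heads from
    collapsing to identical weights during fine-tuning (a sketch)."""

    def __init__(self, num_cls=5, hidden_size=768):
        super().__init__()
        # W_{O,k}: bias-free linear weights, one per CLS token. In practice
        # they are loaded from the pretrained L_{O,k}; the noisy identity
        # initialization here is only a placeholder.
        init = torch.stack([
            torch.eye(hidden_size) + 0.01 * torch.randn(hidden_size, hidden_size)
            for _ in range(num_cls)
        ])
        self.weights = nn.Parameter(init)                 # [K, D, D]

    def forward(self, h_cls):
        """h_cls: [batch, K, D] hidden states of the K CLS tokens."""
        mean_w = self.weights.mean(dim=0, keepdim=True)   # (1/K) sum_k' W_{O,k'}
        centered = self.weights - mean_w                  # W_{O,k} minus the mean
        # c^MCFT = sum_k (W_{O,k} - mean_W) h^c_k
        return torch.einsum('bkd,ked->bke', h_cls, centered).sum(dim=1)
```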
3 Experiments
The parameters of neural networks become more restricted as more training samples are available (MacKay, 1995), and the improvement of deep ensemble models comes from the diversity of the individual models (Fort et al., 2019), so the benefits of ensembling are usually more obvious when the training set is smaller. Therefore, in addition to using the full training datasets, we also test settings where the models are trained on 1k samples (Zhang et al., 2021a) or 100 samples from each task in GLUE (Wang et al., 2019b) or SuperGLUE (Wang et al., 2019a). Another benefit of the 1k- and 100-sample settings is that the average scores are significantly influenced by most of the datasets rather than by only a subset of relatively small datasets (Card et al., 2020).
3.1 Experiment Setup
To accelerate the pretraining experiments, we initialize the weights using the pretrained BERT models (Devlin et al., 2019) and continue pretraining with different loss functions on Wikipedia 2021 and BookCorpus (Zhu et al., 2015).

All of the methods are based on uncased BERT, as in Aroca-Ouellette and Rudzicz (2020). We compare the following methods:
- Pretrained: The pretrained BERT model released by Devlin et al. (2019).
- MTL: Pretraining using the four losses selected in Aroca-Ouellette and Rudzicz (2020): MLM, QT, SO, and TFIDF. We remove the continual-learning procedure used in ERNIE (Sun et al., 2020) because we find that simply summing all the losses leads to better performance (see our ablation study in Section 3.3).
- Ours (K=5, λ): The proposed Multi-CLS BERT method using 5 CLS tokens. We show results for $\lambda = \{0, 0.1, 0.5, 1\}$ in Equation 2. We reduce the maximal sentence length by 5 to accommodate the 5 extra CLS tokens.
- Ours (K=1): We set $K = 1$ in our method to verify the effectiveness of using multiple embeddings. During fine-tuning, the CLS embedding is a linear transformation of the single facet: $CLS = L_{O,1}(h^f_1)$.
The GLUE and SuperGLUE scores are significantly influenced by the pretraining random seeds (Sellam et al., 2021) and the fine-tuning random seeds (Dodge et al., 2020; Zhang et al., 2021a; Mosbach et al., 2021). To stably evaluate the performance of the different pretraining methods, we pretrain models using four random seeds, fine-tune each pretrained model using four random seeds, and report the average performance on the development set across all 16 random seeds. To further stabilize the fine-tuning process and reach better performance, we follow the fine-tuning suggestions from Zhang et al. (2021a) and Mosbach et al. (2021), including training longer, limiting the gradient norm, and using Adam (Kingma and Ba, 2015) with bias correction and warmup.
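As an illustration of this protocol, a fine-tuning loop might look like the sketch below (all hyperparameter values, and the assumption that the model returns a scalar loss, are ours rather than the paper's):

```python
import torch

def finetune(model, train_loader, num_epochs=20, lr=2e-5, warmup_frac=0.1):
    """A sketch of the stabilized fine-tuning recipe: train longer, clip the
    gradient norm, and use Adam with bias correction plus linear warmup."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # bias correction is on by default
    total_steps = num_epochs * len(train_loader)
    warmup_steps = max(1, int(warmup_frac * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                        # linear warmup
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for _ in range(num_epochs):
        for batch in train_loader:
            loss = model(**batch)                             # assumed to return a scalar loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # limit the gradient norm
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```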