
and only ensembles different ways of aggregating the hidden states. Since the input text is usually much longer than the number of inputted CLS embeddings, Multi-CLS BERT is almost as efficient as the original BERT.
Allen-Zhu and Li (2020) discovered that the key to an effective ensemble model is the diversity of its individual models, and that models trained with different random seeds yield more diverse predictions than simply using dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016) or averaging the models' weights during training (Fort et al., 2019). To ensure the diversity of the CLS embeddings without fine-tuning Multi-CLS BERT with multiple seeds, we propose several novel diversification techniques. For example, we insert different linear layers into the transformer encoder for different CLS tokens. Furthermore, we propose a novel reparametrization trick that prevents these linear layers from learning the same weights during fine-tuning.
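To give a concrete picture of the kind of modification we mean, the following PyTorch sketch inserts K per-CLS linear layers parametrized as a frozen base weight plus a small trainable per-CLS offset, which is one simple way to keep the layers from collapsing to identical weights. The module name, the identity base, and the scaling factor are illustrative assumptions, not the exact formulation introduced in Sections 2.5 and 2.6.

```python
import torch
import torch.nn as nn


class MultiCLSProjection(nn.Module):
    """Illustrative sketch of K per-CLS linear layers.

    Each of the K CLS hidden states gets its own linear transformation. The
    reparametrization below (a frozen base weight plus a small trainable
    per-CLS offset) is one possible way to keep the K layers from collapsing
    to identical weights during fine-tuning; it is not necessarily the exact
    formulation used in Multi-CLS BERT.
    """

    def __init__(self, hidden_size: int, num_cls: int, scale: float = 0.1):
        super().__init__()
        self.scale = scale
        # Frozen base weight shared by all CLS tokens (identity here for simplicity).
        self.register_buffer("w_base", torch.eye(hidden_size))
        # Per-CLS trainable offsets, initialized differently for each CLS token.
        self.w_delta = nn.Parameter(0.02 * torch.randn(num_cls, hidden_size, hidden_size))

    def forward(self, cls_hidden: torch.Tensor) -> torch.Tensor:
        # cls_hidden: (batch, num_cls, hidden_size)
        # Effective weight of CLS token k: w_base + scale * w_delta[k]
        w = self.w_base.unsqueeze(0) + self.scale * self.w_delta  # (num_cls, hidden, hidden)
        # Apply each CLS token's own linear layer.
        return torch.einsum("bkh,kho->bko", cls_hidden, w)


# Example: 5 CLS embeddings of size 768 for a batch of 8 sequences.
proj = MultiCLSProjection(hidden_size=768, num_cls=5)
out = proj(torch.randn(8, 5, 768))  # -> (8, 5, 768)
```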
We test the effectiveness of these techniques by modifying the multi-task pretraining method proposed by Aroca-Ouellette and Rudzicz (2020), which combines four self-supervised losses. In our experiments, we demonstrate that the resulting Multi-CLS BERT can significantly improve the accuracy on GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), especially when the training sizes are small. Similar to the BERT ensemble model, we further show that multiple CLS embeddings significantly reduce the expected calibration error, which measures the quality of prediction confidence, on the GLUE benchmark.
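For reference, the expected calibration error bins test predictions by confidence and averages the gap between within-bin accuracy and confidence. The sketch below follows this standard definition; the binning scheme and the example values are illustrative, not taken from our evaluation code.

```python
import numpy as np


def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Standard ECE: weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_accuracy = float((predictions[in_bin] == labels[in_bin]).mean())
        bin_confidence = float(confidences[in_bin].mean())
        ece += in_bin.mean() * abs(bin_accuracy - bin_confidence)
    return ece


# Example with dummy values: three predictions with confidences 0.9, 0.6, and 0.8.
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1], [1, 1, 1]))
```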
1.1 Main Contributions
• We propose an efficient ensemble BERT model that does not incur any extra computational cost other than inserting a few CLS tokens and linear layers into the BERT encoder. Furthermore, we develop several diversification techniques for pretraining and fine-tuning the proposed Multi-CLS BERT model. We release our code at https://github.com/iesl/multicls/.
• We improve the GLUE performance reported in Aroca-Ouellette and Rudzicz (2020) using a better and more stable fine-tuning protocol, and we verify the effectiveness of its multi-task pretraining methods on GLUE and SuperGLUE with different training sizes.
• Building on the above state-of-the-art pretraining and fine-tuning for BERT, our experiments and analyses show that Multi-CLS BERT significantly outperforms BERT due to its similarity to a BERT ensemble model. Comprehensive ablation studies confirm the effectiveness of our diversification techniques.
2 Method
In Sections 2.1 and 2.2, we first review the state-of-the-art pretraining method of Aroca-Ouellette and Rudzicz (2020). In Section 2.3, we modify one of its losses, quick thoughts (QT), to pretrain our multiple-embedding representation. In Section 2.4, we encourage the CLS embeddings to capture the fine-grained semantic meaning of the input sequence by adding hard negatives during pretraining. To diversify the CLS embeddings, we modify the transformer encoder in Section 2.5 and propose a new reparametrization method for fine-tuning in Section 2.6.
2.1 Multi-task Pretraining
After testing many self-supervised losses, Aroca-Ouellette and Rudzicz (2020) find that combining the masked language modeling (MLM) loss, the TFIDF loss, the sentence ordering (SO) loss (Sun et al., 2020), and the quick thoughts (QT) loss (Logeswaran and Lee, 2018) leads to the best performance. The MLM loss predicts the masked words, and the TFIDF loss predicts the importance of the words in the document. Each input text sequence consists of multiple sentences; for the SO loss, the sentence order is swapped in some input sequences, and the CLS embedding is used to predict whether the order has been swapped. Finally, the QT loss encourages the CLS embeddings of consecutive sequences to be similar.
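For concreteness, a QT-style objective can be written as an in-batch contrastive loss over CLS embeddings. The minimal sketch below assumes dot-product scoring and treats the other sequences in the batch as negatives; this is illustrative rather than the exact formulation we use, which is detailed in the following sections.

```python
import torch
import torch.nn.functional as F


def quick_thoughts_loss(cls_a: torch.Tensor, cls_b: torch.Tensor) -> torch.Tensor:
    """In-batch contrastive sketch of a QT-style loss.

    cls_a[i] and cls_b[i] are CLS embeddings of two consecutive text sequences;
    every other sequence in the batch serves as a negative candidate.
    """
    # (batch, batch) similarity matrix between all pairs in the batch.
    scores = cls_a @ cls_b.t()
    # The matching (consecutive) pair sits on the diagonal.
    targets = torch.arange(cls_a.size(0), device=cls_a.device)
    return F.cross_entropy(scores, targets)


# Example: CLS embeddings of 16 consecutive-sequence pairs, hidden size 768.
loss = quick_thoughts_loss(torch.randn(16, 768), torch.randn(16, 768))
```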
We improve this state-of-the-art pretraining method by using multiple CLS embeddings to represent the input sequence and by using non-immediately consecutive sentences as hard negatives. Our training method is illustrated in Figure 2.
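As a hedged sketch of how such hard negatives could enter the contrastive objective, the function below extends the previous in-batch sketch with one non-immediately consecutive sequence per anchor; the dot-product scoring and the masking of in-batch positives are illustrative assumptions, and our actual formulation is given in Section 2.4.

```python
import torch
import torch.nn.functional as F


def qt_loss_with_hard_negatives(cls_anchor, cls_positive, cls_hard_neg):
    """QT-style contrastive sketch with one hard negative per anchor.

    The positive is the immediately following sequence; the hard negative is a
    nearby but non-immediately consecutive one; the remaining in-batch examples
    act as easy negatives.
    """
    pos_scores = (cls_anchor * cls_positive).sum(dim=-1, keepdim=True)   # (batch, 1)
    hard_scores = (cls_anchor * cls_hard_neg).sum(dim=-1, keepdim=True)  # (batch, 1)
    in_batch = cls_anchor @ cls_positive.t()                             # (batch, batch)
    # Mask out the diagonal: those entries are the positives, already scored above.
    mask = torch.eye(in_batch.size(0), dtype=torch.bool, device=in_batch.device)
    easy_scores = in_batch.masked_fill(mask, float("-inf"))
    logits = torch.cat([pos_scores, hard_scores, easy_scores], dim=1)
    # The correct "class" is always index 0, i.e., the true consecutive sequence.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)


# Example with random embeddings: batch of 16, hidden size 768.
loss = qt_loss_with_hard_negatives(
    torch.randn(16, 768), torch.randn(16, 768), torch.randn(16, 768)
)
```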
2.2 Quick Thoughts Loss
Two similar sentences tend to have the same label
in a downstream application, so pretraining should
pull the CLS embeddings of these similar sentences
closer. The QT loss achieves this goal by assuming