difficult to explicitly establish connections between parametric vectors that share the same distance but point in opposite directions, which in turn results in serious information loss in position encoding.
Inspired by this, we propose a novel Decoupled
Directional Relative Position (DDRP) encoding. In
detail, DDRP decomposes the classical relative position embedding (Shaw et al., 2018) into two embeddings, one storing the relative distance features and the other storing the directional features, and then multiplies the two together explicitly to derive the final decoupled relative position embedding, allowing the originally entangled distance and directional information to be as distinguishable as possible.
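As a rough illustration of this decoupling, the following sketch (hypothetical PyTorch code, not the authors' released implementation) keeps one embedding table for clipped relative distances and a two-entry table for direction, and multiplies them element-wise; the clipping scheme and the way the result is injected into attention scores follow Shaw et al. (2018) only by assumption.

```python
import torch
import torch.nn as nn

class DecoupledRelativePosition(nn.Module):
    """Sketch of a decoupled relative position embedding (hypothetical)."""

    def __init__(self, max_distance: int, head_dim: int):
        super().__init__()
        # One vector per clipped relative distance 0 .. max_distance.
        self.distance_emb = nn.Embedding(max_distance + 1, head_dim)
        # One vector per direction: 0 = left/self, 1 = right.
        self.direction_emb = nn.Embedding(2, head_dim)
        self.max_distance = max_distance

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]               # (L, L) signed offsets
        dist = rel.abs().clamp(max=self.max_distance)   # shared by both directions
        direction = (rel > 0).long()                    # sign of the offset
        # Explicit product keeps distance and direction distinguishable
        # while sharing the distance vectors across the two directions.
        return self.distance_emb(dist) * self.direction_emb(direction)

# (L, L, head_dim) embeddings, added into attention scores in the style of
# Shaw et al. (2018) relative position attention.
rel_emb = DecoupledRelativePosition(max_distance=128, head_dim=64)(seq_len=16)
```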
(2) Model Representation Differentiations.
Our analysis shows that there is non-negligible confusion in the representations of pre-trained BERT, as evidenced by the high consistency of last hidden states across different tokens and of attention weights across different heads. Similar last hidden states give rise to the anisotropy problem (Mimno and Thompson, 2017), which confines token vectors to a narrow representation space and thus makes it more difficult for the
model to capture deep semantics. Considering at-
tention weights contain rich linguistic knowledge
(Clark et al., 2019; Jawahar et al., 2019), we ar-
gue that high consistency in attention weights also
constrains the ability of the model to capture multi-
aspect information. To alleviate the representa-
tion confusion between different tokens and heads
caused by high information overlap, we propose
two novel pre-training approaches to stimulate the
potential of the pre-trained model to learn rich lin-
guistic knowledge: a Token Cosine Differentiation (TCD) objective and a Head Cosine Differentiation
(HCD) objective. Specifically, TCD attempts to
broaden the dissimilarity between tokens by min-
imizing the cosine similarities between different
last hidden states. Analogously, HCD attempts to
broaden the dissimilarity between heads by min-
imizing the cosine similarities between different
attention weights. We apply TCD and HCD as
two auxiliary regularizers in MLM pre-training,
which in turn guides the model to produce more
discriminative token representations and head rep-
resentations. Formally, we define our enhanced
pre-training task as
MLM with TCD and HCD (MTH).
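To make the two regularizers concrete, here is a minimal sketch (assumed shapes and equal loss weights; the paper's exact formulation may differ) of penalizing the average pairwise cosine similarity of last hidden states (TCD) and of flattened attention maps across heads (HCD) alongside the MLM loss.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(x: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity over all distinct pairs of rows in x (n, d)."""
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()                               # (n, n) cosine-similarity matrix
    n = x.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()   # drop the n self-similarities
    return off_diag / (n * (n - 1))

def mth_loss(mlm_loss, last_hidden, attn_weights, w_tcd=1.0, w_hcd=1.0):
    """MLM loss plus TCD/HCD regularizers for one sequence (hypothetical weights).

    last_hidden:  (seq_len, hidden) final-layer token states.
    attn_weights: (num_heads, seq_len * seq_len) flattened attention maps.
    """
    tcd = mean_pairwise_cosine(last_hidden)    # differentiate token representations
    hcd = mean_pairwise_cosine(attn_weights)   # differentiate attention heads
    return mlm_loss + w_tcd * tcd + w_hcd * hcd
```

Minimizing these cosine terms pushes token vectors and head attention patterns apart, which is the intended counterweight to the high consistency observed above.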
Extensive experiments on the GLUE benchmark show that DDRP achieves better results than the classical relative position encoding (Shaw et al., 2018) on almost all tasks without introducing additional computational overhead, and consistently outperforms prior competitive relative position encoding models (He et al., 2021; Ke et al., 2021). Moreover, our proposed MTH outperforms MLM by 0.96 points on the average GLUE score and achieves a nearly 2x pre-training speedup on BERT-Base. Both DDRP
and MTH are straightforward, effective, and easy
to deploy, and can be readily combined with ex-
isting pre-training objectives and various model
structures. Our contributions are summarized as
follows:
• We propose a novel relative position encoding named DDRP, which decouples relative distance and directional features, giving the model stronger prior knowledge, fewer parameters, and better results compared to conventional coupled position encodings.
• We analyze the trend of the self-similarity of last hidden states and attention weights during pre-training, and propose two novel objectives, Token Cosine Differentiation and Head Cosine Differentiation, which motivate the pre-trained Transformer to better capture semantics in PLMs.
• Through our proposed techniques (DDRP and MTH), we experimentally verify that decomposing heterogeneous information and extending representation diversity can significantly improve pre-trained language models. We also
analyze the characteristics of DDRP and MTH
in detail.
2 Related Work
In recent years, pre-trained language models have
made significant breakthroughs in the field of NLP.
BERT (Devlin et al., 2018), which introduces the MLM and NSP pre-training objectives, is pre-trained on a large-scale unlabeled corpus and learns bidirectional representations efficiently. Since then, many different pre-trained models have been produced, which further improve the effectiveness of pre-trained models. RoBERTa (Liu et al., 2019) proposes removing the NSP task and verifies through experiments that more training steps and larger batches can effectively improve performance on downstream tasks. ALBERT (Lan et al., 2020) proposes a Cross-Layer Parameter Sharing technique to lower memory consumption. XLNet