Better Pre-Training by Reducing Representation Confusion
Haojie Zhang1,2, Mingfei Liang1, Ruobing Xie1, Zhenlong Sun1, Bo Zhang1, Leyu Lin1
1WeChat Search Application Department, Tencent, China
2Peking University, China
1{coldhjzhang, aesopliang, ruobingxie, richardsun, nevinzhang, goshawklin}@tencent.com
2zhanghaojie@stu.pku.edu.cn
Abstract
In this work, we revisit Transformer-based pre-trained language models and identify two different types of information confusion, in position encoding and in model representations, respectively. First, we show that in relative position encoding, jointly modeling relative distances and directions mixes two kinds of heterogeneous information. This may prevent the model from capturing the associative semantics of the same distance and the opposite directions, which in turn hurts the performance of downstream tasks. Second, we notice that BERT pre-trained with the Masked Language Modeling (MLM) objective outputs similar token representations (last hidden states of different tokens) and head representations (attention weights[1] of different heads), which may limit the diversity of information expressed by different tokens and heads. Motivated by the above investigation, we propose two novel techniques to improve pre-trained language models: Decoupled Directional Relative Position (DDRP) encoding and the MTH[2] pre-training objective. DDRP decouples the relative distance features and the directional features in classical relative position encoding. MTH applies two novel auxiliary regularizers besides MLM to enlarge the dissimilarities between (a) last hidden states of different tokens, and (b) attention weights of different heads. These designs allow the model to capture different categories of information more clearly, so as to alleviate information confusion in representation learning for better optimization. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of our proposed methods.
*Equal contribution.
[1] "Attention weights" mainly refer to the dot product between Key and Query in the self-attention module.
[2] MTH is the abbreviation of our proposed MLM with Token Cosine Differentiation (TCD) and Head Cosine Differentiation (HCD) pre-training task. TCD and HCD are described in detail in Sec. 1 (2) and Sec. 3.2.
1 Introduction
The paradigm of pre-training on large-scale corpora and fine-tuning on specific task datasets has swept the entire field of Natural Language Processing (NLP). BERT (Devlin et al., 2018) is the most prominent pre-trained language model; it stacks the encoder blocks of the Transformer (Vaswani et al., 2017) and adopts the MLM and Next Sentence Prediction (NSP) pre-training tasks, achieving state-of-the-art results in 2018. After that, a large number of Pre-trained Language Models (PLMs) (Liu et al., 2019; Lan et al., 2020; Raffel et al., 2019; Clark et al., 2020; He et al., 2021) that optimize the Transformer structure and pre-training objectives have emerged, which further improve the performance of pre-trained language models on multiple downstream tasks. In this work, we identify two different types of information confusion in language pre-training, and explore two conceptually simple and empirically powerful techniques against them, as follows:
(1) Decoupled Directional Relative Position (DDRP) Encoding. It is well known that relative position encoding is competitive and has been widely used in real PLMs (Shaw et al., 2018; Yang et al., 2019; Wei et al., 2019; Raffel et al., 2019; Su et al., 2021; He et al., 2021; Ke et al., 2021). Despite its great performance, we notice that relative position encoding methods utilize completely separate parametric vectors to encode different relative position information, which means that every single parametric vector needs to learn both distance and directional features. We consider this paradigm of utilizing a single parametric vector to represent both relative distance and direction a kind of information confusion, and question its rationality. Since relative distance features and directional features are apparently heterogeneous information that reflects different aspects of positional information, we argue that existing methods may have difficulty explicitly establishing connections between parametric vectors of the same distance and opposite directions, which in turn results in serious information loss in position encoding. Inspired by this, we propose a novel Decoupled Directional Relative Position (DDRP) encoding. In detail, DDRP decomposes the classical relative position embedding (Shaw et al., 2018) into two embeddings, one storing the relative distance features and the other storing the directional features, and then multiplies the two together explicitly to derive the final decoupled relative position embedding, allowing the originally confused distance and directional information to be as distinguishable as possible.
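To make the decoupling concrete, here is a minimal PyTorch sketch of how such an embedding could be constructed. The module name, the three-way direction buckets, and the element-wise product used to combine the two tables are our illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DecoupledRelativePosition(nn.Module):
    """DDRP-style sketch: distance and direction live in two separate tables
    and are combined explicitly (here by element-wise product, an assumption)."""

    def __init__(self, max_distance: int, head_dim: int):
        super().__init__()
        # One vector per absolute distance 0..max_distance (shared by both sides).
        self.distance_emb = nn.Embedding(max_distance + 1, head_dim)
        # Three direction buckets: left of i, same position, right of i.
        self.direction_emb = nn.Embedding(3, head_dim)
        self.max_distance = max_distance

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[:, None] - pos[None, :]                  # (S, S) signed offsets i - j
        dist = rel.abs().clamp(max=self.max_distance)      # distance index |i - j|
        direction = torch.sign(rel) + 1                    # {-1, 0, 1} -> {0, 1, 2}
        # Combine the two heterogeneous features into one positional vector.
        return self.distance_emb(dist) * self.direction_emb(direction)  # (S, S, d)


# Example: a (128, 128, 64) tensor of decoupled relative position vectors.
# rel_emb = DecoupledRelativePosition(max_distance=64, head_dim=64)(seq_len=128)
```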
(2) Model Representation Differentiations. We find that there is non-negligible confusion in the representations of pre-trained BERT, as evidenced by the high consistency of last hidden states across different tokens and of attention weights across different heads, respectively. Similar last hidden states introduce the anisotropy problem (Mimno and Thompson, 2017), which confines the token vectors to a narrow representation space and thus makes it more difficult for the model to capture deep semantics. Considering that attention weights contain rich linguistic knowledge (Clark et al., 2019; Jawahar et al., 2019), we argue that high consistency in attention weights also constrains the ability of the model to capture multi-aspect information. To alleviate the representation confusion between different tokens and heads caused by high information overlap, we propose two novel pre-training approaches to stimulate the potential of the pre-trained model to learn rich linguistic knowledge: the Token Cosine Differentiation (TCD) objective and the Head Cosine Differentiation (HCD) objective. Specifically, TCD attempts to broaden the dissimilarity between tokens by minimizing the cosine similarities between different last hidden states. In contrast, HCD attempts to broaden the dissimilarity between heads by minimizing the cosine similarities between different attention weights. We apply TCD and HCD as two auxiliary regularizers in MLM pre-training, which in turn guides the model to produce more discriminative token representations and head representations. Formally, we define our enhanced pre-training task as MLM with TCD and HCD (MTH).
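As an illustration, the following PyTorch sketch shows cosine-similarity regularizers in the spirit of TCD and HCD; the averaging over off-diagonal pairs and the loss weights in the final comment are our assumptions, not the formulation reported in the paper.

```python
import torch
import torch.nn.functional as F


def mean_pairwise_cosine(x: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity over all distinct pairs of rows in x of shape (N, D)."""
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()                              # (N, N) cosine similarities
    n = x.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()  # drop self-similarities
    return off_diag / (n * (n - 1))


def tcd_loss(last_hidden: torch.Tensor) -> torch.Tensor:
    """TCD-style regularizer: push last hidden states of different tokens apart.
    last_hidden: (seq_len, hidden_dim) for one sequence."""
    return mean_pairwise_cosine(last_hidden)


def hcd_loss(attn_weights: torch.Tensor) -> torch.Tensor:
    """HCD-style regularizer: push attention weights of different heads apart.
    attn_weights: (num_heads, seq_len, seq_len) for one layer."""
    flat = attn_weights.flatten(start_dim=1)     # treat each head's map as one vector
    return mean_pairwise_cosine(flat)


# Joint objective (the lambda weights are hypothetical):
# loss = mlm_loss + 0.1 * tcd_loss(hidden) + 0.1 * hcd_loss(attn)
```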
Extensive experiments on the GLUE benchmark show that DDRP achieves better results than classical relative position encoding (Shaw et al., 2018) on almost all tasks without introducing additional computational overhead, and consistently outperforms prior competitive relative position encoding models (He et al., 2021; Ke et al., 2021). Moreover, our proposed MTH outperforms MLM by 0.96 average GLUE score and achieves nearly 2x pre-training speedup on BERT-Base. Both DDRP and MTH are straightforward, effective, and easy to deploy, and can be readily combined with existing pre-training objectives and various model structures. Our contributions are summarized as follows:
• We propose a novel relative position encoding named DDRP, which decouples the relative distance and directional features, giving the model stronger prior knowledge, fewer parameters, and better results compared to conventional coupled position encodings.

• We analyze the trend of self-similarity of last hidden states and attention weights during pre-training, and propose two novel objectives, Token Cosine Differentiation and Head Cosine Differentiation, motivating the pre-trained Transformer to better capture semantics in PLMs.

• Through our proposed techniques (DDRP and MTH), we experimentally verify that decomposing heterogeneous information and extending representation diversity can significantly improve pre-trained language models. We also analyze the characteristics of DDRP and MTH in detail.
2 Related Work
In recent years, pre-trained language models have made significant breakthroughs in the field of NLP. BERT (Devlin et al., 2018), which proposes the MLM and NSP pre-training objectives, is pre-trained on a large-scale unlabeled corpus and learns bidirectional representations efficiently. After that, many different pre-trained models have been produced, which further improve the effectiveness of pre-trained models. RoBERTa (Liu et al., 2019) proposes to remove the NSP task and verifies through experiments that more training steps and larger batches can effectively improve performance on downstream tasks. ALBERT (Lan et al., 2020) proposes a Cross-Layer Parameter Sharing technique to lower memory consumption. XLNet (Yang et al., 2019) proposes Permutation Language Modeling to capture the dependencies among predicted tokens. ELECTRA (Clark et al., 2020) adopts the Replaced Token Detection (RTD) objective, which considers the loss of all tokens instead of a subset. TUPE (Ke et al., 2021) performs the Query-Key dot product with different parameter projections for contextual information and positional information separately and then adds them up; it also adds relative position biases, as in T5 (Raffel et al., 2019), on different heads to form the final correlation matrix. DeBERTa (He et al., 2021) separately encodes the context and position information of each token and uses the textual and positional disentangled matrices of the words to calculate the correlation matrix.
3 Method
In this section, we analyze in turn two different types of information confusion that exist in real PLMs: (i) the paradigm of utilizing a single parametric vector of relative position embedding to represent both relative distance and direction; (ii) the high similarity and overlap in model representations. Based on the above two investigations, we propose two techniques, Decoupled Directional Relative Position (DDRP) encoding and MLM with TCD and HCD (MTH), respectively, to help PLMs alleviate information confusion and enhance representation clarity and diversity.
3.1 Decoupled Directional Relative Position (DDRP) Encoding
We first introduce DDRP by formulating the multi-head attention module of BERT and BERT-R (Shaw et al., 2018). Specifically, BERT formulates multi-head attention for a specific head as follows:

$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V, \tag{1}$$
$$A = \frac{QK^{T}}{\sqrt{d}}, \tag{2}$$
$$Z = \mathrm{softmax}(A)\,V, \tag{3}$$

where $H \in \mathbb{R}^{S \times D}$ represents the input hidden states; $W_Q, W_K, W_V \in \mathbb{R}^{D \times d}$ represent the projection matrices of Query, Key, and Value, respectively; $A \in \mathbb{R}^{S \times S}$ represents the attention weight; $Z \in \mathbb{R}^{S \times d}$ represents the single-head output hidden states of the self-attention module; $S$ represents the input sequence length; $D$ represents the dimension of the input hidden states; and $d$ represents the dimension of the single-head hidden states.

Unlike BERT, which adds the absolute position embedding to the word embedding as the final input of the model, BERT-R first applies relative position encoding. It adds a relative position embedding to $K$ in the self-attention module of each layer to allow a more direct interaction. Its formulation is as follows:

$$A_{i,j} = \frac{Q_i \left(K_j + K^{r}_{\sigma(i,j)}\right)^{T}}{\sqrt{d}}, \tag{4}$$
$$\sigma(i,j) = \mathrm{clip}(i - j) + r_s, \tag{5}$$

where $Q_i$ represents the Query vector at the $i$-th position; $K_j$ represents the Key vector at the $j$-th position; $r_s$ represents the maximum relative position distance; $\sigma(i,j)$ represents the index into the relative position embedding $K^{r} \in \mathbb{R}^{2r_s \times d}$; and the relative position embedding for $K$ is shared across all heads. Note that Shaw et al. (2018) have experimentally demonstrated that adding a relative position embedding to the interaction between $A$ and $V$ gives no further improvement in effectiveness, so the relative position embedding in the $V$ space is eliminated in all our experiments to reduce the computational overhead.
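For reference, a minimal PyTorch sketch of the relative-position attention in Eqs. (4)-(5) follows; the clipping range, the single-sequence (unbatched) shapes, and the function name are our assumptions for illustration rather than a reproduction of Shaw et al.'s implementation.

```python
import torch
import torch.nn.functional as F


def relative_attention(Q, K, V, rel_emb, r_s):
    """Single-head attention with a Shaw-style relative position embedding on K.

    Q, K, V: (S, d) projected hidden states.
    rel_emb: (2 * r_s, d) relative position embedding K^r.
    """
    S, d = Q.shape
    pos = torch.arange(S)
    # Signed offsets clipped to [-r_s, r_s - 1], then shifted to valid table indices.
    sigma = (pos[:, None] - pos[None, :]).clamp(-r_s, r_s - 1) + r_s   # (S, S)
    K_r = rel_emb[sigma]                                               # (S, S, d)

    content = Q @ K.t()                                # Q_i . K_j
    position = torch.einsum("id,ijd->ij", Q, K_r)      # Q_i . K^r_{sigma(i,j)}
    A = (content + position) / d ** 0.5                # Eq. (4)
    return F.softmax(A, dim=-1) @ V                    # Eq. (3) with the relative term
```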
Compared with BERT, BERT-R models the correlation between words and positions more explicitly, and thus further expands the expression diversity between words. However, we notice that in BERT-R, vectors for the same distance on the left and right sides are encoded in isolation (as shown in Figure 1(a)), which means that every single parametric vector from $K^{r}$ is forced to maintain distance and direction, two different types of information. Since directional information is confirmed to be crucial in language modeling (Vu et al., 2016; Fuller, 2002; Shen et al., 2018), we argue that such an approach causes unnecessary information confusion and faces several constraints: (i) mixing relative distance and directional information entangles information that originally lives in different spaces, which in turn makes the learning of the parametric vectors more difficult; (ii) dot products between word vectors and directionally confused positional vectors bring unnecessary randomness into deep bidirectional representation models.
To alleviate the confusion of distance and direction that exists in BERT-R and allow the model to perceive distances and directions more clearly, we propose a novel Decoupled Directional Relative Position (DDRP) encoding. Specifically, DDRP decouples the relative distance and directional information and maintains them with two different embeddings.