Better Pre-Training by Reducing Representation Confusion
Haojie Zhang1,2, Mingfei Liang1, Ruobing Xie1, Zhenlong Sun1, Bo Zhang1, Leyu Lin1
1WeChat Search Application Department, Tencent, China
2Peking University, China
1{coldhjzhang, aesopliang, ruobingxie, richardsun, nevinzhang, goshawklin}@tencent.com
2zhanghaojie@stu.pku.edu.cn
Abstract
In this work, we revisit Transformer-based pre-trained language models and identify two different types of information confusion, in position encoding and in model representations, respectively. First, we show that in relative position encoding, jointly modeling relative distances and directions mixes two kinds of heterogeneous information. This may prevent the model from capturing the associative semantics of the same distance and the opposite directions, which in turn hurts the performance of downstream tasks. Second, we notice that BERT pre-trained with the Masked Language Modeling (MLM) objective outputs similar token representations (last hidden states of different tokens) and head representations (attention weights[1] of different heads), which may limit the diversity of information expressed by different tokens and heads. Motivated by the above investigation, we propose two novel techniques to improve pre-trained language models: Decoupled Directional Relative Position (DDRP) encoding and the MTH[2] pre-training objective. DDRP decouples the relative distance features and the directional features in classical relative position encoding. MTH applies two novel auxiliary regularizers besides MLM to enlarge the dissimilarities between (a) last hidden states of different tokens, and (b) attention weights of different heads. These designs allow the model to capture different categories of information more clearly, so as to alleviate information confusion in representation learning for better optimization. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of our proposed methods.
*Equal contribution.
[1] "Attention weights" mainly refer to the dot product between Key and Query in the self-attention module.
[2] MTH is the abbreviation of our proposed MLM with Token Cosine Differentiation (TCD) and Head Cosine Differentiation (HCD) pre-training task. TCD and HCD are described in detail in Sec. 1 (2) and Sec. 3.2.
1 Introduction
The paradigm of pre-training on large-scale corpora and fine-tuning on specific task datasets has swept the entire field of Natural Language Processing (NLP). BERT (Devlin et al., 2018) is the most prominent pre-trained language model; it stacks the encoder blocks of the Transformer (Vaswani et al., 2017) and adopts the MLM and Next Sentence Prediction (NSP) pre-training tasks, achieving state-of-the-art results in 2018. After that, a large number of Pre-trained Language Models (PLMs) (Liu et al., 2019; Lan et al., 2020; Raffel et al., 2019; Clark et al., 2020; He et al., 2021) that optimize the Transformer structure and pre-training objectives have emerged, which further improve the performance of pre-trained language models on multiple downstream tasks. In this work, we identify two different types of information confusion in language pre-training, and explore two conceptually simple and empirically powerful techniques against them, as follows:
(1) Decoupled Directional Relative Position (DDRP) Encoding. It is well known that relative position encoding is competitive and has been widely used in real PLMs (Shaw et al., 2018; Yang et al., 2019; Wei et al., 2019; Raffel et al., 2019; Su et al., 2021; He et al., 2021; Ke et al., 2021). Despite its great performance, we notice that relative position encoding methods utilize completely separate parametric vectors to encode different relative position information, which means that every single parametric vector needs to learn both distance and directional features. We consider this paradigm of utilizing a single parametric vector to represent both relative distance and direction a kind of information confusion, and question its rationality. Since relative distance features and directional features are apparently heterogeneous information that reflects different aspects of positional information, we argue that existing methods may have difficulty explicitly establishing connections between parametric vectors of the same distance and opposite directions, which in turn results in serious information loss in position encoding. Inspired by this, we propose a novel Decoupled Directional Relative Position (DDRP) encoding. In detail, DDRP decomposes the classical relative position embedding (Shaw et al., 2018) into two embeddings, one storing the relative distance features and the other storing the directional features, and then multiplies the two together explicitly to derive the final decoupled relative position embedding, allowing the originally confused distance and directional information to be as distinguishable as possible.
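To make the decoupling concrete, here is a minimal PyTorch sketch of how such an embedding could be constructed. The module name, the three-way direction buckets, and the element-wise product used to combine the two tables are our illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DecoupledRelativePosition(nn.Module):
    """DDRP-style sketch: distance and direction live in two separate tables
    and are combined explicitly (here by element-wise product, an assumption)."""

    def __init__(self, max_distance: int, head_dim: int):
        super().__init__()
        # One vector per absolute distance 0..max_distance (shared by both sides).
        self.distance_emb = nn.Embedding(max_distance + 1, head_dim)
        # Three direction buckets: left of i, same position, right of i.
        self.direction_emb = nn.Embedding(3, head_dim)
        self.max_distance = max_distance

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[:, None] - pos[None, :]                  # (S, S) signed offsets i - j
        dist = rel.abs().clamp(max=self.max_distance)      # distance index |i - j|
        direction = torch.sign(rel) + 1                    # {-1, 0, 1} -> {0, 1, 2}
        # Combine the two heterogeneous features into one positional vector.
        return self.distance_emb(dist) * self.direction_emb(direction)  # (S, S, d)


# Example: a (128, 128, 64) tensor of decoupled relative position vectors.
# rel_emb = DecoupledRelativePosition(max_distance=64, head_dim=64)(seq_len=128)
```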
(2) Model Representation Differentiations. We find that there is non-negligible confusion in the representations of pre-trained BERT, as evidenced by the high consistency of last hidden states across different tokens and of attention weights across different heads, respectively. Similar last hidden states introduce the anisotropy problem (Mimno and Thompson, 2017), which confines the token vectors to a narrow representation space and thus makes it more difficult for the model to capture deep semantics. Considering that attention weights contain rich linguistic knowledge (Clark et al., 2019; Jawahar et al., 2019), we argue that high consistency in attention weights also constrains the ability of the model to capture multi-aspect information. To alleviate the representation confusion between different tokens and heads caused by high information overlap, we propose two novel pre-training approaches to stimulate the potential of the pre-trained model to learn rich linguistic knowledge: the Token Cosine Differentiation (TCD) objective and the Head Cosine Differentiation (HCD) objective. Specifically, TCD attempts to broaden the dissimilarity between tokens by minimizing the cosine similarities between different last hidden states. In contrast, HCD attempts to broaden the dissimilarity between heads by minimizing the cosine similarities between different attention weights. We apply TCD and HCD as two auxiliary regularizers in MLM pre-training, which in turn guides the model to produce more discriminative token representations and head representations. Formally, we define our enhanced pre-training task as MLM with TCD and HCD (MTH).
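As an illustration, the following PyTorch sketch shows cosine-similarity regularizers in the spirit of TCD and HCD; the averaging over off-diagonal pairs and the loss weights in the final comment are our assumptions, not the formulation reported in the paper.

```python
import torch
import torch.nn.functional as F


def mean_pairwise_cosine(x: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity over all distinct pairs of rows in x of shape (N, D)."""
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()                              # (N, N) cosine similarities
    n = x.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()  # drop self-similarities
    return off_diag / (n * (n - 1))


def tcd_loss(last_hidden: torch.Tensor) -> torch.Tensor:
    """TCD-style regularizer: push last hidden states of different tokens apart.
    last_hidden: (seq_len, hidden_dim) for one sequence."""
    return mean_pairwise_cosine(last_hidden)


def hcd_loss(attn_weights: torch.Tensor) -> torch.Tensor:
    """HCD-style regularizer: push attention weights of different heads apart.
    attn_weights: (num_heads, seq_len, seq_len) for one layer."""
    flat = attn_weights.flatten(start_dim=1)     # treat each head's map as one vector
    return mean_pairwise_cosine(flat)


# Joint objective (the lambda weights are hypothetical):
# loss = mlm_loss + 0.1 * tcd_loss(hidden) + 0.1 * hcd_loss(attn)
```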
Extensive experiments on the GLUE benchmark show that DDRP achieves better results than classical relative position encoding (Shaw et al., 2018) on almost all tasks without introducing additional computational overhead, and consistently outperforms prior competitive relative position encoding models (He et al., 2021; Ke et al., 2021). Moreover, our proposed MTH outperforms MLM by 0.96 average GLUE score and achieves nearly 2x pre-training speedup on BERT-Base. Both DDRP and MTH are straightforward, effective, and easy to deploy, and can be readily combined with existing pre-training objectives and various model structures. Our contributions are summarized as follows:
• We propose a novel relative position encoding named DDRP, which decouples the relative distance and directional features, giving the model stronger prior knowledge, fewer parameters, and better results compared to conventional coupled position encodings.

• We analyze the trend of self-similarity of last hidden states and attention weights during pre-training, and propose two novel objectives, Token Cosine Differentiation and Head Cosine Differentiation, motivating the pre-trained Transformer to better capture semantics in PLMs.

• Through our proposed techniques (DDRP and MTH), we experimentally verify that decomposing heterogeneous information and extending representation diversity can significantly improve pre-trained language models. We also analyze the characteristics of DDRP and MTH in detail.
2 Related Work
In recent years, pre-trained language models have made significant breakthroughs in the field of NLP. BERT (Devlin et al., 2018), which proposes the MLM and NSP pre-training objectives, is pre-trained on a large-scale unlabeled corpus and learns bidirectional representations efficiently. After that, many different pre-trained models have been produced, which further improve the effectiveness of pre-trained models. RoBERTa (Liu et al., 2019) proposes to remove the NSP task and verifies through experiments that more training steps and larger batches can effectively improve performance on downstream tasks. ALBERT (Lan et al., 2020) proposes a Cross-Layer Parameter Sharing technique to lower memory consumption. XLNet (Yang et al., 2019) proposes Permutation Language Modeling to capture the dependencies among predicted tokens. ELECTRA (Clark et al., 2020) adopts the Replaced Token Detection (RTD) objective, which considers the loss of all tokens instead of a subset. TUPE (Ke et al., 2021) performs the Query-Key dot product with different parameter projections for contextual information and positional information separately and then adds them up; it also adds relative position biases, as in T5 (Raffel et al., 2019), on different heads to form the final correlation matrix. DeBERTa (He et al., 2021) separately encodes the context and position information of each token and uses the textual and positional disentangled matrices of the words to calculate the correlation matrix.
3 Method
In this section, we analyze in turn two different types of information confusion that exist in real PLMs: (i) the paradigm of utilizing a single parametric vector of relative position embedding to represent both relative distance and direction; (ii) the high similarity and overlap in model representations. Based on the above two investigations, we propose two techniques, Decoupled Directional Relative Position (DDRP) encoding and MLM with TCD and HCD (MTH), respectively, to help PLMs alleviate information confusion and enhance representation clarity and diversity.
3.1 Decoupled Directional Relative Position (DDRP) Encoding
We first introduce DDRP by formulating the multi-head attention module of BERT and BERT-R (Shaw et al., 2018). Specifically, BERT formulates multi-head attention for a specific head as follows:

$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V, \tag{1}$$
$$A = \frac{QK^{T}}{\sqrt{d}}, \tag{2}$$
$$Z = \mathrm{softmax}(A)\,V, \tag{3}$$

where $H \in \mathbb{R}^{S \times D}$ represents the input hidden states; $W_Q, W_K, W_V \in \mathbb{R}^{D \times d}$ represent the projection matrices of Query, Key, and Value, respectively; $A \in \mathbb{R}^{S \times S}$ represents the attention weight; $Z \in \mathbb{R}^{S \times d}$ represents the single-head output hidden states of the self-attention module; $S$ represents the input sequence length; $D$ represents the dimension of the input hidden states; and $d$ represents the dimension of the single-head hidden states.

Unlike BERT, which adds the absolute position embedding to the word embedding as the final input of the model, BERT-R first applies relative position encoding. It adds a relative position embedding to $K$ in the self-attention module of each layer to allow a more direct interaction. Its formulation is as follows:

$$A_{i,j} = \frac{Q_i \left(K_j + K^{r}_{\sigma(i,j)}\right)^{T}}{\sqrt{d}}, \tag{4}$$
$$\sigma(i,j) = \mathrm{clip}(i - j) + r_s, \tag{5}$$

where $Q_i$ represents the Query vector at the $i$-th position; $K_j$ represents the Key vector at the $j$-th position; $r_s$ represents the maximum relative position distance; $\sigma(i,j)$ represents the index into the relative position embedding $K^{r} \in \mathbb{R}^{2r_s \times d}$; and the relative position embedding for $K$ is shared across all heads. Note that Shaw et al. (2018) have experimentally demonstrated that adding a relative position embedding to the interaction between $A$ and $V$ gives no further improvement in effectiveness, so the relative position embedding in the $V$ space is eliminated in all our experiments to reduce the computational overhead.
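For reference, a minimal PyTorch sketch of the relative-position attention in Eqs. (4)-(5) follows; the clipping range, the single-sequence (unbatched) shapes, and the function name are our assumptions for illustration rather than a reproduction of Shaw et al.'s implementation.

```python
import torch
import torch.nn.functional as F


def relative_attention(Q, K, V, rel_emb, r_s):
    """Single-head attention with a Shaw-style relative position embedding on K.

    Q, K, V: (S, d) projected hidden states.
    rel_emb: (2 * r_s, d) relative position embedding K^r.
    """
    S, d = Q.shape
    pos = torch.arange(S)
    # Signed offsets clipped to [-r_s, r_s - 1], then shifted to valid table indices.
    sigma = (pos[:, None] - pos[None, :]).clamp(-r_s, r_s - 1) + r_s   # (S, S)
    K_r = rel_emb[sigma]                                               # (S, S, d)

    content = Q @ K.t()                                # Q_i . K_j
    position = torch.einsum("id,ijd->ij", Q, K_r)      # Q_i . K^r_{sigma(i,j)}
    A = (content + position) / d ** 0.5                # Eq. (4)
    return F.softmax(A, dim=-1) @ V                    # Eq. (3) with the relative term
```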
Compared with BERT, BERT-R models the correlation between words and positions more explicitly, and thus further expands the expression diversity between words. However, we notice that in BERT-R, vectors for the same distance on the left and right sides are encoded in isolation (as shown in Figure 1(a)), which means that every single parametric vector from $K^{r}$ is forced to maintain distance and direction, two different types of information. Since directional information is confirmed to be crucial in language modeling (Vu et al., 2016; Fuller, 2002; Shen et al., 2018), we argue that such an approach causes unnecessary information confusion and faces several constraints: (i) mixing relative distance and directional information entangles information that originally lives in different spaces, which in turn makes the learning of the parametric vectors more difficult; (ii) dot products between word vectors and directionally confused positional vectors bring unnecessary randomness into deep bidirectional representation models.
To alleviate the confusion of distance and direction that exists in BERT-R and allow the model to perceive distances and directions more clearly, we propose a novel Decoupled Directional Relative Position (DDRP) encoding. Specifically, DDRP decouples the relative distance and directional information and maintains them with two different embeddings.