Bird-Eye Transformers for Text Generation Models
Lei Sha, Yuhang Song, Yordan Yordanov, Tommaso Salvatori, Thomas Lukasiewicz
Department of Computer Science, University of Oxford, Oxford, UK
firstname.lastname@cs.ox.ac.uk
Abstract
Transformers have become an indispensable module for text generation models since their great success in machine translation. Previous works attribute the success of transformers to the query-key-value dot-product attention, which provides a robust inductive bias via fully connected token graphs. However, we found that self-attention has a severe limitation: when predicting the (i+1)-th token, self-attention only takes the i-th token as an information collector, and it tends to assign high attention weights to tokens that are similar to itself. Therefore, most of the historical information that occurred before the i-th token is not taken into consideration. Based on this observation, in this paper, we propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers by reweighting self-attention to encourage it to focus more on important historical information. We have conducted experiments on multiple text generation tasks, including machine translation (2 datasets) and language modeling (3 datasets). The experimental results show that our proposed model achieves better performance than the baseline transformer architectures on all datasets. The code is released at: https://sites.google.com/view/bet-transformer/home.
1 Introduction
The successful application of transformers (Vaswani et al., 2017) in machine translation shows that they are a much better choice for sequence modeling than recurrent architectures like RNNs (Rumelhart et al., 1986) and LSTMs (Hochreiter and Schmidhuber, 1997). The core of transformers is self-attention, a dot-product-based token-by-token correlation computation module. Compared to the previously popular recurrent architectures, self-attention directly builds connections between tokens, which also captures long-range dependencies.
However, we found that self-attention has a severe disadvantage: the self-attention module focuses too much on the current token and fails to provide specific attention to “high-level” historical tokens. “High-level” tokens are tokens that can influence other tokens over a long distance, for example, tokens located close to the to-be-predicted token in the dependency parse tree. Consider the sentence “... cat catch a mouse ...”, where the current token is “a” and we are predicting the token “mouse”. In the self-attention module, “a” is taken as an information collector to compute the attention weights with the other tokens via the dot product. As a result, the next token “mouse” is mostly predicted from the information of the token “a”, while more important information that occurred earlier (such as “cat” and “catch”) is not sufficiently considered.
To tackle the above disadvantage of self-attention, in this paper, we propose to encourage the attention weights to focus more on the “high-level” tokens by some syntax guidance. We propose a novel architecture, called bird-eye transformer (BET), which provides transformers with a bird-eye view of all the historical tokens. BET has two alternative architectures to achieve this goal: (1) a syntax-guided transformer architecture, called BET(SG), which takes syntax hints from the dependency parse tree and uses them to reweight the self-attention's dot-product matrix; and (2) a syntax-guidance-free transformer architecture, called BET(SF), which does not use any syntax hints. To provide a bird-eye view, BET(SF) reweights the self-attention's dot-product matrix towards attending to high-level tokens, using a function that decides which tokens are high-level according to the input sequence and the self-attention's output sequence; the architecture is thus expected to induce the high-level information by itself.
In addition, we show that the attention weight on the current token (in the dot-product matrix) should be redistributed to the other historical tokens. This modification does not lose the current token's information, because that information is added back in the residual connection (He et al., 2016) of the transformer.
Finally, the refined self-attention matrix is obtained by applying the Softmax to the reweighted dot-product matrix. The main contributions of this paper are briefly summarized as follows:
• We point out a severe disadvantage in the self-attention of transformers, and we report detailed experimental results that demonstrate this disadvantage.
• We propose a novel bird-eye transformer architecture, which refines the self-attention in transformers to provide it with a bird-eye view of the historical information in sequence modeling.
• We conduct experiments on multiple tasks and compare against several natural baselines to demonstrate the effectiveness of our proposed bird-eye transformer.
2 Background
Transformers were proposed by Vaswani et al. (2017) for machine translation, using a series of transformer blocks in both the encoder and the decoder. Each transformer block contains a self-attention layer, a feed-forward layer, multiple skip connections, and layer normalizations. Self-attention is the core component of transformers. Given an input sequence X = (x_1, \ldots, x_n) of token representations, we first use three linear transformations to map X into three matrices: queries Q, keys K, and values V. Then, self-attention is calculated as follows:
Q = X W_Q, \quad K = X W_K, \quad V = X W_V,
D = Q K^\top / \sqrt{d}, \quad A(Q, K, V) = \mathrm{Softmax}(D)\, V,    (1)

where d is the vector dimension of each token's representation, W_Q, W_K, and W_V are trainable parameters, and D is the dot-product matrix.
Each transformer block has two sub-blocks: one for self-attention and one for a feed-forward layer. The outputs of the self-attention and the feed-forward operation are each added to their inputs via a residual connection (He et al., 2016), followed by layer normalization:

X' = \mathrm{LayerNorm}\big(X + A(Q, K, V)\big),    (2)
H = \mathrm{LayerNorm}\big(X' + \mathrm{FFL}(X')\big),    (3)

where FFL stands for the feed-forward layer.
In text generation, when the transformer block is used as a decoder, the self-attention matrix is multiplied with a triangular mask to prevent each token from attending to subsequent tokens.
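For concreteness, Eqs. (1)–(3) together with the triangular mask can be sketched in NumPy as follows. This is only an illustrative re-implementation of the standard formulas above: layer normalization and the feed-forward layer are reduced to minimal stand-ins, biases are omitted, and all parameter names are ours.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def masked_self_attention(X, W_Q, W_K, W_V):
    # Eq. (1): queries, keys, values, and the scaled dot-product matrix D.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    D = Q @ K.T / np.sqrt(d)
    # Triangular mask: token i must not attend to subsequent tokens j > i.
    n = D.shape[0]
    D = np.where(np.tril(np.ones((n, n), dtype=bool)), D, -np.inf)
    return softmax(D, axis=-1) @ V

def transformer_block(X, W_Q, W_K, W_V, W_1, W_2):
    # Eq. (2): residual connection around self-attention, then LayerNorm.
    X_prime = layer_norm(X + masked_self_attention(X, W_Q, W_K, W_V))
    # Eq. (3): residual connection around a minimal feed-forward layer (FFL).
    return layer_norm(X_prime + np.maximum(X_prime @ W_1, 0.0) @ W_2)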
3 Approach
3.1 Motivation
Intuitively, the self-attention module should focus on some “high-level” tokens instead of focusing too much on the current token.
Assume that we are predicting the (i+1)-th token given the previous tokens x_0, \ldots, x_i. According to Eq. (1), the i-th row of the dot-product matrix D is calculated as:
D_i = \Big[ \tfrac{1}{\sqrt{d}}\, x_i W_Q W_K^\top x_0^\top, \; \ldots, \; \tfrac{1}{\sqrt{d}}\, x_i W_Q W_K^\top x_i^\top, \; [M], \; \ldots, \; [M] \Big],    (4)
where “[M]” stands for the masked-out elements that correspond to future tokens. The dot-product matrix D measures the relevance between tokens. Intuitively, since a token is always more similar to itself than to other tokens, the i-th item x_i W_Q W_K^\top x_i^\top is expected to be the largest entry in Eq. (4).
However, the i-th token does not actually need to attend to itself, because in the transformer architecture the information of the i-th token is added to the collected feature vector by the residual connection (He et al., 2016), as shown in Eq. (2). If we mask out the diagonal values of the self-attention matrix and redistribute that attention to the historical tokens, then the historical tokens obtain more attention, while the current token's contribution is still kept by the residual connection.
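As an illustration of this idea (our own sketch, not the paper's implementation, reusing the numpy import and the softmax helper from the sketch in Section 2), masking the diagonal of an already causally masked dot-product matrix and renormalizing splits the attention mass that would have gone to the current token among the historical tokens; how the first token, which has no history, is handled is our assumption.

def diagonal_free_attention(D, V):
    # D: (n, n) dot-product matrix, already masked with -inf above the diagonal.
    n = D.shape[0]
    D = D.copy()
    D[np.arange(n), np.arange(n)] = -np.inf   # mask the diagonal as well
    D[0, 0] = 0.0   # assumption: the first token keeps attending to itself
    A = softmax(D, axis=-1)                   # renormalize over historical tokens
    return A @ V, A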
3.2 Bird-Eye Transformers
Text generation models tend to use the information of historical tokens to predict the next token. However, the importance of historical tokens is not always the same. From a linguistic point of view, natural language is constructed by a set of syntax rules (Chomsky, 1956, 2014), which form a tree-like structure. So, according to their position in the syntax tree, tokens can be roughly divided into two types: high-level tokens and low-level tokens. High-level tokens usually contain general information about the current sentence and are able to affect tokens that are far away from them, while low-level tokens can only affect nearby tokens. Therefore, paying more attention to the high-level tokens promises to contribute to a better prediction of the next token in generation models, as shown in LSTM-based works (Shen et al., 2018; Sha et al., 2018).

Figure 1: Comparison between the conventional self-attention module and the syntax-guided self-attention module: (a) self-attention, (b) self-attention with a syntax hint, (c) the heuristic way of obtaining the syntax hint. For each token w, the syntax hint is another token that is an ancestor of w in the dependency tree and occurs before w in the sentence. If no such token exists, then the syntax hint of w is w itself.
We take one step further and propose a novel architecture, called bird-eye transformer (BET). This transformer architecture refines the self-attention weights to encourage the model to focus on more informative historical tokens.
Syntax-guided BET. Since the syntax features of the current token usually provide useful hints for the next token, we propose to use syntax hints to guide the attention weights towards the “high-level” tokens; this variant is named BET(SG). The guidance by syntax hints is provided through a pointer loss, as shown in Fig. 1(b).
The syntax hints are taken directly from the dependency parse tree. If x_{t+1} is the token to be predicted, then we take its nearest ancestor node in the dependency parse tree that also occurs to the left of x_{t+1} in the sentence as the syntax hint of the current token x_t. If no ancestor node occurs to the left of x_{t+1}, then the syntax hint is the current token itself. This syntax hint is the “high-level” information w.r.t. the token to be predicted.
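For illustration, this heuristic can be implemented roughly as follows, assuming the dependency parse is given as an array heads, where heads[i] is the index of token i's head; this helper and its conventions are ours, not the released code.

def syntax_hints(heads):
    # heads[i]: index of token i's head; the root points to itself or to -1.
    # Returns, for each position t, the position of the syntax hint that
    # supervises attention row t (used when predicting token t+1): the nearest
    # ancestor of token t+1 occurring to its left, or t itself as a fallback.
    n = len(heads)
    hints = []
    for t in range(n - 1):
        hint = t                          # fallback: the current token itself
        node = heads[t + 1]               # parent of the to-be-predicted token
        visited = set()
        while 0 <= node < n and node not in visited:
            visited.add(node)
            if node <= t:                 # nearest ancestor left of t+1
                hint = node
                break
            node = heads[node]            # keep walking up the tree
        hints.append(hint)
    return hints

For the example in Section 1, the nearest left-occurring ancestor of “mouse” in the dependency tree is the verb “catch”, so the attention row at the current token “a” would be encouraged to point to “catch”.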
The syntax hint is encoded as a one-hot vector for each token, with the “1” at the position of the syntax hint. Assume that these one-hot vectors (of length n) for the n tokens are stacked into y_s \in \mathbb{R}^{n \times n}; the pointer loss of each transformer block is then:

L_p = -\sum y_s \log A.    (5)

The pointer losses of all transformer blocks are added together, multiplied by a hyperparameter λ_p, and added to the final loss function.
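Given the attention matrix A of one block and the hint positions from the helper above, Eq. (5) reduces to a cross-entropy over the supervised rows. A minimal sketch follows; the summation over blocks and the λ_p weighting would live in the training loop.

import numpy as np

def pointer_loss(A, hints, eps=1e-9):
    # Eq. (5): cross-entropy between attention rows and the one-hot syntax
    # hints y_s. A: (n, n) attention matrix of one block; hints[t] is the
    # position supervising row t, for t = 0, ..., n-2.
    rows = np.arange(len(hints))
    # Only the attention entries selected by the one-hot vectors contribute.
    return -np.log(A[rows, hints] + eps).sum()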
Figure 2: Simple illustration of the syntax-guidance-free bird-eye transformer: (a) bird-eye attention, and (b) bird-eye transformer.

Syntax-guidance-free BET. This architecture aims to induce the syntax hints by itself, without external signals, and is named BET(SF); it is shown in Fig. 2. The main differences between BET(SF) and the standard transformer lie in two places: (1) we use bird-eye information to reweight the self-attention, which encourages attending to more informative historical tokens (the details are described in Figure 2), and (2) we add a diagonal mask to the self-attention matrix; a rough sketch of these two modifications is given below.
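The following is only a speculative sketch of how the bird-eye attention of Fig. 2(a) might be wired (reusing the numpy import and softmax helper from the sketch in Section 2): a feed-forward scorer, here a single hypothetical weight matrix W_score, turns the input and the ordinary attention output into a per-token “high-level” score that reweights the diagonal-free attention. The paper's exact rescaling may differ from this sketch.

def bird_eye_attention(X, W_Q, W_K, W_V, W_score):
    # Speculative sketch of Fig. 2(a); W_score is a hypothetical (2d, 1) matrix.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n, d = Q.shape
    D = Q @ K.T / np.sqrt(d)
    # Causal mask plus diagonal-free mask: attend only to strictly earlier tokens
    # (the first token, having no history, keeps attending to itself).
    mask = np.tril(np.ones((n, n), dtype=bool), k=-1)
    mask[0, 0] = True
    D = np.where(mask, D, -np.inf)
    H_prime = softmax(D, axis=-1) @ V        # ordinary attention output
    # An FFN scores how "high-level" each token is, from the input and H'.
    s = (np.concatenate([X, H_prime], axis=-1) @ W_score).squeeze(-1)
    # Shift each attended token's score by its high-level score before the final
    # Softmax, i.e., rescale its attention weight by exp(s) after normalization.
    D_reweighted = np.where(mask, D + s[None, :], -np.inf)
    return softmax(D_reweighted, axis=-1) @ V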
Intuitively, the output of self-attention collects more high-level information than the input, so it can help to decide which word of the input is a high-level word. Given the input X, we can get