Bird-Eye Transformers for Text Generation Models
Lei Sha, Yuhang Song, Yordan Yordanov, Tommaso Salvatori, Thomas Lukasiewicz
Department of Computer Science, University of Oxford, Oxford, UK
firstname.lastname@cs.ox.ac.uk
Abstract
Transformers have become an indispensable module for text generation models since their great success in machine translation. Previous works attribute the success of transformers to the query-key-value dot-product attention, which provides a robust inductive bias via fully connected token graphs. However, we found that self-attention has a severe limitation: when predicting the (i+1)-th token, self-attention only takes the i-th token as an information collector, and it tends to assign high attention weights to tokens that are similar to itself. Therefore, most of the historical information that occurred before the i-th token is not taken into consideration. Based on this observation, in this paper, we propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers by reweighting self-attention to encourage it to focus more on important historical information. We have conducted experiments on multiple text generation tasks, including machine translation (2 datasets) and language modeling (3 datasets). The experimental results show that our proposed model achieves better performance than the baseline transformer architectures on all datasets. The code is released at: https://sites.google.com/view/bet-transformer/home.
1 Introduction
The successful application of transformers (Vaswani et al., 2017) in machine translation shows that they are a much better choice for sequence modeling than recurrent architectures like RNNs (Rumelhart et al., 1986) and LSTMs (Hochreiter and Schmidhuber, 1997). The core of transformers is self-attention, a dot-product-based token-by-token correlation computation module. Compared to the previously popular recurrent architectures, self-attention directly builds connections between tokens, which also captures long-range dependencies.
However, we found that self-attention has a severe disadvantage: the self-attention module focuses too much on the current token and fails to provide specific attention to “high-level” historical tokens. “High-level” tokens are tokens that can influence other tokens over a long distance, for example, tokens located close to the to-be-predicted token in the dependency parse tree. Consider the sentence “... cat catch a mouse ...”, where the current token is “a” and we are predicting the token “mouse”. In the self-attention module, “a” is taken as an information collector to compute the attention weights with the other tokens via the dot product. As a result, the next token “mouse” is mostly predicted from the information of the token “a”, while more important information that occurred earlier (such as “cat” and “catch”) is not sufficiently considered.
To tackle the above disadvantage of self-attention, in this paper, we propose to encourage the attention weights to focus more on the “high-level” tokens by some syntax guidance. We propose a novel architecture, called bird-eye transformer (BET), which provides transformers with a bird-eye view of all the historical tokens. BET has two alternative architectures to achieve this goal: (1) a syntax-guided transformer architecture, called BET(SG), which takes syntax hints from the dependency parse tree and uses them to reweight the self-attention's dot-product matrix; and (2) a syntax-guidance-free transformer architecture, called BET(SF), which does not use any syntax hints. To provide a bird-eye view, BET(SF) reweights the self-attention's dot-product matrix towards attending to high-level tokens, using a function that decides which tokens are high-level according to the input sequence and the self-attention's output sequence; the architecture is thus expected to induce the high-level information by itself.
In addition, we show that the attention weight on the current token (in the dot-product matrix) should be redistributed to the other historical tokens. This modification does not lose the current token's information, because that information is added back in the residual connection (He et al., 2016) of the transformer.
Finally, the refined self-attention matrix is obtained by applying the Softmax to the reweighted dot-product matrix. The main contributions of this paper are briefly summarized as follows:
• We point out a severe disadvantage in the self-attention of transformers, and we report detailed experimental results that demonstrate this disadvantage.
• We propose a novel bird-eye transformer architecture, which refines the self-attention in transformers to provide it with a bird-eye view of the historical information in sequence modeling.
• We conduct experiments on multiple tasks and compare against several natural baselines to demonstrate the effectiveness of our proposed bird-eye transformer.
2 Background
Transformers were proposed by Vaswani et al. (2017) for machine translation, using a series of transformer blocks in both the encoder and the decoder. Each transformer block contains a self-attention layer, a feed-forward layer, multiple skip connections, and layer normalizations. Self-attention is the core component of transformers. Given an input sequence X = (x_1, \ldots, x_n) of token representations, we first use three linear transformations to map X into three matrices: queries Q, keys K, and values V. Then, self-attention is calculated as follows:
Q = X W_Q, \quad K = X W_K, \quad V = X W_V,
D = Q K^\top / \sqrt{d}, \quad A(Q, K, V) = \mathrm{Softmax}(D)\, V,    (1)

where d is the vector dimension of each token's representation, W_Q, W_K, and W_V are trainable parameters, and D is the dot-product matrix.
Each transformer block has two sub-blocks: one for self-attention and one for a feed-forward layer. The outputs of the self-attention and the feed-forward operation are each added to their inputs via a residual connection (He et al., 2016), followed by layer normalization:

X' = \mathrm{LayerNorm}\big(X + A(Q, K, V)\big),    (2)
H = \mathrm{LayerNorm}\big(X' + \mathrm{FFL}(X')\big),    (3)

where FFL stands for the feed-forward layer.
In text generation, when the transformer block is used as a decoder, the self-attention matrix is multiplied with a triangular mask to prevent each token from attending to subsequent tokens.
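For concreteness, Eqs. (1)–(3) together with the triangular mask can be sketched in NumPy as follows. This is only an illustrative re-implementation of the standard formulas above: layer normalization and the feed-forward layer are reduced to minimal stand-ins, biases are omitted, and all parameter names are ours.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def masked_self_attention(X, W_Q, W_K, W_V):
    # Eq. (1): queries, keys, values, and the scaled dot-product matrix D.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    D = Q @ K.T / np.sqrt(d)
    # Triangular mask: token i must not attend to subsequent tokens j > i.
    n = D.shape[0]
    D = np.where(np.tril(np.ones((n, n), dtype=bool)), D, -np.inf)
    return softmax(D, axis=-1) @ V

def transformer_block(X, W_Q, W_K, W_V, W_1, W_2):
    # Eq. (2): residual connection around self-attention, then LayerNorm.
    X_prime = layer_norm(X + masked_self_attention(X, W_Q, W_K, W_V))
    # Eq. (3): residual connection around a minimal feed-forward layer (FFL).
    return layer_norm(X_prime + np.maximum(X_prime @ W_1, 0.0) @ W_2)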
3 Approach
3.1 Motivation
Intuitively, the self-attention module should focus on some “high-level” tokens instead of focusing too much on the current token.
Assume that we are predicting the (i+1)-th token given the previous tokens x_0, \ldots, x_i. According to Eq. (1), the i-th row of the dot-product matrix D is calculated as:
D_i = \Big[ \tfrac{1}{\sqrt{d}}\, x_i W_Q W_K^\top x_0^\top, \; \ldots, \; \tfrac{1}{\sqrt{d}}\, x_i W_Q W_K^\top x_i^\top, \; [M], \; \ldots, \; [M] \Big],    (4)
where “[M]” stands for the masked-out elements that correspond to future tokens. The dot-product matrix D measures the relevance between tokens. Intuitively, since a token is always more similar to itself than to other tokens, the i-th item x_i W_Q W_K^\top x_i^\top is expected to be the largest entry in Eq. (4).
However, the i-th token does not actually need to attend to itself, because in the transformer architecture the information of the i-th token is added to the collected feature vector by the residual connection (He et al., 2016), as shown in Eq. (2). If we mask out the diagonal values of the self-attention matrix and redistribute that attention to the historical tokens, then the historical tokens obtain more attention, while the current token's contribution is still kept by the residual connection.
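As an illustration of this idea (our own sketch, not the paper's implementation, reusing the numpy import and the softmax helper from the sketch in Section 2), masking the diagonal of an already causally masked dot-product matrix and renormalizing splits the attention mass that would have gone to the current token among the historical tokens; how the first token, which has no history, is handled is our assumption.

def diagonal_free_attention(D, V):
    # D: (n, n) dot-product matrix, already masked with -inf above the diagonal.
    n = D.shape[0]
    D = D.copy()
    D[np.arange(n), np.arange(n)] = -np.inf   # mask the diagonal as well
    D[0, 0] = 0.0   # assumption: the first token keeps attending to itself
    A = softmax(D, axis=-1)                   # renormalize over historical tokens
    return A @ V, A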
3.2 Bird-Eye Transformers
Text generation models tend to use the information of historical tokens to predict the next token. However, the importance of historical tokens is not always the same. From a linguistic point of view, natural language is constructed by a set of syntax rules (Chomsky, 1956, 2014), which form a tree-like structure. So, according to their position in the syntax tree, tokens can be roughly divided into two types: high-level tokens and low-level tokens. High-level tokens usually contain general information about the current sentence and are able to affect tokens that are far away from them, while low-level tokens can only affect nearby tokens. Therefore, paying more attention to the high-level tokens promises to contribute to a better prediction of the next token in generation models, as shown in LSTM-based works (Shen et al., 2018; Sha et al., 2018).

Figure 1: Comparison between the conventional self-attention module and the syntax-guided self-attention module: (a) self-attention, (b) self-attention with a syntax hint, (c) the heuristic way of obtaining the syntax hint. For each token w, the syntax hint is another token that is an ancestor of w in the dependency tree and occurs before w in the sentence. If no such token exists, then the syntax hint of w is w itself.
We take one step further and propose a novel architecture, called bird-eye transformer (BET). This transformer architecture refines the self-attention weights to encourage the model to focus on more informative historical tokens.
Syntax-guided BET. Since the syntax features of the current token usually provide useful hints for the next token, we propose to use syntax hints to guide the attention weights towards the “high-level” tokens; this variant is named BET(SG). The guidance by syntax hints is provided through a pointer loss, as shown in Fig. 1(b).
The syntax hints are taken directly from the dependency parse tree. If x_{t+1} is the token to be predicted, then we take its nearest ancestor node in the dependency parse tree that also occurs to the left of x_{t+1} in the sentence as the syntax hint of the current token x_t. If no ancestor node occurs to the left of x_{t+1}, then the syntax hint is the current token itself. This syntax hint is the “high-level” information w.r.t. the token to be predicted.
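For illustration, this heuristic can be implemented roughly as follows, assuming the dependency parse is given as an array heads, where heads[i] is the index of token i's head; this helper and its conventions are ours, not the released code.

def syntax_hints(heads):
    # heads[i]: index of token i's head; the root points to itself or to -1.
    # Returns, for each position t, the position of the syntax hint that
    # supervises attention row t (used when predicting token t+1): the nearest
    # ancestor of token t+1 occurring to its left, or t itself as a fallback.
    n = len(heads)
    hints = []
    for t in range(n - 1):
        hint = t                          # fallback: the current token itself
        node = heads[t + 1]               # parent of the to-be-predicted token
        visited = set()
        while 0 <= node < n and node not in visited:
            visited.add(node)
            if node <= t:                 # nearest ancestor left of t+1
                hint = node
                break
            node = heads[node]            # keep walking up the tree
        hints.append(hint)
    return hints

For the example in Section 1, the nearest left-occurring ancestor of “mouse” in the dependency tree is the verb “catch”, so the attention row at the current token “a” would be encouraged to point to “catch”.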
The syntax hint is encoded as a one-hot vector for each token, with the “1” at the position of the syntax hint. Assume that these one-hot vectors (of length n) for the n tokens are stacked into y_s \in \mathbb{R}^{n \times n}; the pointer loss of each transformer block is then:

L_p = -\sum y_s \log A.    (5)

The pointer losses of all transformer blocks are added together, multiplied by a hyperparameter λ_p, and added to the final loss function.
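Given the attention matrix A of one block and the hint positions from the helper above, Eq. (5) reduces to a cross-entropy over the supervised rows. A minimal sketch follows; the summation over blocks and the λ_p weighting would live in the training loop.

import numpy as np

def pointer_loss(A, hints, eps=1e-9):
    # Eq. (5): cross-entropy between attention rows and the one-hot syntax
    # hints y_s. A: (n, n) attention matrix of one block; hints[t] is the
    # position supervising row t, for t = 0, ..., n-2.
    rows = np.arange(len(hints))
    # Only the attention entries selected by the one-hot vectors contribute.
    return -np.log(A[rows, hints] + eps).sum()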
Figure 2: Simple illustration of the syntax-guidance-free bird-eye transformer: (a) bird-eye attention, and (b) bird-eye transformer.

Syntax-guidance-free BET. This architecture aims to induce the syntax hints by itself, without external signals, and is named BET(SF); it is shown in Fig. 2. The main differences between BET(SF) and the standard transformer lie in two places: (1) we use bird-eye information to reweight the self-attention, which encourages attending to more informative historical tokens (the details are described in Figure 2), and (2) we add a diagonal mask to the self-attention matrix; a rough sketch of these two modifications is given below.
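The following is only a speculative sketch of how the bird-eye attention of Fig. 2(a) might be wired (reusing the numpy import and softmax helper from the sketch in Section 2): a feed-forward scorer, here a single hypothetical weight matrix W_score, turns the input and the ordinary attention output into a per-token “high-level” score that reweights the diagonal-free attention. The paper's exact rescaling may differ from this sketch.

def bird_eye_attention(X, W_Q, W_K, W_V, W_score):
    # Speculative sketch of Fig. 2(a); W_score is a hypothetical (2d, 1) matrix.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n, d = Q.shape
    D = Q @ K.T / np.sqrt(d)
    # Causal mask plus diagonal-free mask: attend only to strictly earlier tokens
    # (the first token, having no history, keeps attending to itself).
    mask = np.tril(np.ones((n, n), dtype=bool), k=-1)
    mask[0, 0] = True
    D = np.where(mask, D, -np.inf)
    H_prime = softmax(D, axis=-1) @ V        # ordinary attention output
    # An FFN scores how "high-level" each token is, from the input and H'.
    s = (np.concatenate([X, H_prime], axis=-1) @ W_score).squeeze(-1)
    # Shift each attended token's score by its high-level score before the final
    # Softmax, i.e., rescale its attention weight by exp(s) after normalization.
    D_reweighted = np.where(mask, D + s[None, :], -np.inf)
    return softmax(D_reweighted, axis=-1) @ V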
Intuitively, the output of self-attention collects more high-level information than the input, so it can help to decide which word of the input is a high-level word. Given the input X, we can get