Bird-Eye Transformers for Text Generation Models
Lei Sha, Yuhang Song, Yordan Yordanov, Tommaso Salvatori, Thomas Lukasiewicz
Department of Computer Science, University of Oxford, Oxford, UK
firstname.lastname@cs.ox.ac.uk
Abstract
Transformers have become an indispensable module for text generation models since their great success in machine translation. Previous works attribute the success of transformers to the query-key-value dot-product attention, which provides a robust inductive bias through fully connected token graphs. However, we found that self-attention has a severe limitation: when predicting the (i+1)-th token, self-attention only takes the i-th token as an information collector, which tends to assign high attention weights to tokens similar to itself. Therefore, most of the historical information that occurred before the i-th token is not taken into consideration. Based on this observation, in this paper, we propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers by reweighting self-attention to encourage it to focus more on important historical information. We have conducted experiments on multiple text generation tasks, including machine translation (2 datasets) and language modeling (3 datasets). The experimental results show that our proposed model achieves a better performance than the baseline transformer architectures on all datasets. The code is released at: https://sites.google.com/view/bet-transformer/home.
1 Introduction
The successful application of transformers (Vaswani et al., 2017) in machine translation shows that they are a much better choice for sequence modeling than auto-regressive architectures like RNNs (Rumelhart et al., 1986) and LSTMs (Hochreiter and Schmidhuber, 1997). The core of transformers is self-attention, a dot-product-based token-by-token correlation computation module. Compared to previously popular auto-regressive architectures, self-attention builds direct connections between tokens, which also captures long-range dependencies.
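For reference, the scaled dot-product self-attention at the core of the transformer can be written in the standard form of Vaswani et al. (2017), where Q, K, and V are the query, key, and value projections of the token representations and d_k is the key dimension:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V .
\]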
However, we found that self-attention has some severe disadvantages, namely, the self-attention modules focus too much on the current token and fail to provide specific attention to “high-level” historical tokens. “High-level” tokens are tokens that can influence other tokens over a long distance, such as tokens located close to the to-be-predicted token in the dependency parse tree. For example, in the sentence “... cat catch a mouse ...”, suppose the current token is “a” and we are predicting the token “mouse”. In the self-attention module, “a” is taken as an information collector to compute the attention weights with the other tokens via the dot product. This leads to the next token “mouse” being predicted mostly from the information in the token “a”, while more important information that occurred earlier (such as “cat” and “catch”) is not considered enough.
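To make this point concrete, the following minimal PyTorch sketch of standard single-head causal self-attention (an illustration only, not our BET model; the helper name causal_self_attention and the identity Q/K/V projections are simplifications) shows that the representation used to predict the (i+1)-th token is a single attention row whose query is the i-th token, so earlier tokens such as “cat” and “catch” contribute only to the extent that they score highly against that one query:

import torch
import torch.nn.functional as F

def causal_self_attention(x):
    # x: (seq_len, d) token representations; identity Q/K/V projections keep the sketch minimal.
    seq_len, d = x.shape
    scores = x @ x.t() / d ** 0.5                        # token-by-token dot products
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))     # causal mask: no attention to future tokens
    weights = F.softmax(scores, dim=-1)                  # row i: attention of token i over tokens 0..i
    return weights @ x, weights

# Toy history of 4 tokens, e.g., "cat catch a mouse"; position i = 2 plays the role of "a".
x = torch.randn(4, 8)
out, weights = causal_self_attention(x)
i = 2
# The (i+1)-th token ("mouse") is predicted from out[i] alone, a weighted sum whose
# weights all come from the single query x[i] ("a"); "cat" and "catch" only enter
# through their dot products with "a".
print(weights[i])
print(out[i])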
To tackle the above disadvantages of self-attention, in this paper, we propose to encourage the attention weights to focus more on the “high-level” tokens with some syntax guidance. We propose a novel architecture, called bird-eye transformer (BET), which provides transformers with a bird-eye view of all the historical tokens. BET has two alternative architectures to achieve this goal: (1) a syntax-guided transformer architecture, called BET(SG), which takes syntax hints from the dependency parse tree and uses these hints to reweight the self-attention dot-product matrix; and (2) a syntax-guidance-free transformer architecture, called BET(SF), which does not use any syntax hints. Here, to provide a bird-eye view, we first reweight the self-attention dot-product matrix towards attending to high-level tokens. We achieve this using a function that de-