
Syntax-guided Localized Self-attention by Constituency Syntactic Distance
Shengyuan Hou1∗, Jushi Kai1∗, Haotian Xue1∗
Bingyu Zhu2, Bo Yuan2, Longtao Huang2, Xinbing Wang1, Zhouhan Lin1†
1Shanghai Jiao Tong University 2Alibaba Group
{hsyhwjsr,json.kai,xavihart}@sjtu.edu.cn
{zhubingyu.zby,qiufu.yb,kaiyang.hlt}@alibaba-inc.com
lin.zhouhan@gmail.com
Abstract
Recent works have revealed that Transformers implicitly learn syntactic information from data in their lower layers, although this ability is highly dependent on the quality and scale of the training data. However, learning syntactic information from data is not necessary if we can leverage an external syntactic parser, which provides better parsing quality with well-defined syntactic structures. This could potentially improve Transformer's performance and sample efficiency. In this work, we propose a syntax-guided localized self-attention for Transformer that allows directly incorporating grammar structures from an external constituency parser. It prevents the attention mechanism from overweighting grammatically distant tokens over close ones. Experimental results show that our model can consistently improve translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes, and with different source languages.1
1 Introduction
Although Transformer does not have any inductive bias on syntactic structures, some studies have shown that it tends to learn syntactic information from data in its lower layers (Tenney et al., 2019; Goldberg, 2019; Jawahar et al., 2019). Given the pervasiveness of syntactic parsers that provide high-quality parsing results with well-defined syntactic structures, Transformers may not need to re-invent this wheel if grammar structures could be directly incorporated into them.
Prior to Transformer (Vaswani et al., 2017), earlier works have demonstrated that syntactic information could be helpful for various NLP tasks. For example, Levy and Goldberg (2014) introduced dependency structure into word embeddings, and Chen et al. (2017) uses Tree-LSTMs to process the grammar trees for machine translation.

∗ Equal contribution.
† Zhouhan Lin is the corresponding author.
1 Our code is available at https://github.com/LUMIA-Group/distance_transformer
More recently, dependency grammar has been successfully integrated into Transformer in various forms. Strubell et al. (2018) improves semantic role labelling by restricting each token to attend only to its dependency parent. Zhang et al. (2020) modifies BERT (Devlin et al., 2019) for named entity recognition as well as GLUE tasks (Wang et al., 2018) by adding an additional attention layer that allows every token to attend only to its ancestral tokens in the dependency parse tree. Bugliarello and Okazaki (2020) improves machine translation by constructing the attention weights from the dependency tree, while Li et al. (2021) masks out distant nodes in the dependency tree from attention.
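To make the general idea behind these dependency-based approaches concrete, the following Python sketch (our own simplified illustration, not a reproduction of any of the cited models) builds a boolean attention mask from a toy head array; the `heads` array, the function name, and the parent-and-children restriction are all illustrative assumptions:

```python
# Illustrative sketch: masking self-attention with a dependency parse.
# heads[i] is the dependency head of token i; -1 marks the root.
import numpy as np

def dependency_attention_mask(heads, allow_self=True):
    """Boolean mask: entry (i, j) is True iff token i may attend to token j.

    Attention is restricted to each token's dependency parent and children,
    a simplified variant of the restrictions used in dependency-based prior work.
    """
    n = len(heads)
    mask = np.zeros((n, n), dtype=bool)
    for i, h in enumerate(heads):
        if allow_self:
            mask[i, i] = True
        if h >= 0:              # -1 marks the root, which has no parent
            mask[i, h] = True   # child attends to its parent
            mask[h, i] = True   # parent attends to its child
    return mask

# Toy sentence "She reads books": "reads" is the root, the other tokens depend on it.
heads = [1, -1, 1]
mask = dependency_attention_mask(heads)
scores = np.random.randn(3, 3)               # unnormalized attention logits
scores = np.where(mask, scores, -1e9)        # disallowed positions get ~ -inf
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
```

After masking, the softmax assigns essentially zero weight to token pairs that are not directly connected in the dependency tree.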
While dependency grammar describes the relations between words, constituency grammar focuses more on how a sentence is formed by merging constituents block by block. Constituency grammar contains more information about the global structure of a sentence in a hierarchical way, which we believe can greatly benefit a global attention mechanism like self-attention in Transformers. Since constituency grammar does not directly reflect grammatical relations between words and introduces new constituent nodes, integrating it into Transformer is less straightforward. Ma et al. (2019) explores different ways of utilizing constituency syntax information in the Transformer model, including positional embeddings, output sequences, etc. Yang et al. (2020) uses dual encoders to encode both the source text and a template yielded by constituency grammar, at the cost of introducing a large number of parameters.
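As a concrete illustration of this hierarchical, block-by-block structure, the following sketch (a toy example built with NLTK's `Tree`; the sentence and bracketing are our own assumptions, not tied to any particular parser used in this work) enumerates the multi-word constituents of a parse and the token spans they cover. Tokens inside the same low-level constituent are syntactically close, while tokens that only share a high-level constituent are distant:

```python
# Illustrative sketch: reading the block structure of a constituency parse.
# Requires `pip install nltk`.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)

# List constituents spanning more than one word, together with their tokens.
for subtree in parse.subtrees(lambda t: t.height() > 2):
    print(subtree.label(), subtree.leaves())
# S  ['The', 'cat', 'sat', 'on', 'the', 'mat']
# NP ['The', 'cat']
# VP ['sat', 'on', 'the', 'mat']
# PP ['on', 'the', 'mat']
# NP ['the', 'mat']
```

Such nested spans are exactly the kind of global, hierarchical grouping that dependency arcs between individual word pairs do not expose directly.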
In this work, we propose a syntax-guided localized self-attention that effectively incorporates constituency grammar into Transformers, without introducing additional parameters. We first serial-