Syntax-guided Localized Self-attention by Constituency Syntactic
Distance
Shengyuan Hou1  Jushi Kai1  Haotian Xue1
Bingyu Zhu2  Bo Yuan2  Longtao Huang2  Xinbing Wang1  Zhouhan Lin1
1Shanghai Jiao Tong University 2Alibaba Group
{hsyhwjsr,json.kai,xavihart}@sjtu.edu.cn
{zhubingyu.zby,qiufu.yb,kaiyang.hlt}@alibaba-inc.com
lin.zhouhan@gmail.com
Abstract
Recent works have revealed that Transformers implicitly learn syntactic information from data in their lower layers, although this learning is highly dependent on the quality and scale of the training data. However, learning syntactic information from data is not necessary if we can leverage an external syntactic parser, which provides better parsing quality with well-defined syntactic structures. This could potentially improve Transformer's performance and sample efficiency. In this work, we propose a syntax-guided localized self-attention for Transformer that allows directly incorporating grammar structures from an external constituency parser. It prevents the attention mechanism from overweighting grammatically distant tokens over close ones. Experimental results show that our model consistently improves translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes and with different source languages.1
1 Introduction
Although Transformer doesn't have any inductive bias on syntactic structures, some studies have shown that it tends to learn syntactic information from data in its lower layers (Tenney et al., 2019; Goldberg, 2019; Jawahar et al., 2019). Given the pervasiveness of syntactic parsers that provide high-quality parsing results with well-defined syntactic structures, Transformers may not need to re-invent this wheel if grammar structures could be directly incorporated into them.
Equal contribution. Zhouhan Lin is the corresponding author.
1 Our code is available at https://github.com/LUMIA-Group/distance_transformer

Prior to Transformer (Vaswani et al., 2017), earlier works demonstrated that syntactic information could be helpful for various NLP tasks. For example, Levy and Goldberg (2014) introduced dependency structures into word embeddings, and Chen et al. (2017) used Tree-LSTMs to process grammar trees for machine translation.
More recently, dependency grammar has been successfully integrated into Transformer in various forms. Strubell et al. (2018) improves semantic role labelling by restricting tokens to only attend to their dependency parents. Zhang et al. (2020) modifies BERT (Devlin et al., 2019) for named entity recognition as well as GLUE tasks (Wang et al., 2018) by adding an additional attention layer that allows every token to only attend to its ancestral tokens in the dependency parse tree. Bugliarello and Okazaki (2020) improves machine translation by constructing the attention weights from the dependency tree, while Li et al. (2021) masks out distant nodes in the dependency tree from attention.
While dependency grammar captures the relations between nodes, constituency grammar focuses more on how a sentence is formed by merging constituents block by block. Constituency grammar contains more information about the global structure of a sentence in a hierarchical way, which we believe can greatly benefit global attention mechanisms such as self-attention in Transformers. Since constituency grammar doesn't directly reflect grammatical relations between words and introduces new constituent nodes, integrating it into Transformer is less obvious. Ma et al. (2019) explores different ways of utilizing constituency syntax information in the Transformer model, including positional embeddings, output sequences, etc. Yang et al. (2020) uses dual encoders to encode both the source text and the template yielded by constituency grammar, at the cost of introducing a large number of parameters.
In this work, we propose a syntax-guided localized self-attention that effectively incorporates constituency grammar into Transformers without introducing additional parameters.
[Figure 1: (a) Constituency Grammar Tree; (b) Syntactic Distance; (c) Syntactic Local Range]
Figure 1: (a) The constituency tree for the example sentence "I swim across the river.". (b) Its syntactic distances. (c) The attention mask reflecting the syntactic local range of each word. For example, rather than attending to the whole sequence, "across" is encouraged to attend to "swim", "the" and "river", while the other tokens are suppressed.
We first serialize constituency trees through the syntactic distance (Shen et al., 2018), and then select several attention heads as grammar-aware heads, in which the attention range of each token is individually modulated according to its grammatical role. The modulated attention ranges are called syntactic local ranges; they prevent the attention mechanism from overweighting grammatically distant tokens over close ones. Experimental results show that our model consistently improves translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes and with different source languages.
2 Preliminary: Syntactic Distance
2.1 Definition
Syntactic distance (Shen et al., 2018) is a serialized vector representation of a constituency grammar tree (Fig. 1(a)), defined as follows:
Definition 2.1 (Syntactic Distance). Given a sentence $S = (t_1, \ldots, t_n)$ of length $n$ and its constituency grammar tree $T$, let $h^i_j$ denote the height of the lowest common ancestor of any pair of tokens $t_i, t_j$. The syntactic distance $D = (d_1, \ldots, d_{n-1})$ of this sentence can be any vector of scalars of length $n-1$ that satisfies

$$\forall i, j \in [1, n-1], \quad \mathrm{sign}(d_i - d_j) = \mathrm{sign}(h^i_{i+1} - h^j_{j+1}). \tag{1}$$
Intuitively, the syntactic distance $D$ keeps the same ranking order as the sequence $(h^1_2, h^2_3, \ldots, h^{n-1}_n)$, in which $h^i_{i+1}$ is the height of the lowest common ancestor of each pair of consecutive words in the sentence (see Fig. 1(b)).
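For the sentence of Fig. 1, the heights of the lowest common ancestors of consecutive word pairs are $(h^1_2, \ldots, h^5_6) = (4, 3, 2, 1, 4)$, so $D = (4, 3, 2, 1, 4)$ satisfies Eq. (1); so does any vector with the same ranking order, e.g. the illustrative rescaling $D = (8, 6, 4, 2, 8)$, since Eq. (1) only constrains the pairwise ordering of the $d_i$.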
2.2 Generation of Syntactic Distance
The syntactic distance can be generated by recursively splitting the constituency tree in a top-down manner. Following the merging order of the constituency tree, for any subtree T, the subtrees rooted at T's child nodes must be constructed first, so the merging of T's child nodes takes place afterwards. Accordingly, the syntactic distances inside all of T's subtrees are computed first, and the distance assigned to the split points directly under T is the maximum of those values plus 1.
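To make this recursion concrete, the following is a minimal sketch, assuming the constituency parse is available as nested Python lists whose leaves are token strings; it is an illustrative reconstruction, not the authors' released implementation.

# Minimal sketch of the top-down syntactic distance computation described above.
# Assumption: the parse tree is given as nested Python lists whose leaves are
# token strings; this is an illustration, not the authors' released code.

def syntactic_distance(tree):
    """Return (leaves, distances), where len(distances) == len(leaves) - 1."""
    if isinstance(tree, str):            # a single token has no internal splits
        return [tree], []

    # Distances inside every child subtree are computed first.
    child_results = [syntactic_distance(child) for child in tree]
    max_inner = max((max(d) for _, d in child_results if d), default=0)
    split_value = max_inner + 1          # splits directly under this node merge last

    leaves, distances = [], []
    for i, (child_leaves, child_dists) in enumerate(child_results):
        if i > 0:
            distances.append(split_value)   # boundary between child i-1 and child i
        leaves.extend(child_leaves)
        distances.extend(child_dists)
    return leaves, distances


# The sentence of Fig. 1(a): S -> (I, VP, .), VP -> (swim, PP), PP -> (across, NP)
tree = ["I", ["swim", ["across", ["the", "river"]]], "."]
print(syntactic_distance(tree))
# (['I', 'swim', 'across', 'the', 'river', '.'], [4, 3, 2, 1, 4])  -- matches Fig. 1(b)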
During preprocessing, the syntactic distance is computed separately for each dataset. For each sentence, we first merge all BPE segments back into words, parse the sentence with the Stanford CoreNLP toolkit (Manning et al., 2014), and compute the syntactic distance according to Algorithm 1. When filling in the syntactic distance between sub-word units produced by BPE segmentation of the same word, the value is 0; all syntactic distances are then incremented by 1, so the final vector has a minimum value of 1. When an input contains multiple sentences, we generate the syntactic distance of each sentence separately and insert a maximum value of 999 between sentences, indicating that all sentences are merged last.
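The bookkeeping around BPE and multiple sentences can be sketched as follows, assuming word-level distances from the procedure above and the "@@" continuation marker of subword-nmt-style BPE (the marker convention is our assumption, not stated in the paper):

# Sketch of the BPE and multi-sentence handling described above. Assumptions:
# word-level distances are already computed; BPE pieces of an unfinished word
# carry a trailing "@@" marker, as in subword-nmt.

def expand_to_bpe(word_distances, bpe_tokens):
    """Expand word-level distances to BPE-token-level distances.

    Boundaries inside a BPE-split word get distance 0; afterwards every value
    is shifted by +1 so the minimum of the final vector is 1.
    """
    distances, word_idx = [], 0
    for tok in bpe_tokens[:-1]:
        if tok.endswith("@@"):                 # next piece continues the same word
            distances.append(0)
        else:                                  # real word boundary
            distances.append(word_distances[word_idx])
            word_idx += 1
    return [d + 1 for d in distances]


def join_sentences(per_sentence_distances):
    """Concatenate per-sentence distance vectors, inserting the maximum value
    999 between sentences so that all sentences are merged last."""
    joined = []
    for i, dists in enumerate(per_sentence_distances):
        if i > 0:
            joined.append(999)
        joined.extend(dists)
    return joined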
3 Method
We present a form of localized self-attention that dynamically controls each word's attention range according to its syntactic role in the sentence (see Fig. 1(c)). Attention heads that incorporate this localized self-attention give significantly more weight to grammatically close tokens than to distant ones, thereby incorporating the syntactic information as prior knowledge.
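As a simple illustration of where such a mechanism plugs into attention, the sketch below applies a hard binary mask in the spirit of Fig. 1(c), assuming each token's syntactic local range is already given as an inclusive (start, end) index pair; how the paper actually derives and applies these ranges in the grammar-aware heads is specified in the remainder of this section.

# Hard-mask sketch of localized self-attention, in the spirit of Fig. 1(c).
# Assumption: each token's syntactic local range is given as an inclusive
# (start, end) index pair; the paper's own derivation of these ranges from the
# syntactic distances is described later in this section.

import torch
import torch.nn.functional as F

def localized_attention(q, k, v, local_ranges):
    """q, k, v: (seq_len, d) tensors; local_ranges: one (start, end) pair per token."""
    seq_len, d = q.shape
    mask = torch.full((seq_len, seq_len), float("-inf"))
    for i, (start, end) in enumerate(local_ranges):
        mask[i, start:end + 1] = 0.0        # token i may attend only to [start, end]
    scores = q @ k.transpose(0, 1) / d ** 0.5 + mask
    return F.softmax(scores, dim=-1) @ v

# Illustrative ranges for "I swim across the river ."; only the row of "across"
# (indices 1-4, i.e. "swim" ... "river") is taken from the caption of Fig. 1(c),
# the other rows are placeholders.
q = k = v = torch.randn(6, 8)
out = localized_attention(q, k, v, [(0, 1), (0, 4), (1, 4), (2, 4), (1, 5), (4, 5)])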