Syntax-guided Localized Self-attention by Constituency Syntactic
Distance
Shengyuan Hou1  Jushi Kai1  Haotian Xue1
Bingyu Zhu2  Bo Yuan2  Longtao Huang2  Xinbing Wang1  Zhouhan Lin1
1Shanghai Jiao Tong University 2Alibaba Group
{hsyhwjsr,json.kai,xavihart}@sjtu.edu.cn
{zhubingyu.zby,qiufu.yb,kaiyang.hlt}@alibaba-inc.com
lin.zhouhan@gmail.com
Abstract
Recent works have revealed that Transformers implicitly learn syntactic information from data in their lower layers, although this learning is highly dependent on the quality and scale of the training data. However, learning syntactic information from data is not necessary if we can leverage an external syntactic parser, which provides better parsing quality with well-defined syntactic structures. This could potentially improve Transformer's performance and sample efficiency. In this work, we propose a syntax-guided localized self-attention for Transformer that allows directly incorporating grammar structures from an external constituency parser. It prevents the attention mechanism from overweighting grammatically distant tokens over close ones. Experimental results show that our model consistently improves translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes and with different source languages.1
1 Introduction
Although Transformer doesn't have any inductive bias on syntactic structures, some studies have shown that it tends to learn syntactic information from data in its lower layers (Tenney et al., 2019; Goldberg, 2019; Jawahar et al., 2019). Given the pervasiveness of syntactic parsers that provide high-quality parsing results with well-defined syntactic structures, Transformers may not need to re-invent this wheel if grammar structures could be directly incorporated into them.
Equal contribution. Zhouhan Lin is the corresponding author.
1 Our code is available at https://github.com/LUMIA-Group/distance_transformer

Prior to Transformer (Vaswani et al., 2017), earlier works demonstrated that syntactic information could be helpful for various NLP tasks. For example, Levy and Goldberg (2014) introduced dependency structures into word embeddings, and Chen et al. (2017) used Tree-LSTMs to process grammar trees for machine translation.
More recently, dependency grammar has been successfully integrated into Transformer in various forms. Strubell et al. (2018) improves semantic role labelling by restricting tokens to only attend to their dependency parents. Zhang et al. (2020) modifies BERT (Devlin et al., 2019) for named entity recognition as well as GLUE tasks (Wang et al., 2018) by adding an additional attention layer that allows every token to only attend to its ancestral tokens in the dependency parse tree. Bugliarello and Okazaki (2020) improves machine translation by constructing the attention weights from the dependency tree, while Li et al. (2021) masks out distant nodes in the dependency tree from attention.
While dependency grammar captures the relations between nodes, constituency grammar focuses more on how a sentence is formed by merging constituents block by block. Constituency grammar contains more information about the global structure of a sentence in a hierarchical way, which we believe can greatly benefit global attention mechanisms such as self-attention in Transformers. Since constituency grammar doesn't directly reflect grammatical relations between words and introduces new constituent nodes, integrating it into Transformer is less obvious. Ma et al. (2019) explores different ways of utilizing constituency syntax information in the Transformer model, including positional embeddings, output sequences, etc. Yang et al. (2020) uses dual encoders to encode both the source text and the template yielded by constituency grammar, at the cost of introducing a large number of parameters.
In this work, we propose a syntax-guided localized self-attention that effectively incorporates constituency grammar into Transformers without introducing additional parameters.
[Figure 1: (a) Constituency Grammar Tree; (b) Syntactic Distance; (c) Syntactic Local Range]
Figure 1: (a) The constituency tree for the example sentence "I swim across the river.". (b) Its syntactic distances. (c) The attention mask reflecting the syntactic local range of each word. For example, rather than attending to the whole sequence, "across" is encouraged to attend to "swim", "the" and "river", while the other tokens are suppressed.
We first serialize constituency trees through the syntactic distance (Shen et al., 2018), and then select several attention heads as grammar-aware heads, in which the attention range of each token is individually modulated according to its grammatical role. The modulated attention ranges are called syntactic local ranges; they prevent the attention mechanism from overweighting grammatically distant tokens over close ones. Experimental results show that our model consistently improves translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes and with different source languages.
2 Preliminary: Syntactic Distance
2.1 Definition
Syntactic distance (Shen et al., 2018) is a serialized vector representation of a constituency grammar tree (Fig. 1(a)), defined as follows:
Definition 2.1 (Syntactic Distance). Given a sentence $S = (t_1, \ldots, t_n)$ of length $n$ and its constituency grammar tree $T$, let $h^i_j$ denote the height of the lowest common ancestor of any pair of tokens $t_i, t_j$. The syntactic distance $D = (d_1, \ldots, d_{n-1})$ of this sentence can be any vector of scalars of length $n-1$ that satisfies

$$\forall i, j \in [1, n-1], \quad \mathrm{sign}(d_i - d_j) = \mathrm{sign}(h^i_{i+1} - h^j_{j+1}). \tag{1}$$
Intuitively, the syntactic distance $D$ keeps the same ranking order as the sequence $(h^1_2, h^2_3, \ldots, h^{n-1}_n)$, in which $h^i_{i+1}$ is the height of the lowest common ancestor of each pair of consecutive words in the sentence (see Fig. 1(b)).
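For the sentence of Fig. 1, the heights of the lowest common ancestors of consecutive word pairs are $(h^1_2, \ldots, h^5_6) = (4, 3, 2, 1, 4)$, so $D = (4, 3, 2, 1, 4)$ satisfies Eq. (1); so does any vector with the same ranking order, e.g. the illustrative rescaling $D = (8, 6, 4, 2, 8)$, since Eq. (1) only constrains the pairwise ordering of the $d_i$.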
2.2 Generation of Syntactic Distance
The syntactic distance can be generated by recursively splitting the constituency tree in a top-down manner. Following the merging order of the constituency tree, for any subtree T, the subtrees rooted at T's child nodes must be constructed first, so the merging of T's child nodes takes place afterwards. Accordingly, the syntactic distances inside all of T's subtrees are computed first, and the distance assigned to the split points directly under T is the maximum of those values plus 1.
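To make this recursion concrete, the following is a minimal sketch, assuming the constituency parse is available as nested Python lists whose leaves are token strings; it is an illustrative reconstruction, not the authors' released implementation.

# Minimal sketch of the top-down syntactic distance computation described above.
# Assumption: the parse tree is given as nested Python lists whose leaves are
# token strings; this is an illustration, not the authors' released code.

def syntactic_distance(tree):
    """Return (leaves, distances), where len(distances) == len(leaves) - 1."""
    if isinstance(tree, str):            # a single token has no internal splits
        return [tree], []

    # Distances inside every child subtree are computed first.
    child_results = [syntactic_distance(child) for child in tree]
    max_inner = max((max(d) for _, d in child_results if d), default=0)
    split_value = max_inner + 1          # splits directly under this node merge last

    leaves, distances = [], []
    for i, (child_leaves, child_dists) in enumerate(child_results):
        if i > 0:
            distances.append(split_value)   # boundary between child i-1 and child i
        leaves.extend(child_leaves)
        distances.extend(child_dists)
    return leaves, distances


# The sentence of Fig. 1(a): S -> (I, VP, .), VP -> (swim, PP), PP -> (across, NP)
tree = ["I", ["swim", ["across", ["the", "river"]]], "."]
print(syntactic_distance(tree))
# (['I', 'swim', 'across', 'the', 'river', '.'], [4, 3, 2, 1, 4])  -- matches Fig. 1(b)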
During preprocessing, the syntactic distance is computed separately for each dataset. For each sentence, we first merge all BPE segments back into words, parse the sentence with the Stanford CoreNLP toolkit (Manning et al., 2014), and compute the syntactic distance according to Algorithm 1. When filling in the syntactic distance between sub-word units produced by BPE segmentation of the same word, the value is 0; all syntactic distances are then incremented by 1, so the final vector has a minimum value of 1. When an input contains multiple sentences, we generate the syntactic distance of each sentence separately and insert a maximum value of 999 between sentences, indicating that all sentences are merged last.
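The bookkeeping around BPE and multiple sentences can be sketched as follows, assuming word-level distances from the procedure above and the "@@" continuation marker of subword-nmt-style BPE (the marker convention is our assumption, not stated in the paper):

# Sketch of the BPE and multi-sentence handling described above. Assumptions:
# word-level distances are already computed; BPE pieces of an unfinished word
# carry a trailing "@@" marker, as in subword-nmt.

def expand_to_bpe(word_distances, bpe_tokens):
    """Expand word-level distances to BPE-token-level distances.

    Boundaries inside a BPE-split word get distance 0; afterwards every value
    is shifted by +1 so the minimum of the final vector is 1.
    """
    distances, word_idx = [], 0
    for tok in bpe_tokens[:-1]:
        if tok.endswith("@@"):                 # next piece continues the same word
            distances.append(0)
        else:                                  # real word boundary
            distances.append(word_distances[word_idx])
            word_idx += 1
    return [d + 1 for d in distances]


def join_sentences(per_sentence_distances):
    """Concatenate per-sentence distance vectors, inserting the maximum value
    999 between sentences so that all sentences are merged last."""
    joined = []
    for i, dists in enumerate(per_sentence_distances):
        if i > 0:
            joined.append(999)
        joined.extend(dists)
    return joined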
3 Method
We present a form of localized self-attention that dynamically controls each word's attention range according to its syntactic role in the sentence (see Fig. 1(c)). Attention heads that incorporate this localized self-attention give significantly more weight to grammatically close tokens than to distant ones, thereby incorporating the syntactic information as prior knowledge.
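As a simple illustration of where such a mechanism plugs into attention, the sketch below applies a hard binary mask in the spirit of Fig. 1(c), assuming each token's syntactic local range is already given as an inclusive (start, end) index pair; how the paper actually derives and applies these ranges in the grammar-aware heads is specified in the remainder of this section.

# Hard-mask sketch of localized self-attention, in the spirit of Fig. 1(c).
# Assumption: each token's syntactic local range is given as an inclusive
# (start, end) index pair; the paper's own derivation of these ranges from the
# syntactic distances is described later in this section.

import torch
import torch.nn.functional as F

def localized_attention(q, k, v, local_ranges):
    """q, k, v: (seq_len, d) tensors; local_ranges: one (start, end) pair per token."""
    seq_len, d = q.shape
    mask = torch.full((seq_len, seq_len), float("-inf"))
    for i, (start, end) in enumerate(local_ranges):
        mask[i, start:end + 1] = 0.0        # token i may attend only to [start, end]
    scores = q @ k.transpose(0, 1) / d ** 0.5 + mask
    return F.softmax(scores, dim=-1) @ v

# Illustrative ranges for "I swim across the river ."; only the row of "across"
# (indices 1-4, i.e. "swim" ... "river") is taken from the caption of Fig. 1(c),
# the other rows are placeholders.
q = k = v = torch.randn(6, 8)
out = localized_attention(q, k, v, [(0, 1), (0, 4), (1, 4), (2, 4), (1, 5), (4, 5)])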