Composition, Attention, or Both?
Ryo Yoshida and Yohei Oseki
The University of Tokyo
{yoshiryo0617, oseki}@g.ecc.u-tokyo.ac.jp
Abstract
In this paper, we propose a novel architecture called Composition Attention Grammars (CAGs) that recursively compose subtrees into a single vector representation with a composition function, and selectively attend to previous structural information with a self-attention mechanism. We investigate whether these components (the composition function and the self-attention mechanism) can both induce human-like syntactic generalization. Specifically, we train language models (LMs) with and without these two components with the model sizes carefully controlled, and evaluate their syntactic generalization performance against six test circuits on the SyntaxGym benchmark. The results demonstrated that the composition function and the self-attention mechanism both play an important role in making LMs more human-like, and closer inspection of grammatical phenomena implied that the composition function allowed syntactic features, but not semantic features, to percolate into subtree representations.
1 Introduction
Recently, language models (LMs) trained on large datasets have achieved remarkable success in various Natural Language Processing (NLP) tasks (cf. Wang et al., 2019a,b). The literature on targeted syntactic evaluations has shown that these models implicitly learn syntactic structures of natural language, even though they do not receive explicit syntactic supervision (Warstadt et al., 2020; Hu et al., 2020).
However, previous work has also shown that there is still a benefit for LMs to receive explicit syntactic supervision.* Recurrent Neural Network Grammars (RNNGs; Dyer et al., 2016), the integration of Recurrent Neural Networks (RNNs; Elman, 1990) with an explicit syntactic bias, have achieved better syntactic generalization performance than vanilla RNNs (Kuncoro et al., 2018; Wilcox et al., 2019; Hu et al., 2020). In addition, previous work has recommended RNNGs as a cognitively plausible architecture, showing that RNNGs can successfully predict human reading times (Yoshida et al., 2021) or brain activities (Hale et al., 2018). The key difference between RNNGs and RNNs is a composition function, which recursively composes subtrees into a single vector representation.

*While writing this paper, we noticed that Sartran et al. (2022) was submitted to arXiv, proposing Transformer Grammars (TGs) that incorporate recursive syntactic composition via an attention mask. Their work and ours are similar in spirit, but differ in how a vector representation of subtrees is obtained, making them complementary. We discuss the details in Section 5.
On the other hand, Transformer architectures (Vaswani et al., 2017) have been shown to outperform RNN architectures in various NLP tasks (Devlin et al., 2019). The key difference between Transformers and RNNs here is a self-attention mechanism, which selectively attends to previous vectors to obtain sentence representations. Recently, an attempt was made to investigate whether Transformer architectures with the self-attention mechanism also benefit from explicit syntactic supervision (Qian et al., 2021), but their "Parsing as Language Modeling (PLM)" approach (Choe and Charniak, 2016) does not employ the composition function, which is essential for RNNGs. Therefore, it is reasonable to hypothesize that their approach may not achieve the full benefit of explicit syntactic supervision.
In this paper, we propose a novel architecture called Composition Attention Grammars (CAGs) that recursively compose subtrees into a single vector representation with the composition function, and selectively attend to previous structural information with the self-attention mechanism. We investigate whether these components (the composition function and the self-attention mechanism) can both induce human-like syntactic generalization.
Figure 1: An example of actions to jointly generate the sentence and its syntactic structure in a top-down, left-to-right fashion. For "The bird sings" with the parse (S (NP The bird ) (VP sings ) ), the action sequence at time steps t = 1, ..., 9 is NT(S), NT(NP), GEN(The), GEN(bird), REDUCE, NT(VP), GEN(sings), REDUCE, REDUCE, emitting the symbols (S, (NP, The, bird, ), (VP, sings, ), and ) respectively.
Specifically, we train LMs with and without these two components, with the model sizes carefully controlled, and evaluate their syntactic generalization performance against six test circuits (Hu et al., 2020) on the SyntaxGym benchmark (Gauthier et al., 2020). The results demonstrated that the composition function and the self-attention mechanism both play an important role in making LMs more human-like, and closer inspection of grammatical phenomena implied that the composition function allowed syntactic features, but not semantic features, to percolate into subtree representations.
In addition, the methodological innovation of this paper is a strictly controlled experimental design, as practiced in cognitive sciences. In NLP research, evaluations are often conducted on models with different model sizes, leading to uncertainty regarding which component of these models affects the results. This paper conducts strictly controlled experiments in order to isolate the effects of individual components such as the composition function and the self-attention mechanism.
2 Composition Attention Grammar
In this section, we introduce a novel architecture
called Composition Attention Grammars (CAGs).
2.1 Syntactic language model
CAGs are a type of syntactic LM (Choe and Charniak, 2016; Dyer et al., 2016; Qian et al., 2021), which estimates the following joint distribution of a sentence X and its syntactic structure Y:

p(X, Y) = p(a_1, \cdots, a_n) = \prod_{t=1}^{n} p(a_t \mid a_{<t})    (1)

where a_t is an action by which CAGs jointly generate the sentence and its syntactic structure in a top-down, left-to-right fashion. Each a_t can be one of the three actions below:
GEN(x): Generate a terminal symbol “x”.
NT(X): Open a nonterminal symbol “X”.
REDUCE: Close a nonterminal symbol that was opened by NT(X).
See Figure 1 for an example of actions to jointly generate the sentence and its syntactic structure in a top-down, left-to-right fashion.
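As a concrete illustration of Equation 1 and the Figure 1 example, the Python sketch below enumerates the nine actions for "The bird sings" and sums their conditional log probabilities. It is illustrative only; the uniform scoring function is a placeholder standing in for a trained CAG.

```python
import math

# Action sequence from Figure 1, generating "The bird sings" together with
# its parse (S (NP The bird ) (VP sings ) ) in a top-down, left-to-right fashion.
actions = [
    "NT(S)", "NT(NP)", "GEN(The)", "GEN(bird)", "REDUCE",
    "NT(VP)", "GEN(sings)", "REDUCE", "REDUCE",
]

def joint_log_prob(actions, log_prob_fn):
    """log p(X, Y) = sum_t log p(a_t | a_<t), following Equation 1."""
    total = 0.0
    for t, a_t in enumerate(actions):
        total += log_prob_fn(a_t, actions[:t])  # condition on the prefix a_<t
    return total

# Placeholder conditional model (uniform over the three action types);
# a trained CAG would supply these probabilities instead.
uniform = lambda a_t, prefix: math.log(1.0 / 3.0)
print(joint_log_prob(actions, uniform))  # equals -9 * log(3)
```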
2.2 Architecture
To estimate the joint distribution in Equation 1, CAGs utilize (i) the composition function to recursively compose subtrees into a single vector representation, and (ii) the self-attention mechanism to selectively attend to previous structural information. The architecture of CAGs is summarized in Figure 2. Following previous work (Kuncoro et al., 2017; Noji and Oseki, 2021), CAGs rely on a stack data structure, and each action in Section 2.1 changes the stack state as follows:
GEN(x): Push a terminal embedding e_x onto the stack.

NT(X): Push a nonterminal embedding e_X onto the stack.

REDUCE: First, repeatedly pop vectors from the stack until a nonterminal embedding is popped. Then, apply the composition function based on bidirectional LSTMs (Schuster and Paliwal, 1997) to these popped vectors e_l, ..., e_m, to compose subtrees into a single vector representation e_s:

e_s = \mathrm{Composition}([e_l, \ldots, e_m])    (2)

e_s is then pushed onto the stack.
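The following PyTorch sketch shows one way a REDUCE step could compose the popped vectors into e_s with a bidirectional LSTM, per Equation 2. It is a hedged sketch, not the released implementation: the hidden size, the mean-pooling over the span, and the module name are simplifying assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMComposition(nn.Module):
    """Compose popped stack vectors [e_l, ..., e_m] into a single subtree vector e_s."""

    def __init__(self, dim):
        super().__init__()
        # A bidirectional LSTM over the popped span; the forward and backward
        # halves concatenate back to `dim` dimensions.
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, popped):                       # popped: (span_len, dim)
        out, _ = self.bilstm(popped.unsqueeze(0))    # (1, span_len, dim)
        return out.mean(dim=1).squeeze(0)            # e_s: (dim,), pooled over the span

# Usage: a REDUCE over a nonterminal embedding and two terminal embeddings.
composition = BiLSTMComposition(dim=64)
e_s = composition(torch.randn(3, 64))                # pushed back onto the stack
```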
After each action, CAGs employ the self-attention mechanism, which selectively attends to previous vectors in the stack e_1, ..., e_k by calculating the weight of attention to each vector with the query, key, and value vectors generated from e_1, ..., e_k, in order to represent a partial parse at each time step t:

h_t = \mathrm{SelfAttn}([e_1, \ldots, e_k])    (3)
Figure 2: The architecture of Composition Attention Grammars (CAGs). CAGs utilize (i) the composition function to recursively compose subtrees into a single vector representation, and (ii) the self-attention mechanism to selectively attend to previous structural information. (The figure steps through a partial parse such as "(S (NP The bird", showing how GEN(x), NT(X), and REDUCE update the stack, with SelfAttn applied after each action and Composition applied at REDUCE.)
Then, h_t defines the next action distribution:

a_{t+1} \sim \mathrm{softmax}(W_a h_t + b_a)    (4)

where W_a and b_a are the weights and biases of a fully connected layer that projects h_t to logits for each action a, and softmax is a softmax function that projects the logits to the next action distribution.
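Equations 3 and 4 can be sketched together as follows. This is a rough sketch under stated assumptions (a single attention layer, placeholder dimensions, and reading h_t off the top-of-stack position), not the paper's actual model configuration.

```python
import torch
import torch.nn as nn

dim, num_actions = 64, 100          # placeholder sizes (all GEN/NT/REDUCE actions)

self_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
action_head = nn.Linear(dim, num_actions)   # W_a and b_a in Equation 4

def next_action_distribution(stack_vectors):
    """stack_vectors: (k, dim), the vectors e_1, ..., e_k currently on the stack."""
    e = stack_vectors.unsqueeze(0)                  # (1, k, dim)
    h, _ = self_attn(e, e, e)                       # Equation 3: SelfAttn([e_1, ..., e_k])
    h_t = h[:, -1, :]                               # partial-parse representation at step t
    return torch.softmax(action_head(h_t), dim=-1)  # Equation 4

probs = next_action_distribution(torch.randn(5, dim))   # e.g., five vectors on the stack
```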
2.3 Differences from other syntactic LMs
In this subsection, we focus on the differences between CAGs and other syntactic LMs.
Difference from RNNGs
CAGs and RNNGs both utilize the composition function to recursively compose subtrees into a single vector representation. CAGs differ from RNNGs in that, in order to represent the partial parse at each time step, CAGs utilize the self-attention mechanism, which selectively attends to previous structural information, whereas RNNGs utilize stack-LSTMs (Dyer et al., 2015). We hypothesize that CAGs have the advantage of selective attention to previous structural information over RNNGs.
Difference from PLMs
CAGs and PLMs both utilize the self-attention mechanism, which selectively attends to previous structural information. CAGs differ from PLMs in that CAGs utilize the composition function to recursively compose subtrees into a single vector representation, whereas PLMs treat actions a_1, ..., a_n flatly, as vanilla Transformers treat words w_1, ..., w_n. We hypothesize that CAGs have the advantage of recursive composition of subtrees over PLMs.
In order to incorporate composition-like characteristics, Qian et al. (2021) proposed PLM-masks, namely, PLMs with a dynamic masking mechanism that specializes two attention heads: one to attend to the inside of the most recently opened nonterminal symbol, and another to attend to the outside. We will compare CAGs and PLM-masks in order to investigate whether recursive composition of subtrees has additional advantages over the dynamic masking mechanism in inducing human-like syntactic generalization.
3 Experiment
We designed a strictly controlled experiment for testing whether the two components (the composition function and the self-attention mechanism) can both induce human-like syntactic generalization. Specifically, we train LMs with and without these two components with the model sizes carefully controlled, and evaluate their syntactic generalization performance against six test circuits on the SyntaxGym benchmark. We also train and evaluate two vanilla LMs with and without the self-attention mechanism.
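To illustrate the evaluation paradigm in general terms, the sketch below shows the basic success criterion used in targeted syntactic evaluations of this kind: an LM passes an item if it assigns a higher log probability (lower surprisal) to the grammatical variant than to a minimally different ungrammatical one. This is a simplification of the actual SyntaxGym circuits, which compare surprisal at specific critical regions and across several conditions; `sentence_log_prob` is a hypothetical scoring function.

```python
def passes_item(grammatical, ungrammatical, sentence_log_prob):
    """Simplified criterion: prefer the grammatical member of a minimal pair."""
    return sentence_log_prob(grammatical) > sentence_log_prob(ungrammatical)

def accuracy(items, sentence_log_prob):
    hits = sum(passes_item(g, u, sentence_log_prob) for g, u in items)
    return hits / len(items)

# Example minimal pair (subject-verb agreement across an intervening PP):
items = [
    ("The keys to the cabinet are on the table.",
     "The keys to the cabinet is on the table."),
]
```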