
Composition, Attention, or Both?∗
Ryo Yoshida and Yohei Oseki
The University of Tokyo
{yoshiryo0617, oseki}@g.ecc.u-tokyo.ac.jp

∗While writing this paper, we noticed that Sartran et al. (2022) was submitted to arXiv, proposing Transformer Grammars (TGs) that incorporate recursive syntactic composition via an attention mask. Their work and ours are similar in spirit, but differ in how a vector representation of subtrees is obtained, making them complementary. We discuss the details in Section 5.
Abstract
In this paper, we propose a novel architecture called Composition Attention Grammars (CAGs) that recursively compose subtrees into a single vector representation with a composition function, and selectively attend to previous structural information with a self-attention mechanism. We investigate whether these components—the composition function and the self-attention mechanism—can both induce human-like syntactic generalization. Specifically, we train language models (LMs) with and without these two components, with the model sizes carefully controlled, and evaluate their syntactic generalization performance against six test circuits on the SyntaxGym benchmark. The results demonstrated that the composition function and the self-attention mechanism both play an important role in making LMs more human-like, and closer inspection of grammatical phenomena implied that the composition function allowed syntactic features, but not semantic features, to percolate into subtree representations.
1 Introduction
Recently, language models (LMs) trained on large datasets have achieved remarkable success in various Natural Language Processing (NLP) tasks (cf. Wang et al., 2019a,b). The literature on targeted syntactic evaluation has shown that these models implicitly learn the syntactic structures of natural language, even though they do not receive explicit syntactic supervision (Warstadt et al., 2020; Hu et al., 2020).
However, previous work has also shown that there is still a benefit for LMs to receive explicit syntactic supervision. Recurrent Neural Network Grammars (RNNGs; Dyer et al., 2016), the integration of Recurrent Neural Networks (RNNs; Elman, 1990) with an explicit syntactic bias, have achieved better syntactic generalization performance than vanilla RNNs (Kuncoro et al., 2018; Wilcox et al., 2019; Hu et al., 2020). In addition, previous work has recommended RNNGs as a cognitively plausible architecture, showing that RNNGs can successfully predict human reading times (Yoshida et al., 2021) or brain activities (Hale et al., 2018). The key difference between RNNGs and RNNs is a composition function, which recursively composes subtrees into a single vector representation.
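To make the notion of a composition function concrete, the sketch below shows one way such a function can be realized, following the bidirectional-LSTM composition of Dyer et al. (2016): the embeddings of a completed subtree's nonterminal label and children are read in both directions and reduced to a single vector. The class name, dimensionality, and output projection here are illustrative assumptions, not the parameterization used in this paper.

```python
# Minimal sketch of an RNNG-style composition function (Dyer et al., 2016).
# Concrete choices (bidirectional LSTM, dimensionality, tanh projection) are
# illustrative assumptions, not the exact parameterization of CAGs.
import torch
import torch.nn as nn


class SubtreeComposer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Read the label and children forwards and backwards, then project back to `dim`.
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nt_embedding: torch.Tensor, child_vectors: torch.Tensor) -> torch.Tensor:
        # nt_embedding: (dim,) embedding of the nonterminal label, e.g. NP
        # child_vectors: (n_children, dim) vectors of the completed children
        seq = torch.cat([nt_embedding.unsqueeze(0), child_vectors], dim=0).unsqueeze(0)
        _, (h_n, _) = self.bilstm(seq)             # h_n: (2, 1, dim) final states of both directions
        both = torch.cat([h_n[0, 0], h_n[1, 0]])   # (2 * dim,)
        return torch.tanh(self.proj(both))         # single vector representing the subtree


# Usage: composing (NP the hungry cat) from its label embedding and word vectors.
composer = SubtreeComposer(dim=16)
np_label = torch.randn(16)
children = torch.randn(3, 16)  # "the", "hungry", "cat"
subtree_vec = composer(np_label, children)
print(subtree_vec.shape)  # torch.Size([16])
```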
On the other hand, Transformer architectures (Vaswani et al., 2017) have been shown to outperform RNN architectures in various NLP tasks (Devlin et al., 2019). The key difference between Transformers and RNNs here is a self-attention mechanism, which selectively attends to previous vectors to obtain sentence representations. Recently, an attempt was made to investigate whether Transformer architectures with the self-attention mechanism also benefit from explicit syntactic supervision (Qian et al., 2021), but their “Parsing as Language Modeling (PLM)” approach (Choe and Charniak, 2016) does not employ the composition function, which is essential for RNNGs. Therefore, it is reasonable to hypothesize that their approach may not achieve the full benefit of explicit syntactic supervision.
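For completeness, the following is a minimal sketch of the scaled dot-product self-attention underlying Transformers (Vaswani et al., 2017), in which each position builds its representation by selectively weighting previous vectors. The causal mask reflects the left-to-right LM setting; the single-head, shared-dimensionality form is a simplification of the multi-head version and is not specific to any model in this paper.

```python
# Minimal sketch of causal (left-to-right) scaled dot-product self-attention
# (Vaswani et al., 2017). Single head and shared dimensionality are simplifications.
import torch
import torch.nn.functional as F


def causal_self_attention(x: torch.Tensor,
                          w_q: torch.Tensor,
                          w_k: torch.Tensor,
                          w_v: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, dim) vectors for the prefix processed so far
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5                    # (seq_len, seq_len)
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))          # attend only to self and the past
    return F.softmax(scores, dim=-1) @ v                      # weighted sum of previous vectors


# Usage: 5 positions, 16-dimensional vectors.
dim = 16
x = torch.randn(5, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 16])
```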
In this paper, we propose a novel architecture called Composition Attention Grammars (CAGs) that recursively compose subtrees into a single vector representation with the composition function, and selectively attend to previous structural information with the self-attention mechanism. We investigate whether these components—the composition function and the self-attention mechanism—can both induce human-like syntactic generalization. Specifically, we train LMs with