Structural generalization is hard for sequence-to-sequence models
Yuekun Yao and Alexander Koller
Department of Language Science and Technology
Saarland Informatics Campus
Saarland University, Saarbrücken, Germany
{ykyao, koller}@coli.uni-saarland.de
Abstract
Sequence-to-sequence (seq2seq) models have been successful across many NLP tasks, including ones that require predicting linguistic structure. However, recent work on compositional generalization has shown that seq2seq models achieve very low accuracy in generalizing to linguistic structures that were not seen in training. We present new evidence that this is a general limitation of seq2seq models that is present not just in semantic parsing, but also in syntactic parsing and in text-to-text tasks, and that this limitation can often be overcome by neurosymbolic models that have linguistic knowledge built in. We further report on some experiments that give initial answers on the reasons for these limitations.
1 Introduction
Humans are able to understand and produce linguistic structures they have never observed before (Chomsky, 1957; Fodor and Pylyshyn, 1988; Fodor and Lepore, 2002). From limited, finite observations, they generalize at an early age to an infinite variety of novel structures using recursion. They can also assign meaning to these structures, using the Principle of Compositionality. This ability to generalize to unseen structures is important for NLP systems in low-resource settings, such as underresourced languages or projects with a limited annotation budget, where a user can easily produce structures that were never annotated in training.
Over the past few years, large pretrained sequence-to-sequence (seq2seq) models, such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), have brought tremendous progress to many NLP tasks. This includes linguistically complex tasks such as broad-coverage semantic parsing, where, e.g., a lightly modified BART set a new state of the art on AMR parsing (Bevilacqua et al., 2021). However, there have been some concerns that seq2seq models may have difficulties with compositional generalization, a class of tasks in semantic parsing where the training data is structurally impoverished in comparison to the test data (Lake and Baroni, 2018; Keysers et al., 2020). We focus on the COGS dataset of Kim and Linzen (2020) because some of its generalization types specifically target structural generalization, i.e. the ability to generalize to unseen structures.
In this paper, we make two contributions. First, we offer evidence that structural generalization is systematically hard for seq2seq models. On the semantic parsing task of COGS, seq2seq models do not fail on compositional generalization as a whole, but specifically on the three COGS generalization types that require generalizing to unseen linguistic structures, achieving accuracies below 10%. This is true both for BART and T5 and for seq2seq models that were specifically developed for COGS. Moreover, BART and T5 fail similarly on syntax and even POS tagging variants of COGS (introduced in this paper), indicating that they struggle not only with compositional generalization in semantics, but with structural generalization more generally. Structure-aware models, such as the compositional semantic parsers of Liu et al. (2021) and Weißenhorn et al. (2022) and the Neural Berkeley Parser (Kitaev and Klein, 2018), achieve perfect accuracy on these tasks.
Second, we conduct a series of experiments to investigate what makes structural generalization so hard for seq2seq models. It is not because the encoder loses structurally relevant information: one can train a probe to predict COGS syntax from BART encodings, in line with earlier work (Hewitt and Manning, 2019; Tenney et al., 2019a); but the decoder does not learn to use this information for structural generalization. We find further that the decoder does not even learn to generalize semantically when the input is enriched with syntactic structure. Finally, it is not merely because the COGS tasks require the mapping of language into symbolic represen-