independently into one output token. We found this
to lead to better generalisation than using an LSTM-
based decoder. The simple decoder also makes it
easy to integrate grammar constraints to ensure
well-formed outputs, e.g. in semantic parsing.
Most seq2seq models tend to predict the end of
the sequence too early on instances that are longer
than seen in training (Newman et al., 2020). Our
model overcomes this by explicitly predicting the
length of the output sequence as a sum of fertilities.
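As a rough sketch (our own illustration, not the model's actual code), the predicted length can be obtained as the sum of expected fertilities over the input tokens; the function name, tensor shapes and use of PyTorch below are our assumptions:

# Minimal sketch: predicted output length as the sum of per-token expected
# fertilities, instead of relying on an end-of-sequence token. Names and shapes
# are illustrative only.
import torch

def expected_length(fertility_logits: torch.Tensor) -> torch.Tensor:
    # fertility_logits: [batch, n_input_tokens, max_fertility + 1]
    probs = fertility_logits.softmax(dim=-1)               # P(fertility = k) per input token
    ks = torch.arange(probs.shape[-1], dtype=probs.dtype)  # fertility values 0 .. max_fertility
    expected_fertility = (probs * ks).sum(dim=-1)          # [batch, n_input_tokens]
    return expected_fertility.sum(dim=-1)                  # expected |y| per sequence

lengths = expected_length(torch.randn(2, 5, 4))  # batch of 2, 5 input tokens, fertilities 0..3
print(lengths)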
We first demonstrate the expressive capacity and
inductive bias of our model on synthetic data. We
then evaluate on three English semantic parsing
datasets (Geoquery, ATIS, Okapi) and a style trans-
fer task (Lyu et al., 2021). We clearly outperform
seq2seq models both with and without pretrain-
ing in structural generalisation setups, particularly
when the model has to generalise to longer exam-
ples. Our model also compares favourably with
existing approaches that target structural generali-
sation, and we obtain state-of-the-art results on the
structural style transfer tasks.
To summarise, our main contributions are:
• an efficient differentiable fertility layer;
• a flexible end-to-end differentiable model that composes two structural operations (fertility and reordering) and achieves strong performance in structural generalisation tasks.
2 Related Work
Fertility
The concept of fertility was introduced
by Brown et al. (1990) for statistical machine trans-
lation to capture that a word in one language is
often consistently translated to a certain number of
words in another language. Tu et al. (2016) and
Malaviya et al. (2018) incorporate fertility into the
attention mechanism of seq2seq models. Cohn et al. (2016) and Gu et al. (2016) use heuristic supervision
for training their fertility models. In contrast to
prior work, we learn an explicit fertility component
jointly with the rest of our model.
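To make the notion concrete, the toy snippet below (our own illustration, not code from any of the cited systems) expands each source token according to an integer fertility:

# Toy illustration of fertility: each source token occupies as many target-side
# slots as its fertility prescribes (0 deletes it, 2 duplicates it, and so on).
def apply_fertility(tokens, fertilities):
    assert len(tokens) == len(fertilities)
    expanded = []
    for tok, f in zip(tokens, fertilities):
        expanded.extend([tok] * f)
    return expanded

print(apply_fertility(["not", "cheap"], [2, 1]))
# ['not', 'not', 'cheap'] -- e.g. English "not" with fertility 2, as when it maps to French "ne ... pas"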
Monotonic alignment
Related to fertility is the concept of monotonic alignment, i.e. an alignment $a$ that maps output positions to input positions such that for any two output positions $i < j$, $a(i) \leq a(j)$. Monotonic alignments are usually modelled by an HMM-like model that places the monotonicity constraint on the transition matrix (Yu et al., 2016; Wu and Cotterell, 2019), leading to a runtime of $O(|x|^2 |y|)$ with $x$ being the input and $y$ the output. Raffel et al. (2017) parameterise the alignment using a series of Bernoulli trials and obtain a training runtime of $O(|x| |y|)$. Our approach also has $O(|x| |y|)$ runtime.
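For concreteness, the monotonicity condition can be stated directly on an alignment represented as a list that maps each output position to an input position (our own illustration):

# a[i] is the input position aligned to output position i;
# monotonicity requires a[i] <= a[j] whenever i < j, i.e. a is non-decreasing.
def is_monotonic(a):
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))

print(is_monotonic([0, 0, 1, 3, 3]))  # True: the alignment never moves back over the input
print(is_monotonic([0, 2, 1]))        # False: it jumps back from input position 2 to 1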
Compositional generalisation
There is a grow-
ing body of work on improving the ability of neural
models to generalise compositionally in semantic
parsing. Good progress has been made in terms of
generalisation to new lexical items (Andreas, 2020; Akyürek and Andreas, 2021; Conklin et al., 2021; Csordás et al., 2021; Ontañón et al., 2022) but structural generalisation remains very challenging (Oren et al., 2020; Bogin et al., 2022).
Herzig and Berant (2021) use a neural chart
parser and induce latent trees with an approach
similar to hard EM. Their model assumes that one
input token corresponds to a single leaf in the tree.
Zheng and Lapata (2022) re-encode the input and
partially generated output with a transformer for ev-
ery decoding step to reduce entanglement and show
considerable gains in structural generalisation.
There has also been work inspired by quasi-synchronous context-free grammars (QCFGs; Smith and Eisner, 2006). Shaw et al. (2021) heuristically induce a
QCFG and create an ensemble of a QCFG-based
parser and a seq2seq model. Qiu et al. (2021) use
similar QCFGs for data augmentation for a seq2seq
model. Our approach does not require constructing
a grammar. Finally, Kim (2021) introduces neural
QCFGs which perform well on compositional gen-
eralisation tasks but are very compute-intensive.
Closest to our work is that of Wang et al. (2021), who reorder phrases and use a monotonic atten-
tion mechanism on top. Our approach differs from
theirs in several important aspects: (i) we use fer-
tility instead of monotonic attention, which param-
eterises alignments differently; (ii) we apply the
fertility step first and then reorder the phrases, so
our model can directly create alignments where out-
put tokens aligned to the same input token do not
have to be consecutive; (iii) we predict the length
as the sum of the fertilities and not with an end-
of-sequence token; (iv) they use an LSTM-based
decoder network whereas we found that a simpler
decoder can generalise better.
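As a toy illustration of point (ii) (our own sketch, not the paper's implementation), applying fertility first and a permutation second lets the two copies of a duplicated input token land in non-adjacent output positions:

# Fertility first, then reordering: the copies of "A" need not stay consecutive.
def apply_fertility(tokens, fertilities):
    expanded = []
    for tok, f in zip(tokens, fertilities):
        expanded.extend([tok] * f)
    return expanded

expanded = apply_fertility(["A", "B"], [2, 1])   # ['A', 'A', 'B']
permutation = [0, 2, 1]                          # output slot -> position in the expanded sequence
print([expanded[p] for p in permutation])        # ['A', 'B', 'A']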
3 Background
3.1 Structured Attention
We often want to model the relationship between
input $x$ of length $n$ and output $y$ of length $l$ by