Compositional Generalisation
with Structured Reordering and Fertility Layers
Matthias Lindemann¹ and Alexander Koller² and Ivan Titov¹,³
¹ILCC, University of Edinburgh, ²LST, Saarland University, ³ILLC, University of Amsterdam
m.m.lindemann@sms.ed.ac.uk, koller@coli.uni-saarland.de, ititov@inf.ed.ac.uk
Abstract
Seq2seq models have been shown to struggle with compositional generalisation, i.e. generalising to new and potentially more complex structures than seen during training. Taking inspiration from grammar-based models that excel at compositional generalisation, we present a flexible end-to-end differentiable neural model that composes two structural operations: a fertility step, which we introduce in this work, and a reordering step based on previous work (Wang et al., 2021). To ensure differentiability, we use the expected value of each step, which we compute using dynamic programming. Our model outperforms seq2seq models by a wide margin on challenging compositional splits of realistic semantic parsing tasks that require generalisation to longer examples. It also compares favourably to other models targeting compositional generalisation.¹
1 Introduction

Many NLP tasks require translating an input object such as a sentence into a structured output object such as a semantic parse. Recently, these tasks have been approached with seq2seq models with great success. However, seq2seq models have also been shown to struggle out-of-distribution on compositional generalisation (Lake and Baroni, 2018; Finegan-Dollak et al., 2018; Kim and Linzen, 2020; Hupkes et al., 2020), i.e. the model fails on examples that contain unseen compositions or deeper recursion of phenomena that it handles correctly in isolation.
Consider the example in Fig. 1. Arguably, any model that produces the given semantic parse from the input in a generalisable way has to capture the correspondence between fragments of the input and fragments of the output, at least implicitly. This is challenging because of structural mismatches between input and output. For example, the fragment contributed by "flights" is discontinuous and intertwined with the rest of the semantic parse.

¹ https://github.com/namednil/f-then-r

Figure 1: We model structural seq2seq tasks as the composition of differentiable fertility and phrase reordering layers. The model is trained end-to-end without direct supervision of the two structural layers.
In contrast to seq2seq models, grammar-based models such as synchronous context-free grammars (SCFGs) (Lewis and Stearns, 1968; Chiang, 2007) explicitly capture the compositional process behind the data and therefore perform very well in compositional generalisation setups. They can also model common structural mismatches like the one shown in the example. However, grammar-based models are rigid and brittle and thus do not scale well.
In this work, we take inspiration from grammar-based models and present an end-to-end differentiable neural model that is both flexible and generalises well compositionally. Our model consists of two structural layers: a phrase reordering layer originally introduced by Wang et al. (2021) and a fertility layer, new in this work, which creates zero or more copies of the representation of any input token. We show how to compose these layers to achieve strong generalisation.
We use a simple decoder and essentially translate each token after the fertility and reordering layers independently into one output token. We found this to lead to better generalisation than using an LSTM-based decoder. The simple decoder also makes it easy to integrate grammar constraints to ensure well-formed outputs, e.g. in semantic parsing.
Most seq2seq models tend to predict the end of the sequence too early on instances that are longer than those seen in training (Newman et al., 2020). Our model overcomes this by explicitly predicting the length of the output sequence as a sum of fertilities.
We first demonstrate the expressive capacity and inductive bias of our model on synthetic data. We then evaluate on three English semantic parsing datasets (Geoquery, ATIS, Okapi) and a style transfer task (Lyu et al., 2021). We clearly outperform seq2seq models both with and without pretraining in structural generalisation setups, particularly when the model has to generalise to longer examples. Our model also compares favourably with existing approaches that target structural generalisation, and we obtain state-of-the-art results on the structural style transfer tasks.
To summarise, our main contributions are:

• an efficient differentiable fertility layer;

• a flexible end-to-end differentiable model that composes two structural operations (fertility and reordering) and achieves strong performance in structural generalisation tasks.
2 Related Work

Fertility  The concept of fertility was introduced by Brown et al. (1990) for statistical machine translation to capture that a word in one language is often consistently translated to a certain number of words in another language. Tu et al. (2016) and Malaviya et al. (2018) incorporate fertility into the attention mechanism of seq2seq models. Cohn et al. (2016); Gu et al. (2016) use heuristic supervision for training their fertility models. In contrast to prior work, we learn an explicit fertility component jointly with the rest of our model.
Monotonic alignment  Related to fertility is the concept of monotonic alignment, i.e. an alignment $a$ that maps output positions to input positions such that for any two output positions $i < j$, $a(i) \le a(j)$. Monotonic alignments are usually modelled by an HMM-like model that places the monotonicity constraint on the transition matrix (Yu et al., 2016; Wu and Cotterell, 2019), leading to a runtime of $O(|x|^2 |y|)$ with $x$ being the input and $y$ the output. Raffel et al. (2017) parameterise the alignment using a series of Bernoulli trials and obtain a training runtime of $O(|x|\,|y|)$. Our approach also has $O(|x|\,|y|)$ runtime.
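As a concrete illustration of this definition (our example, not from the paper), monotonicity of an alignment can be checked directly; since $\le$ is transitive, checking adjacent output positions suffices:

```python
def is_monotonic(a: list[int]) -> bool:
    """An alignment a maps output positions to input positions; it is
    monotonic iff a(i) <= a(j) for all output positions i < j."""
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))

# Output positions 0..3 aligned to input positions 0, 0, 1, 2:
assert is_monotonic([0, 0, 1, 2])       # many-to-one is allowed
assert not is_monotonic([0, 2, 1, 3])   # position 1 aligns past position 2
```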
Compositional generalisation  There is a growing body of work on improving the ability of neural models to generalise compositionally in semantic parsing. Good progress has been made in terms of generalisation to new lexical items (Andreas, 2020; Akyürek and Andreas, 2021; Conklin et al., 2021; Csordás et al., 2021; Ontañón et al., 2022) but structural generalisation remains very challenging (Oren et al., 2020; Bogin et al., 2022).

Herzig and Berant (2021) use a neural chart parser and induce latent trees with an approach similar to hard EM. Their model assumes that one input token corresponds to a single leaf in the tree. Zheng and Lapata (2022) re-encode the input and partially generated output with a transformer for every decoding step to reduce entanglement and show considerable gains in structural generalisation.

There has also been work inspired by quasi-synchronous grammars (QCFGs; Smith and Eisner, 2006). Shaw et al. (2021) heuristically induce a QCFG and create an ensemble of a QCFG-based parser and a seq2seq model. Qiu et al. (2021) use similar QCFGs for data augmentation for a seq2seq model. Our approach does not require constructing a grammar. Finally, Kim (2021) introduces neural QCFGs, which perform well on compositional generalisation tasks but are very compute-intensive.
Closest to our work is that of Wang et al. (2021), who reorder phrases and use a monotonic attention mechanism on top. Our approach differs from theirs in several important aspects: (i) we use fertility instead of monotonic attention, which parameterises alignments differently; (ii) we apply the fertility step first and then reorder the phrases, so our model can directly create alignments where output tokens aligned to the same input token do not have to be consecutive; (iii) we predict the length as the sum of the fertilities and not with an end-of-sequence token; (iv) they use an LSTM-based decoder network whereas we found that a simpler decoder can generalise better.
3 Background

3.1 Structured Attention

We often want to model the relationship between input $x$ of length $n$ and output $y$ of length $l$ by means of a latent variable $Z$. In particular, we assume that $Z \in \mathcal{Z} \subseteq \mathbb{B}^{n \times l}$ is a boolean alignment matrix. We want to predict $y$ from a representation produced by a function $g(Z, x)$. In this case, the marginal distribution

$$P(y \mid x) = \mathbb{E}_{P(Z \mid x)}\, P(y \mid g(Z, x)) \qquad (1)$$

can be intractable to compute because $\mathcal{Z}$ often has exponential size in $x$. In some important cases, however, we can formulate a similar but tractable model by using structured attention (Kim et al., 2017), which 'pushes' the expectation inside the model:

$$P(y \mid x) \approx P(y \mid g(\tilde{Z}, x))$$

where $\tilde{Z} = \mathbb{E}_{P(Z \mid x)}[Z]$ is now a 'soft' rather than a boolean matrix. Note that $\tilde{Z}_{ij} = P(Z_{ij} = 1 \mid x)$ is a marginal probability that can often be efficiently computed with a dynamic programme if $P(Z \mid x)$ is factorisable. We will use such an approach for our model.
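To make this concrete, here is a minimal numpy sketch (ours, not the paper's code) in which $g$ simply pools the encoder representations of $x$ under the alignment; substituting the marginal matrix $\tilde{Z}$ for a hard $Z$ turns the same computation into differentiable structured attention:

```python
import numpy as np

n, l, d = 4, 3, 8                 # input length, output length, hidden size
H = np.random.randn(n, d)         # hypothetical encoder representations of x

def g(Z, H):
    """One representation per output position: row j is the (expected)
    representation of the input token(s) aligned to output position j."""
    return Z.T @ H                # (l, d)

# Hard alignment: Z[i, j] = 1 iff input token i aligns to output position j.
Z = np.zeros((n, l))
Z[0, 0] = Z[2, 1] = Z[3, 2] = 1.0
hard = g(Z, H)                    # selects rows 0, 2, 3 of H

# Soft alignment: marginals Z~[i, j] = P(Z_ij = 1 | x); here a toy
# column-stochastic matrix standing in for the output of a dynamic programme.
Z_soft = np.random.rand(n, l)
Z_soft /= Z_soft.sum(axis=0, keepdims=True)
soft = g(Z_soft, H)               # expected representations, differentiable in Z~
```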
3.2 Marginal Permutations

In this section, we briefly review the method of Wang et al. (2021) that we use in our model.

Wang et al. (2021) build on bracketing transduction grammars (Wu, 1997) and show how to compute a distribution over separable permutations (Bose et al., 1998). A permutation is separable iff it can be represented as a permutation tree; some permutations are not separable. The internal nodes of a permutation tree are labelled as $\triangleright$ (straight) or $\triangle$ (inverted) and are interpreted as operations: $\triangleright$ concatenates the values it receives from its left child with the value from its right child, whereas $\triangle$ concatenates them in reverse order. For example, the permutation tree $t = (\triangle\,(\triangleright\,a\ b)\ (\triangle\,c\ d))$ represents the permutation $abcd \mapsto dcab$. Let $R_t(i, j) = 1$ if the permutation described by tree $t$ maps position $i$ to position $j$, and $R_t(i, j) = 0$ otherwise.

Wang et al. (2021) show how to compute the expected permutation matrix $\tilde{R}_{i,j} := \mathbb{E}_{P(t \mid x)} R_t(i, j)$ in polynomial time with a CYK-style algorithm if $P(t \mid x)$ factors according to the CYK chart. Crucially, the expected permutation can also be interpreted as a distribution over alignments, where $\tilde{R}_{i,j}$ is the marginal probability that position $i$ aligns to position $j$. We will use this alignment distribution as a building block in our model.
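As a sanity check of the example above, the following sketch (ours, not code from Wang et al. (2021)) evaluates a permutation tree bottom-up; an inverted node concatenates its children's yields in reverse order, reproducing $abcd \mapsto dcab$:

```python
# A permutation tree is a leaf (a token) or a tuple (op, left, right)
# with op in {"straight", "inverted"}.
def evaluate(tree):
    if not isinstance(tree, tuple):
        return [tree]                                # leaf
    op, left, right = tree
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "straight" else r + l     # inverted reverses the order

# t = inverted( straight(a, b), inverted(c, d) )
t = ("inverted", ("straight", "a", "b"), ("inverted", "c", "d"))
assert evaluate(t) == ["d", "c", "a", "b"]          # abcd -> dcab
```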
4 Overview of the Approach

In this section, we give a general overview of our approach and defer the details to Sections 5 and 6. Conceptually, we want to model the transduction from input $x$ of length $n$ into output $y$ of length $l$ as the composition of two edit operations. First, we apply a fertility step, in which we decide for each token what its fertility is, i.e. how many copies we make of it. Assigning a fertility of $0$ corresponds to deleting the token. This step yields an intermediate sequence of tokens. We then reorder them using permutation trees (see Section 3.2). In the last step, we individually translate these tokens into the output tokens. Fig. 1 shows an example where the fertility step and reordering apply to vector representations of tokens.
The fertility step and the reordering can be represented as boolean matrices $F \in \mathbb{B}^{n \times l}$ and $R \in \mathbb{B}^{l \times l}$, respectively. These matrices denote alignments between the sequences before and after the operation. For example, $F_{i,j} = 1$ means that input token $i$ aligns to intermediate token $j$, i.e. $j$ is one of the copies of $i$.
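As an illustration of this view (a sketch under our own toy setup, not the paper's implementation), both steps become matrix products on token representations: $F^\top$ copies or deletes rows, and $R^\top$ reorders the intermediate sequence. In the soft model described below, the boolean matrices are simply replaced by their expectations $\tilde{F}$ and $\tilde{R}$, and the same products go through unchanged:

```python
import numpy as np

tokens = ["list", "the", "flights"]
H = np.eye(3)                       # toy one-hot 'representations', one row per token

fertility = [1, 0, 2]               # copy "list" once, delete "the", copy "flights" twice
n, l = len(tokens), sum(fertility)

# F[i, j] = 1 iff intermediate position j is a copy of input token i.
F = np.zeros((n, l))
j = 0
for i, f in enumerate(fertility):
    for _ in range(f):
        F[i, j] = 1.0
        j += 1

# R[i, j] = 1 iff intermediate position i moves to output position j.
perm = [2, 0, 1]                    # any permutation of length 3 is separable
R = np.zeros((l, l))
for i, target in enumerate(perm):
    R[i, target] = 1.0

intermediate = F.T @ H              # rows: "list", "flights", "flights"
reordered = R.T @ intermediate      # rows: "flights", "flights", "list"
```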
With this conceptualisation, we would ideally use the following probabilistic model $P(y \mid x)$:

$$P(y \mid x) = \underbrace{\mathbb{E}_{P(F \mid x)}}_{\text{Fertility}}\ \underbrace{\mathbb{E}_{P(R \mid x, F)}}_{\text{Reordering}}\ \underbrace{P(y \mid x, F, R)}_{\text{Decoder}}$$
At training time, the true fertility values are unknown but we observe the length $l$ of $y$, so we condition on it:

$$P(y \mid x) = P(l \mid x) \cdot \mathbb{E}_{P(F, R \mid x, l)}\, P(y \mid x, F, R) \qquad (2)$$

where $P(l \mid x)$ can be computed with dynamic programming relying on the fertility model, as we explain in the next section.
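The length model admits a simple dynamic programme: since $l$ is the sum of the per-token fertilities, $P(l \mid x)$ is the convolution of the fertility distributions. Below is a minimal sketch of this idea (our illustration; the paper's exact formulation follows in its next section):

```python
import numpy as np

def length_distribution(fert_probs: np.ndarray) -> np.ndarray:
    """fert_probs[i, k] = P(fertility of token i is k), for k = 0..K-1.
    Returns the vector P(l | x) for l = 0..n*(K-1) by convolving the
    per-token fertility distributions one token at a time."""
    n, K = fert_probs.shape
    dist = np.zeros(n * (K - 1) + 1)
    dist[0] = 1.0                        # zero tokens processed: length 0
    for i in range(n):
        new = np.zeros_like(dist)
        for k in range(K):               # token i contributes k copies
            new[k:] += fert_probs[i, k] * dist[:len(dist) - k]
        dist = new
    return dist

# Two tokens, each with fertility 0 or 1, each with probability 1/2:
p = length_distribution(np.array([[0.5, 0.5], [0.5, 0.5]]))
assert np.allclose(p, [0.25, 0.5, 0.25])
```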
Computing the marginal likelihood and the gradients is intractable. Instead of computing the likelihood exactly, one could sample and use a score function estimator (Williams, 1992), but the resulting gradient estimates have high variance.
Instead, we use structured attention as discussed in Section 3.1 and 'push' the expectations inside the model:

$$\mathbb{E}_{P(F, R \mid x, l)}\, P(y \mid x, F, R) \approx P(y \mid x, \tilde{F}, \tilde{R}) \qquad (3)$$

with $\tilde{F} = \mathbb{E}_{P(F \mid x, l)} F$ and $\tilde{R} = \mathbb{E}_{P(R \mid x, \tilde{F})} R$. $\tilde{F}$ and $\tilde{R}$ now represent 'soft', differentiable versions of fertility and reordering. They approach their