independently into one output token. We found this
to lead to better generalisation than using an LSTM-
based decoder. The simple decoder also makes it
easy to integrate grammar constraints to ensure
well-formed outputs, e.g. in semantic parsing.
Most seq2seq models tend to predict the end of
the sequence too early on instances that are longer
than seen in training (Newman et al., 2020). Our
model overcomes this by explicitly predicting the
length of the output sequence as a sum of fertilities.
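As a rough sketch (our own illustration, not the model's actual code), the predicted length can be obtained as the sum of expected fertilities over the input tokens; the function name, tensor shapes and use of PyTorch below are our assumptions:

# Minimal sketch: predicted output length as the sum of per-token expected
# fertilities, instead of relying on an end-of-sequence token. Names and shapes
# are illustrative only.
import torch

def expected_length(fertility_logits: torch.Tensor) -> torch.Tensor:
    # fertility_logits: [batch, n_input_tokens, max_fertility + 1]
    probs = fertility_logits.softmax(dim=-1)               # P(fertility = k) per input token
    ks = torch.arange(probs.shape[-1], dtype=probs.dtype)  # fertility values 0 .. max_fertility
    expected_fertility = (probs * ks).sum(dim=-1)          # [batch, n_input_tokens]
    return expected_fertility.sum(dim=-1)                  # expected |y| per sequence

lengths = expected_length(torch.randn(2, 5, 4))  # batch of 2, 5 input tokens, fertilities 0..3
print(lengths)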
We first demonstrate the expressive capacity and
inductive bias of our model on synthetic data. We
then evaluate on three English semantic parsing
datasets (Geoquery, ATIS, Okapi) and a style trans-
fer task (Lyu et al., 2021). We clearly outperform
seq2seq models both with and without pretrain-
ing in structural generalisation setups, particularly
when the model has to generalise to longer exam-
ples. Our model also compares favourably with
existing approaches that target structural generali-
sation, and we obtain state-of-the-art results on the
structural style transfer tasks.
To summarise, our main contributions are:
• an efficient differentiable fertility layer;
• a flexible end-to-end differentiable model that composes two structural operations (fertility and reordering) and achieves strong performance in structural generalisation tasks.
2 Related Work
Fertility
The concept of fertility was introduced
by Brown et al. (1990) for statistical machine trans-
lation to capture that a word in one language is
often consistently translated to a certain number of
words in another language. Tu et al. (2016) and
Malaviya et al. (2018) incorporate fertility into the
attention mechanism of seq2seq models. Cohn et al. (2016) and Gu et al. (2016) use heuristic supervision
for training their fertility models. In contrast to
prior work, we learn an explicit fertility component
jointly with the rest of our model.
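To make the notion concrete, the toy snippet below (our own illustration, not code from any of the cited systems) expands each source token according to an integer fertility:

# Toy illustration of fertility: each source token occupies as many target-side
# slots as its fertility prescribes (0 deletes it, 2 duplicates it, and so on).
def apply_fertility(tokens, fertilities):
    assert len(tokens) == len(fertilities)
    expanded = []
    for tok, f in zip(tokens, fertilities):
        expanded.extend([tok] * f)
    return expanded

print(apply_fertility(["not", "cheap"], [2, 1]))
# ['not', 'not', 'cheap'] -- e.g. English "not" with fertility 2, as when it maps to French "ne ... pas"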
Monotonic alignment
Related to fertility is the concept of monotonic alignment, i.e. an alignment $a$ that maps output positions to input positions such that for any two output positions $i < j$, $a(i) \leq a(j)$. Monotonic alignments are usually modelled by an HMM-like model that places the monotonicity constraint on the transition matrix (Yu et al., 2016; Wu and Cotterell, 2019), leading to a runtime of $O(|x|^2 |y|)$ with $x$ being the input and $y$ the output. Raffel et al. (2017) parameterise the alignment using a series of Bernoulli trials and obtain a training runtime of $O(|x| |y|)$. Our approach also has $O(|x| |y|)$ runtime.
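For concreteness, the monotonicity condition can be stated directly on an alignment represented as a list that maps each output position to an input position (our own illustration):

# a[i] is the input position aligned to output position i;
# monotonicity requires a[i] <= a[j] whenever i < j, i.e. a is non-decreasing.
def is_monotonic(a):
    return all(a[i] <= a[i + 1] for i in range(len(a) - 1))

print(is_monotonic([0, 0, 1, 3, 3]))  # True: the alignment never moves back over the input
print(is_monotonic([0, 2, 1]))        # False: it jumps back from input position 2 to 1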
Compositional generalisation
There is a grow-
ing body of work on improving the ability of neural
models to generalise compositionally in semantic
parsing. Good progress has been made in terms of
generalisation to new lexical items (Andreas, 2020; Akyürek and Andreas, 2021; Conklin et al., 2021; Csordás et al., 2021; Ontañón et al., 2022) but structural generalisation remains very challenging (Oren et al., 2020; Bogin et al., 2022).
Herzig and Berant (2021) use a neural chart
parser and induce latent trees with an approach
similar to hard EM. Their model assumes that one
input token corresponds to a single leaf in the tree.
Zheng and Lapata (2022) re-encode the input and
partially generated output with a transformer for ev-
ery decoding step to reduce entanglement and show
considerable gains in structural generalisation.
There has also been work inspired by quasi-synchronous context-free grammars (QCFGs; Smith and Eisner, 2006). Shaw et al. (2021) heuristically induce a
QCFG and create an ensemble of a QCFG-based
parser and a seq2seq model. Qiu et al. (2021) use
similar QCFGs for data augmentation for a seq2seq
model. Our approach does not require constructing
a grammar. Finally, Kim (2021) introduces neural
QCFGs which perform well on compositional gen-
eralisation tasks but are very compute-intensive.
Closest to our work is that of Wang et al. (2021), who reorder phrases and use a monotonic atten-
tion mechanism on top. Our approach differs from
theirs in several important aspects: (i) we use fer-
tility instead of monotonic attention, which param-
eterises alignments differently; (ii) we apply the
fertility step first and then reorder the phrases, so
our model can directly create alignments where out-
put tokens aligned to the same input token do not
have to be consecutive; (iii) we predict the length
as the sum of the fertilities and not with an end-
of-sequence token; (iv) they use an LSTM-based
decoder network whereas we found that a simpler
decoder can generalise better.
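As a toy illustration of point (ii) (our own sketch, not the paper's implementation), applying fertility first and a permutation second lets the two copies of a duplicated input token land in non-adjacent output positions:

# Fertility first, then reordering: the copies of "A" need not stay consecutive.
def apply_fertility(tokens, fertilities):
    expanded = []
    for tok, f in zip(tokens, fertilities):
        expanded.extend([tok] * f)
    return expanded

expanded = apply_fertility(["A", "B"], [2, 1])   # ['A', 'A', 'B']
permutation = [0, 2, 1]                          # output slot -> position in the expanded sequence
print([expanded[p] for p in permutation])        # ['A', 'B', 'A']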
3 Background
3.1 Structured Attention
We often want to model the relationship between
input $x$ of length $n$ and output $y$ of length $l$ by