thoroughly integrates its sequence-level context in-
formation, and a decoder refers to such contextu-
alized representations for generating a translation
sequence. A key reason of failure on composi-
tional generalization is that the correspondence
between pairs of token sequences is modeled as
a whole. Specifically, NMT models are trained
end-to-end over large parallel data without disen-
tangling the representation of individual words or
phrases from that of whole token sequences. At the
sequence level, the source input sample space is
highly sparse mainly due to semantic composition,
and small changes to a sentence can lead to out-of-
distribution issues (Sagawa et al., 2020; Conklin
et al., 2021; Liu et al., 2021).
Intuitively, one way to solve this problem is to
decouple token-level information from the source
sequence by injecting token-level translation
distributions (e.g., P(petit|small)) into the source
representation. Given that the source-side
contextualized representations encode rich token-
level translation information (Kasai et al., 2021;
Xu et al., 2021a), we categorize sparse token-level
contextualized source representations into a few
representative prototypes over training instances,
and make use of them to enrich source encoding.
In this way, when encoding a sequence, the model
observes the less sparse prototypes of each token,
which alleviates excessive memorization of
sequence-level information.
We propose a two-stage framework to train
prototype-based Transformer models (Proto-
Transformer). In the first stage, we warm up an
initial Transformer model which can generate rea-
sonable representations. In the second stage, for
each token, we run the trained model to extract all
contextualized representations over the training cor-
pus. Then, we perform clustering (e.g., K-Means)
to obtain the prototype representations for each to-
ken. Take Figure 2as an example, for the token
“Toy”, we collect all the contextualized representa-
tions and cluster them into 3 prototypes. Finally,
we extend the base model by fusing the prototype
information back into the encoding process through
a prototype-attention module, and continue to train
the whole model until convergence.
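As a rough illustration of this second stage, the sketch below collects per-token contextualized encoder states over a corpus, clusters them with K-Means into prototype vectors, and fuses them back through a small prototype-attention layer. All identifiers (build_prototypes, PrototypeAttention, K) and design details such as the gating layer are illustrative assumptions rather than the released implementation.

    # A minimal sketch of the second stage, assuming a warmed-up PyTorch
    # Transformer encoder. Names (build_prototypes, PrototypeAttention, K)
    # and the gating design are assumptions, not the paper's released code.
    import torch
    import torch.nn as nn
    from collections import defaultdict
    from sklearn.cluster import KMeans

    K = 3  # assumed number of prototypes per token

    def build_prototypes(encoder, corpus_loader, device="cuda"):
        """Collect contextualized states per token type and cluster them."""
        states = defaultdict(list)
        encoder.eval()
        with torch.no_grad():
            for src_tokens in corpus_loader:               # [batch, seq_len]
                hidden = encoder(src_tokens.to(device))    # [batch, seq_len, d]
                for sent, vecs in zip(src_tokens.tolist(), hidden.cpu()):
                    for tok, vec in zip(sent, vecs):
                        states[tok].append(vec.numpy())
        prototypes = {}
        for tok, vecs in states.items():
            km = KMeans(n_clusters=min(K, len(vecs)), n_init=10).fit(vecs)
            prototypes[tok] = torch.tensor(km.cluster_centers_, dtype=torch.float)
        return prototypes  # {token_id: [<=K, d]}

    class PrototypeAttention(nn.Module):
        """Fuse each encoder state with the prototypes of its source token."""
        def __init__(self, d_model):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads=1,
                                              batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)

        def forward(self, enc_state, token_prototypes):
            # enc_state:        [batch, seq_len, d]
            # token_prototypes: [batch, seq_len, K, d], gathered per source token
            b, s, k, d = token_prototypes.shape
            q = enc_state.reshape(b * s, 1, d)         # each position queries
            kv = token_prototypes.reshape(b * s, k, d) # its own K prototypes
            ctx, _ = self.attn(q, kv, kv)
            ctx = ctx.reshape(b, s, d)
            return self.gate(torch.cat([enc_state, ctx], dim=-1))

Under these assumptions, the prototypes gathered per source token would be kept fixed while the extended model continues training to convergence, mirroring the continued training described above.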
Experimental results on CoGnition show that our
method significantly improves novel composition
translation, with over 24% error reduction, demon-
strating its effectiveness in tackling the composi-
tional generalization problem. To further verify the
effectiveness on more datasets, we conduct experi-
ments on 10 commonly used MT benchmarks and
our method yields consistent BLEU improvements.
We also present an empirical analysis of prototypes
and a quantitative analysis of compositional gener-
alization. The comparison between the one-pass
and the two-pass training procedure shows that the
one-pass method is both faster and more accurate
than the two-pass one, demonstrating that more
generalizable prototypes extracted from the early
training phase are more beneficial to compositional
generalization. Additionally, quantitative analysis
demonstrates that our proposed model is better at
handling longer compounds and more difficult com-
position patterns. The code is publicly available at
https://github.com/ARIES-LM/CatMT4CG.git.
2 Related Work
Compositional Generalization
Recent work
(Lake and Baroni, 2018; Keysers et al., 2020b)
has demonstrated weak compositionality of neural
models using dedicated datasets. Various methods
have been proposed to address compositional
generalization, such as encoding more induc-
tive bias (Li et al., 2019; Korrel et al., 2019; Baan
et al., 2019; Chen et al., 2020a; Gordon et al., 2020;
Herzig and Berant, 2021), meta-learning (Lake,
2019; Conklin et al., 2021), and data augmentation
(Andreas, 2020; Akyürek et al., 2021). Recently,
Ontañón et al. (2021) and Csordás et al. (2021)
show that the Transformer architecture can per-
form better on compositional generalization with
some modifications. Although these methods have
demonstrated better generalization or interpretabil-
ity, most of them are limited to semantic parsing
datasets with small vocabularies and limited samples. In the
context of machine translation, Lake and Baroni
(2018) construct a small dataset where the training
data contains a word daxy along with its parallel
sentences of a single pattern (e.g., I am daxy, je
suis daxist) while the test set contains novel pat-
terns (e.g., He is daxy). However, the experiment
is limited in that the test set only consists of 8 sam-
ples. Different from existing work, Li et al. (2021)
propose a large dataset (CoGnition) and construct a
large-scale test set that contains newly constructed
constituents as novel compounds, so that general-
ization ability can be evaluated directly based on
compound translation error rate. We propose a
method for enhancing compositional generalization on
the dedicated dataset of Li et al. (2021), while at