
Composition, Attention, or Both?∗
Ryo Yoshida and Yohei Oseki
The University of Tokyo
{yoshiryo0617, oseki}@g.ecc.u-tokyo.ac.jp

∗While writing this paper, we noticed that Sartran et al. (2022) was submitted to arXiv, proposing Transformer Grammars (TGs) that incorporate recursive syntactic composition via an attention mask. Their work and ours are similar in spirit, but differ in how a vector representation of subtrees is obtained, making them complementary. We discuss the details in Section 5.
Abstract
In this paper, we propose a novel architecture called Composition Attention Grammars (CAGs) that recursively compose subtrees into a single vector representation with a composition function, and selectively attend to previous structural information with a self-attention mechanism. We investigate whether these components—the composition function and the self-attention mechanism—can both induce human-like syntactic generalization. Specifically, we train language models (LMs) with and without these two components, with the model sizes carefully controlled, and evaluate their syntactic generalization performance against six test circuits on the SyntaxGym benchmark. The results demonstrated that the composition function and the self-attention mechanism both play an important role in making LMs more human-like, and closer inspection of grammatical phenomena implied that the composition function allowed syntactic features, but not semantic features, to percolate into subtree representations.
1 Introduction
Recently, language models (LMs) trained on large datasets have achieved remarkable success in various Natural Language Processing (NLP) tasks (cf. Wang et al., 2019a,b). The literature on targeted syntactic evaluation has shown that these models implicitly learn the syntactic structures of natural language, even though they do not receive explicit syntactic supervision (Warstadt et al., 2020; Hu et al., 2020).
However, previous work has also shown that there is still a benefit for LMs to receive explicit syntactic supervision. Recurrent Neural Network Grammars (RNNGs; Dyer et al., 2016), the integration of Recurrent Neural Networks (RNNs; Elman, 1990) with an explicit syntactic bias, have achieved better syntactic generalization performance than vanilla RNNs (Kuncoro et al., 2018; Wilcox et al., 2019; Hu et al., 2020). In addition, previous work has recommended RNNGs as a cognitively plausible architecture, showing that RNNGs can successfully predict human reading times (Yoshida et al., 2021) or brain activities (Hale et al., 2018). The key difference between RNNGs and RNNs is a composition function, which recursively composes subtrees into a single vector representation.
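To make the notion of a composition function concrete, the sketch below shows one way such a function can be realized, following the bidirectional-LSTM composition of Dyer et al. (2016): the embeddings of a completed subtree's nonterminal label and children are read in both directions and reduced to a single vector. The class name, dimensionality, and output projection here are illustrative assumptions, not the parameterization used in this paper.

```python
# Minimal sketch of an RNNG-style composition function (Dyer et al., 2016).
# Concrete choices (bidirectional LSTM, dimensionality, tanh projection) are
# illustrative assumptions, not the exact parameterization of CAGs.
import torch
import torch.nn as nn


class SubtreeComposer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Read the label and children forwards and backwards, then project back to `dim`.
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nt_embedding: torch.Tensor, child_vectors: torch.Tensor) -> torch.Tensor:
        # nt_embedding: (dim,) embedding of the nonterminal label, e.g. NP
        # child_vectors: (n_children, dim) vectors of the completed children
        seq = torch.cat([nt_embedding.unsqueeze(0), child_vectors], dim=0).unsqueeze(0)
        _, (h_n, _) = self.bilstm(seq)             # h_n: (2, 1, dim) final states of both directions
        both = torch.cat([h_n[0, 0], h_n[1, 0]])   # (2 * dim,)
        return torch.tanh(self.proj(both))         # single vector representing the subtree


# Usage: composing (NP the hungry cat) from its label embedding and word vectors.
composer = SubtreeComposer(dim=16)
np_label = torch.randn(16)
children = torch.randn(3, 16)  # "the", "hungry", "cat"
subtree_vec = composer(np_label, children)
print(subtree_vec.shape)  # torch.Size([16])
```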
On the other hand, Transformer architectures (Vaswani et al., 2017) have been shown to outperform RNN architectures in various NLP tasks (Devlin et al., 2019). The key difference between Transformers and RNNs here is a self-attention mechanism, which selectively attends to previous vectors to obtain sentence representations. Recently, an attempt was made to investigate whether Transformer architectures with the self-attention mechanism also benefit from explicit syntactic supervision (Qian et al., 2021), but their “Parsing as Language Modeling (PLM)” approach (Choe and Charniak, 2016) does not employ the composition function, which is essential for RNNGs. Therefore, it is reasonable to hypothesize that their approach may not achieve the full benefit of explicit syntactic supervision.
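For completeness, the following is a minimal sketch of the scaled dot-product self-attention underlying Transformers (Vaswani et al., 2017), in which each position builds its representation by selectively weighting previous vectors. The causal mask reflects the left-to-right LM setting; the single-head, shared-dimensionality form is a simplification of the multi-head version and is not specific to any model in this paper.

```python
# Minimal sketch of causal (left-to-right) scaled dot-product self-attention
# (Vaswani et al., 2017). Single head and shared dimensionality are simplifications.
import torch
import torch.nn.functional as F


def causal_self_attention(x: torch.Tensor,
                          w_q: torch.Tensor,
                          w_k: torch.Tensor,
                          w_v: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, dim) vectors for the prefix processed so far
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5                    # (seq_len, seq_len)
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))          # attend only to self and the past
    return F.softmax(scores, dim=-1) @ v                      # weighted sum of previous vectors


# Usage: 5 positions, 16-dimensional vectors.
dim = 16
x = torch.randn(5, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 16])
```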
In this paper, we propose a novel architecture called Composition Attention Grammars (CAGs) that recursively compose subtrees into a single vector representation with the composition function, and selectively attend to previous structural information with the self-attention mechanism. We investigate whether these components—the composition function and the self-attention mechanism—can both induce human-like syntactic generalization. Specifically, we train LMs with