
2008), wherein the hierarchical structuring of spans
as constituents is supplemented with labelling of
the dependency-like grammatical functions be-
tween them.
CGEL posits a distributionally-defined set of
nine lexical categories, from which we developed
a part-of-speech tagset with 11 tags: N (noun),
N_pro (pronoun), V (verb), V_aux (auxiliary verb), P
(preposition), D (determinative), Adj (adjective),
Adv (adverb), Sdr (subordinator), Coordinator, and
Int (interjection). Pronouns and proper nouns are a
subset of nouns, though we have created a distinct
tag for pronouns; auxiliary verbs are a subset of
verbs. All of these categories except subordinator
and coordinator project higher-level phrasal constituents,
e.g. N ← Nom (nominal) ← NP (noun phrase).
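The relation between the 11 tags and the nine lexical categories can be sketched in a few lines of Python. This is only an illustration: the enum member names, the helper functions, and the `NOUN_PROJECTION` list are hypothetical and are not taken from any CGELBank tooling; only the tag inventory, the subset relations, and the noun projection chain come from the text above.

```python
from enum import Enum


class Tag(Enum):
    """The 11 part-of-speech tags listed in the text."""
    N = "noun"
    N_PRO = "pronoun"
    V = "verb"
    V_AUX = "auxiliary verb"
    P = "preposition"
    D = "determinative"
    ADJ = "adjective"
    ADV = "adverb"
    SDR = "subordinator"
    COORDINATOR = "coordinator"
    INT = "interjection"


# Tags that mark a subset of a broader lexical category (per the text).
SUBSET_OF = {Tag.N_PRO: Tag.N, Tag.V_AUX: Tag.V}

# Per the text, subordinators and coordinators do not project phrases.
NON_PROJECTING = {Tag.SDR, Tag.COORDINATOR}

# The only projection chain spelled out in the text: N <- Nom <- NP.
NOUN_PROJECTION = ["N", "Nom", "NP"]


def lexical_category(tag):
    """Collapse a tag to the lexical category it belongs to."""
    return SUBSET_OF.get(tag, tag)


def projects_phrase(tag):
    """True if words with this tag head a higher-level phrasal constituent."""
    return lexical_category(tag) not in NON_PROJECTING


if __name__ == "__main__":
    categories = {lexical_category(t) for t in Tag}
    print(len(Tag), "tags collapse to", len(categories), "lexical categories")
```

Collapsing the two subset tags (pronoun into noun, auxiliary verb into verb) is what reconciles the 11-tag inventory with the nine categories.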
Each CGEL phrase has exactly one Head func-
tion along with zero or more dependents. There
are also a few non-phrasal dependent constituents
in flat structures (e.g. Coordinate). Phrases are
typically binary- or unary-branching, but n-ary
branches are also possible. CGEL contrasts adjuncts
(Mod, Supplement) with complements (Comp
and subtypes shown in figure 4 of appendix A).
It also analyses many complex syntactic phenom-
ena often ignored in computational work, e.g. gaps
(§3.3) and scope of coordination (§3.3.4).
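A minimal sketch of how such a tree might be represented, and how the one-Head constraint could be checked, follows. It is an assumption-laden illustration, not the CGELBank data format: the `Node` class, the `head_violations` helper, the placeholder function label `Root`, and the category name `AdjP` are all hypothetical, and the check treats any node whose children bear the Coordinate function as a flat structure exempt from the Head requirement.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Functions that signal a flat (non-headed) structure; "Coordinate" is the
# example given in the text. Treating it this way is an assumption.
FLAT_FUNCTIONS = {"Coordinate"}


@dataclass
class Node:
    function: str                 # label on the edge to the parent, e.g. "Head", "Mod"
    category: str                 # constituent category, e.g. "NP", "Nom", "N"
    text: Optional[str] = None    # surface form, set on lexical nodes only
    children: List["Node"] = field(default_factory=list)


def head_violations(node):
    """Return every non-flat node with children but not exactly one Head."""
    bad = []
    if node.children:
        functions = [child.function for child in node.children]
        is_flat = any(f in FLAT_FUNCTIONS for f in functions)
        if not is_flat and functions.count("Head") != 1:
            bad.append(node)
        for child in node.children:
            bad.extend(head_violations(child))
    return bad


# "old books": a binary branch (Mod + Head) under unary branches.
tree = Node("Root", "NP", children=[
    Node("Head", "Nom", children=[
        Node("Mod", "AdjP", children=[Node("Head", "Adj", text="old")]),
        Node("Head", "Nom", children=[Node("Head", "N", text="books")]),
    ]),
])

if __name__ == "__main__":
    print("violations:", len(head_violations(tree)))
```

Storing the function on the child rather than on a separate edge object keeps the constituency and dependency-like layers in one structure, which mirrors the way each edge between categories carries a function label.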
Multilingual syntactic formalisms such as Universal
Dependencies (UD; Nivre et al., 2016) lack
the expressivity of CGEL for English specifically,
owing to their need for cross-lingual consistency.²
CGEL also differs from the Penn Treebank (PTB;
Marcus et al., 1994) in associating a grammatical
function with each edge between phrasal categories
in a tree, adopting the concept of headedness from
dependency grammars. CGEL also analyzes “transformational”
phenomena in English with gaps, but
more sparingly than PTB, limiting
their use to ellipsis and relations that would
otherwise be non-projective.
As a whole, we find that CGEL combines the
best of both worlds: comprehensive analysis of
complex and long-tail linguistic phenomena with
the minimal parse complexity needed to express
them, as well as both dependency and constituency
layers. And it does so in human-readable fashion,
precisely defining its terminology and defending
its analyses in a nearly 2000-page volume that is
widely referenced by linguists. Finally, a companion
textbook (Huddleston and Pullum, 2005;
Huddleston et al., 2021) introduces language learners
to the major points of English syntax. A CGEL-style
treebank (CGELBank), potentially with a
parser, would therefore be of interest to English
learners familiar with the framework.

² E.g., the general principle of content-heads, leading to
verbal auxiliaries being treated as dependents of the main verb
in English but main-verb copulae (which are not used in some
languages) being analysed as dependents of the predicate.
3 Linguistic decisions
Despite CGEL’s breadth, depth, and specificity (Culicover,
2004), some elements of English grammar
are not fully specified in it. While
CGEL does use some corpus data in its analyses
and descriptions of English, the grammar is largely
based on contrived examples. CGEL also
describes only an idealized standard variety of English;
internet-sourced text is out-of-distribution. As a
treebank forces us to be explicit in our linguistic
decisions, here we describe a number of the issues
we ran into and the (not necessarily final) decisions
we made.
3.1 Categorizing individual lexemes
Designing part-of-speech (POS) tagsets and delineating
boundaries between tags has long been a
contentious problem in treebanking (Atwell, 2008).
CGEL is detailed but non-exhaustive in this regard.
In developing CGELBank, we had to collate the
many mentions of lexemes and their categories distributed
throughout CGEL,³ and to carefully apply
CGEL principles to the categorization
of hundreds of lexemes not explicitly mentioned.⁴
Examples of words categorized this way
include the determinative said (e.g., as in said
contract), the coordinator slash (e.g., Dear God
slash Allah slash Buddha slash Zeus), and the
preposition o’clock (Pullum and Reynolds, 2013).
3.2 Simplifying and un-simplifying
As shown in CGEL’s figure 5b (p. 48; reproduced
in the appendix as figure 5), CGEL uses a variety of
subtypes of head within clause structure: Nucleus
is the head of a clause which is itself a clause,
Predicate is a VP that is the head of a clause, and
Predicator is the V that is the head of a VP.

³ E.g., “One borderline case is else. This is an adverb
when following or, as in Hurry up or else you’ll miss the bus,
but arguably a preposition when it postmodifies interrogative
heads and compound determinatives.” (CGEL, fn. 5, p. 15)
⁴ Carried out since 2006 in consultation with Huddleston
and Pullum and recorded in the Simple English Wiktionary.

We