CGELBank: CGEL as a Framework for English Syntax Annotation
Brett Reynolds
Humber College
brett.reynolds@humber.ca
Aryaman Arora Nathan Schneider
Georgetown University
{aa2190,nathan.schneider}@georgetown.edu
Abstract
We introduce the syntactic formalism of the Cambridge Grammar of the English Language (CGEL) to the world of treebanking through the CGELBank project. We discuss some issues in linguistic analysis that arose in adapting the formalism to corpus annotation, followed by quantitative and qualitative comparisons with parallel UD and PTB treebanks. We argue that CGEL provides a good tradeoff between comprehensiveness of analysis and usability for annotation, which motivates expanding the treebank with automatic conversion in the future.
1 Introduction
Parsing hierarchical syntactic structure is a central endeavour in computational linguistics (Jurafsky and Martin, 2021; Kübler et al., 2009). Many syntactic theories and annotation frameworks exist. The venerable Penn Treebank (PTB; Marcus et al., 1994) has been the leading approach to annotating English constituent structure; its detailed annotation guidelines have been applied to many corpora over the years (e.g., Bies et al., 2012; Pradhan et al., 2013). Other theories applied to large-scale annotation for English include Universal Dependencies (UD; Nivre et al., 2016), Combinatory Categorial Grammar (Hockenmaier and Steedman, 2007), Role and Reference Grammar (Bladier et al., 2018), and Head-Driven Phrase Structure Grammar (Miyao et al., 2004), among others. Each formalism makes different theoretical claims (e.g., are there transformations?) and computational tradeoffs (e.g., complexity vs. parsing efficiency).
In this paper, we introduce a treebank for English built on the syntactic formalism of the Cambridge Grammar of the English Language (CGEL; Huddleston and Pullum, 2002). CGEL is the authoritative descriptive grammar for English and analyses many syntactic phenomena in extreme detail with minimal theoretical claims (§2). Our motivations for introducing yet another treebank are: (a) existing annotation guidelines for other formalisms cannot approach the depth of CGEL; (b) neither of the most common frameworks for descriptive annotation in English (PTB and UD) offers a coherent account of both constituent structure and grammatical functions; and (c) annotating a real-world treebank is a strong test of the expressivity and consistency of the CGEL formalism.
[Figure 1: tree diagram. Node labels in extraction order: Clause; Prenucleus: NP_x; Determiner-Head: DP; Head: D (which); Head: Nom; Head: Clause; Subj: NP; Head: Nom; Head: N (Liz); Head: VP; Head: V (bought); Obj: GAP_x]
Figure 1: CGELBank-style tree for interrogative clause which Liz bought (as in There are three options; I wonder which Liz bought).
Below, we delve into the linguistic issues surmounted in building the initial version of CGELBank (§3).¹ We then compare the CGEL trees with UD and PTB trees of the same sentences (§4).
2 What is CGEL?
CGEL is the most comprehensive and up-to-date descriptive grammar of English (Brew, 2003). It uses a morphosyntactic formalism for describing English from first principles (Pullum and Rogers, 2008), wherein the hierarchical structuring of spans as constituents is supplemented with labelling of the dependency-like grammatical functions between them.
¹ https://github.com/nert-nlp/cgel/
arXiv:2210.00394v1 [cs.CL] 1 Oct 2022
CGEL posits a distributionally-defined set of nine lexical categories, from which we developed a part-of-speech tagset with 11 tags: N (noun), N_pro (pronoun), V (verb), V_aux (auxiliary verb), P (preposition), D (determinative), Adj (adjective), Adv (adverb), Sdr (subordinator), Coordinator, and Int (interjection). Pronouns and proper nouns are a subset of nouns, though we have created a distinct tag for pronouns; auxiliary verbs are a subset of verbs. All of these categories except subordinator and coordinator project higher-level phrasal constituents, e.g. N → Nom (nominal) → NP (noun phrase).
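As a rough illustration of the tagset just described, the following Python sketch lists the 11 tags, folds the two subset tags back onto their parent categories, and records which tags project phrases. The names `POS_TAGS`, `SUPERCATEGORY`, `NON_PROJECTING`, `lexical_category`, and `projects_phrase` are hypothetical choices made for this example; they are not taken from the CGELBank release.

```python
# Illustrative sketch only: the 11 POS tags listed above, keyed by tag name.
POS_TAGS = {
    "N": "noun",
    "N_pro": "pronoun",
    "V": "verb",
    "V_aux": "auxiliary verb",
    "P": "preposition",
    "D": "determinative",
    "Adj": "adjective",
    "Adv": "adverb",
    "Sdr": "subordinator",
    "Coordinator": "coordinator",
    "Int": "interjection",
}

# Pronouns are a subset of nouns and auxiliaries a subset of verbs, so the
# 11 tags fold back onto nine lexical categories.
SUPERCATEGORY = {"N_pro": "N", "V_aux": "V"}

# Subordinators and coordinators are the only categories that project no phrase.
NON_PROJECTING = {"Sdr", "Coordinator"}

# The one projection chain spelled out in the text: N -> Nom -> NP.
NOUN_PROJECTION = ["N", "Nom", "NP"]


def lexical_category(tag):
    """Map a tag to its lexical category (hypothetical helper)."""
    if tag not in POS_TAGS:
        raise KeyError(f"unknown tag: {tag}")
    return SUPERCATEGORY.get(tag, tag)


def projects_phrase(tag):
    """True if the tag's category projects a higher-level phrasal constituent."""
    if tag not in POS_TAGS:
        raise KeyError(f"unknown tag: {tag}")
    return tag not in NON_PROJECTING
```

Keeping the subset relation in a separate mapping, rather than dropping the distinct tags, preserves the finer tagset while still allowing counts over the nine categories.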
Each CGEL phrase has exactly one Head function along with zero or more dependents. There are also a few non-phrasal dependent constituents in flat structures (e.g. Coordinate). Phrases are typically binary- or unary-branching, but n-ary branches are also possible. CGEL contrasts adjuncts (Mod, Supplement) with complements (Comp and subtypes shown in figure 4 of appendix A). It also analyses many complex syntactic phenomena often ignored in computational work, e.g. gaps (§3.3) and scope of coordination (§3.3.4).
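The one-Head-per-phrase constraint can be checked mechanically. Below is a minimal sketch, assuming a simple in-memory tree in which every node stores the grammatical function of its incoming edge; the `Node` class and `head_violations` function are hypothetical and do not reflect CGELBank's actual file format or tooling. The example tree encodes only the inner clause of Figure 1 (Liz bought plus the object gap).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    function: Optional[str]      # function on the incoming edge; None at the root
    category: str                # e.g. "Clause", "NP", "Nom", "N", "GAP"
    word: Optional[str] = None   # set on lexical nodes only
    children: List["Node"] = field(default_factory=list)


def head_violations(node):
    """Return categories of branching nodes lacking exactly one Head child."""
    bad = []
    if node.children:
        heads = [c for c in node.children if c.function == "Head"]
        if len(heads) != 1:
            bad.append(node.category)
        for child in node.children:
            bad.extend(head_violations(child))
    return bad


# Inner clause of Figure 1: Subj "Liz", Head VP with V "bought" and an Obj gap.
inner_clause = Node("Head", "Clause", children=[
    Node("Subj", "NP", children=[
        Node("Head", "Nom", children=[Node("Head", "N", word="Liz")]),
    ]),
    Node("Head", "VP", children=[
        Node("Head", "V", word="bought"),
        Node("Obj", "GAP"),
    ]),
])

violations = head_violations(inner_clause)
```

This check is deliberately simplistic: it would flag the flat coordination structures mentioned above, and it does not recognise fused functions such as Determiner-Head, so a real validator would need exemptions for both.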
Multilingual syntactic formalisms such as Universal Dependencies (UD; Nivre et al., 2016) lack the expressivity of CGEL for English specifically, owing to the need for cross-lingual consistency.² CGEL also differs from the Penn Treebank (PTB; Marcus et al., 1994) in associating grammatical functions with each edge between phrasal categories in a tree, adopting the concept of headedness from dependency grammars. CGEL also analyzes “transformational” phenomena in English with gaps, but in a less overwhelming manner than PTB, limiting their use to ellipsis and relations that would otherwise be non-projective.
As a whole, we find that CGEL combines the best of both worlds: comprehensive analysis of complex and long-tail linguistic phenomena with the minimal parse complexity needed to express them, as well as both dependency and constituency layers. And it does so in human-readable fashion, precisely defining its terminology and defending its analyses in a nearly 2000-page volume that is widely referenced by linguists. Finally, a companion textbook (Huddleston and Pullum, 2005; Huddleston et al., 2021) introduces language learners to the major points of English syntax. A CGEL-style treebank (CGELBank), potentially with a parser, would therefore be of interest to English learners familiar with the framework.
² E.g., the general principle of content-heads, leading to verbal auxiliaries being treated as dependents of the main verb in English but main-verb copulae (which are not used in some languages) being analysed as dependents of the predicate.
3 Linguistic decisions
Despite its breadth, depth, and specificity (Culicover, 2004), there are elements of English grammar that are not fully specified in CGEL. While CGEL does use some corpus data in its analyses and descriptions of English, the grammar is largely based on contrived examples. CGEL also describes only an idealized standard variety of English; internet-sourced text is out-of-distribution. As a treebank forces us to be explicit in our linguistic decisions, here we describe a number of the issues we ran into and the (not necessarily final) decisions we made.
3.1 Categorizing individual lexemes
Designing part-of-speech (POS) tagsets and delineating boundaries between tags has long been a contentious problem in treebanking (Atwell, 2008). CGEL is detailed but non-exhaustive in this regard. In developing CGELBank, we had to collate the many mentions of lexemes and their categories distributed throughout CGEL,³ and to carefully apply CGEL principles to the categorization of hundreds of lexemes not explicitly mentioned.⁴ Examples of words categorized this way include the determinative said (e.g., as in said contract), the coordinator slash (e.g., Dear God slash Allah slash Buddha slash Zeus), and the preposition o’clock (Pullum and Reynolds, 2013).
3.2 Simplifying and un-simplifying
As shown in CGEL’s figure 5b (p. 48; reproduced in the appendix as figure 5), CGEL uses a variety of subtypes of head within clause structure: Nucleus is the head of a clause which is itself a clause, Predicate is a VP that is the head of a clause, and Predicator is the V that is the head of a VP. We dispense with these subtypes, simply using Head in all cases, as shown in figure 1.
³ E.g., “One borderline case is else. This is an adverb when following or, as in Hurry up or else you’ll miss the bus, but arguably a preposition when it postmodifies interrogative heads and compound determinatives.” (CGEL, fn. 5, p. 15)
⁴ Carried out since 2006 in consultation with Huddleston and Pullum and recorded in the Simple English Wiktionary.
In some cases, CGEL simplifies tree structures by removing intermediate unary nodes, such as an intermediate Head:Nom between Head:N and its projected NP. We always include such nodes, as in figure 1. Moreover, we show complements as sisters of N and V but modifiers as sisters of Nom and VP, where CGEL again simplifies at times.
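The two conventions in this subsection (collapsing head subtypes to Head, and always spelling out the unary Nom layer) can be pictured as a small normalisation pass. The sketch below is an assumption-laden illustration, not CGELBank code: trees are nested `(function, category, payload)` tuples, the function `normalize` is hypothetical, and the sentence Liz left is an invented example.

```python
# Head subtypes that CGEL distinguishes within clause structure.
HEAD_SUBTYPES = {"Nucleus", "Predicate", "Predicator"}


def normalize(tree):
    """Relabel head subtypes as Head and restore an omitted Nom layer.

    `tree` is (function, category, payload), where payload is either a word
    (str) for lexical nodes or a list of subtrees for phrasal nodes.
    """
    function, category, payload = tree
    if function in HEAD_SUBTYPES:
        function = "Head"
    if isinstance(payload, str):
        return (function, category, payload)
    children = [normalize(child) for child in payload]
    if category == "NP":
        # Wrap a bare Head:N daughter of NP in the intermediate Head:Nom node.
        children = [
            ("Head", "Nom", [c]) if c[0] == "Head" and c[1] == "N" else c
            for c in children
        ]
    return (function, category, children)


# A simplified CGEL-style tree for the invented sentence "Liz left".
simplified = ("Nucleus", "Clause", [
    ("Subj", "NP", [("Head", "N", "Liz")]),
    ("Predicate", "VP", [("Predicator", "V", "left")]),
])

expanded = normalize(simplified)
```

Only the N → Nom → NP chain is restored here because it is the one chain named in the text; other omitted unary nodes would need analogous rules.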
3.3 Gaps
CGEL posits gaps in tree structures when a constituent appears in prenucleus position, as in which Liz bought in figure 1, but is inconsistent in indicating them. In most cases, we have decided to explicitly indicate a gap. We identify the following unclear cases and present our decisions.
3.3.1 Subject gaps
In open interrogatives such as (1a), “inversion accompanies the placement in prenuclear position of a non-subject interrogative phrase” (CGEL, p. 95), while there is no inversion in those like (1b).
(1) a. What did she tell you?
b. Who told you that?
This suggests two possible analyses for the structure of (1b): either both (1a) and (1b) have a prenucleus and a co-indexed gap, or only (1a) does, with who in (1b) being a normal subject (Maling, 2000). Unfortunately, it is not clear to us which position CGEL takes. The discussion on p. 96 suggests that there is no inversion in (1b) because there is no prenucleus, and thus no gap. However, this is not a consistent rule throughout the text.⁵ Given the ambiguity, we have taken what we see as the standard position (e.g. Maling, 2000; Bies et al., 1995) that there is indeed a gap in clauses with subjects that have been questioned or relativized.
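To make the co-indexation concrete, here is a minimal sketch of the gap analysis of (1b) together with a check that every gap index is shared by some overt constituent. The dictionary layout, the `unresolved_gaps` function, and the internal structure assumed for who are all hypothetical; the VP is left unexpanded because its internal labels are not given in the text.

```python
def unresolved_gaps(tree):
    """Return indices carried by a GAP node but by no non-gap constituent."""
    gap_ids, antecedent_ids = set(), set()
    stack = [tree]
    while stack:
        node = stack.pop()
        idx = node.get("idx")
        if idx is not None:
            if node["cat"] == "GAP":
                gap_ids.add(idx)
            else:
                antecedent_ids.add(idx)
        stack.extend(node.get("kids", []))
    return gap_ids - antecedent_ids


# (1b) "Who told you that?" under the gap analysis adopted above:
# who sits in Prenucleus and is co-indexed with a gap in subject position.
who_told = {"fn": None, "cat": "Clause", "kids": [
    {"fn": "Prenucleus", "cat": "NP", "idx": "x", "kids": [
        {"fn": "Head", "cat": "Nom", "kids": [
            {"fn": "Head", "cat": "N_pro", "word": "who"},
        ]},
    ]},
    {"fn": "Head", "cat": "Clause", "kids": [
        {"fn": "Subj", "cat": "GAP", "idx": "x"},
        # Internal structure of "told you that" omitted in this sketch.
        {"fn": "Head", "cat": "VP", "kids": []},
    ]},
]}

dangling = unresolved_gaps(who_told)
```

Under the alternative analysis, in which who is an ordinary subject, the tree would contain no GAP node at all and the check would pass trivially.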
3.3.2 Adjunct gaps
The issue here concerns adjuncts such as the PP in after lunch, we left and whether or not they are co-indexed to a VP-final gap. CGEL recognizes that “adjuncts may also be located in prenuclear position” (p. 1372), but “there need be no anaphoric link with a pronoun [or gap] in the nucleus of the clause” (p. 1410), as in As for the concert-hall, the architect excelled herself, where the underlined section (As for the concert-hall) is in prenucleus position but cannot appear clause-finally. Other examples of items appearing in pre- or postnuclear position without a corresponding gap include the relative clause in an it-cleft like It’s the director who was sacked and markers (e.g., whether it works, p. 956).
⁵ CGEL recognizes explicitly that a subject gap may exist in a construction like Who_i [do you think [__i was responsible]]? (p. 1082), but this involves a subordinate clause.
Because adjuncts appear in a variety of locations in clause structure, with some not appearing clause-finally, we have decided against including a gap. The exception is in relative and open interrogative clauses, where CGEL explicitly marks a gap (e.g., That’s not the reason [why_i [he did it __i]], p. 1086).
3.3.3 Phrasal genitives
CGEL has the genitive suffix ’s attaching to the last word in an NP such as [somebody local’s]. In the case of an NP ending in a gap, we take the ’s as attached to that gap, as in a guy I know __’s house.
3.3.4 Coordination and comparatives
CGEL calls coordinations such as The PM arrived at six and the Queen __ an hour later Gapped Coordination, and writes examples like this with a “gap”, as shown here between the Queen and an hour later (p. 1338). But this is not the same kind of gap we find in long-distance dependencies such as the subject relatives discussed above; rather, this is ellipsis. Consequently, CGELBank does not include a gap in tree structure. Instead, we have a nonce coordinate NP + PP with daughters Subj:NP + PredComp:PP or the like. Similarly, we do not treat the “gaps” in comparatives as gaps in tree structure.
3.4 Branching & tree structure
3.4.1 Extraposition
CGEL posits extraposed subjects and objects, such as the underlined clause (that we left early) in It’s a good thing that we left early,⁶ saying that its extraposed subject is “in a matrix clause containing be + a short predicative complement” (p. 953). Unfortunately, this leaves the precise structure unclear.
After considering various options, we have decided to attach the extraposed constituent as a second complement in the VP with ternary branching. Despite the seeming difference implied by the labels “extraposed” and “internalized”, we see this as analogous to the position of the internalized
⁶ These are semantic agents, patients, etc., but not syntactic Subj and Obj.