SynGEC: Syntax-Enhanced Grammatical Error Correction
with a Tailored GEC-Oriented Parser
Yue Zhang1, Bo Zhang2, Zhenghua Li1*, Zuyi Bao2, Chen Li2, Min Zhang1
1Institute of Artificial Intelligence, School of Computer Science and Technology,
Soochow University, China; 2DAMO Academy, Alibaba Group, China
1yzhang21@stu.suda.edu.cn,1{zhli13,minzhang}@suda.edu.cn
2{klayzhang.zb,zuyi.bzy,puji.lc}@alibaba-inc.com
Abstract
This work proposes a syntax-enhanced grammatical error correction (GEC) approach named SynGEC that effectively incorporates dependency syntactic information into the encoder part of GEC models.1 The key challenge for this idea is that off-the-shelf parsers are unreliable when processing ungrammatical sentences. To confront this challenge, we propose to build a tailored GEC-oriented parser (GOPar) using parallel GEC training data as a pivot. First, we design an extended syntax representation scheme that allows us to represent both grammatical errors and syntax in a unified tree structure. Then, we obtain parse trees of the source incorrect sentences by projecting trees of the target correct sentences. Finally, we train GOPar on such projected trees. For GEC, we employ a graph convolutional network to encode source-side syntactic information produced by GOPar, and fuse it with the outputs of the Transformer encoder. Experiments on mainstream English and Chinese GEC datasets show that our proposed SynGEC approach consistently and substantially outperforms strong baselines and achieves competitive performance. Our code and data are all publicly available at https://github.com/HillZhang1999/SynGEC.
1 Introduction
Given an ungrammatical sentence, the grammat-
ical error correction (GEC) task aims to produce
a grammatical target sentence with the intended
meaning (Grundkiewicz et al.,2020;Wang et al.,
2021). Recent mainstream approaches treat GEC
as a monolingual machine translation (MT) task
(Yuan and Briscoe,2016;Junczys-Dowmunt et al.,
2018). Standard encoder-decoder based MT mod-
els, e.g., Transformer (Vaswani et al.,2017), have
*Corresponding author.
1 Although this work focuses on the dependency syntax structure, SynGEC can also be extended to the constituency syntax structure straightforwardly.
emerged as a dominant paradigm and achieved
state-of-the-art (SOTA) results on various GEC
benchmarks (Rothe et al.,2021;Stahlberg and Ku-
mar,2021;Sun et al.,2022;Zhang et al.,2022).
Despite their impressive achievements, most work
treats the input sentence as a sequence of tokens,
without explicitly exploiting syntactic or semantic
information.
Compared with MT, GEC has two peculiarities
that directly motivate this work. First, the training
data for GEC models is much less abundant, which
may be alleviated by incorporating linguistic struc-
ture knowledge like syntax. As shown in Table 1
and Table 5, the English and Chinese GEC tasks
only have about 126K and 157K high-quality la-
belled source/target sentence pairs for training, if
not considering the highly noisy crowd-annotated
Lang8 data (Mita et al.,2020). Second, according
to our preliminary observation, many errors in un-
grammatical sentences are intrinsically correlated
with syntactic information. For example, errors like
inconsistency in tense or singular-vs-plural forms
can be better detected and corrected with the help
of long-range syntactic dependencies.
In this paper, we propose SynGEC, an approach
that can effectively inject the syntactic structure of
the input sentence into the encoder part of GEC
models. The critical challenge here is that off-the-
shelf parsers are unreliable when handling ungram-
matical sentences. On the one hand, off-the-shelf
parsers are trained on clean treebanks that only
consist of grammatical sentences. When parsing
ungrammatical sentences, their performance may
sharply degrade due to the input mismatch. On
the other hand, mainstream syntax representation
schemes, adopted by existing treebanks, do not
cover the non-canonical structures arising from
grammatical errors. In consequence, under such
schemes, it is sometimes difficult to find a plausible
syntactic tree to properly parse an ungrammatical
sentence (e.g., the sentence in Figure 1(d)).
arXiv:2210.12484v1 [cs.CL] 22 Oct 2022
[Figure 1 shows four dependency trees over the sentence "But there were no buyers ." and erroneous variants of it: (a) the correct sentence; (b) a substituted error ("Bat"), whose incoming arc is labelled "S"; (c) a redundant error ("any"), whose arc is labelled "R"; (d) a missing error (the word "buyers" is dropped), marked by an "M" arc.]

Figure 1: Illustration of our extended syntax representation scheme. ∅ denotes the missing word.
Indeed, several prior works have tried to improve syntactic parsing for ungrammatical texts by annotating data (Dickinson and Ragheb, 2009; Berzak et al., 2016; Nagata and Sakaguchi, 2016). However, these works do not extend existing syntax representation schemes to accommodate errors, meaning they make few changes to the original syntactic label sets. Besides, manual annotation is expensive and time-consuming, so their annotated treebanks for ungrammatical sentences are of a relatively small scale.
To confront the challenge of unreliable perfor-
mance of off-the-shelf parsers on ungrammatical
sentences, we propose to train a tailored GEC-
oriented parser (GOPar). The basic idea is to utilize
parallel source/target sentence pairs in the GEC
training data. First, we parse the target correct sen-
tences using a vanilla off-the-shelf parser. Then, we
construct the tree for the source incorrect sentences
via tree projection. To accommodate grammatical
errors, we propose an extended syntax representa-
tion scheme based on several straightforward rules,
which allows us to represent both grammatical er-
rors and syntax in a unified tree structure. Finally,
we train GOPar directly on the automatically con-
structed trees of the source incorrect sentences in
the GEC training data. During both GEC training
and evaluation procedures, GOPar is used to gener-
ate syntactic information for the input sentences.
To incorporate syntactic information provided
by GOPar, we cascade several label-aware graph
convolutional network (GCN) layers (Kipf and
Welling,2017;Zhang et al.,2020a) above the
encoder of our baseline Transformer-based GEC
model. We conduct experiments on two widely-
used English GEC evaluation datasets, i.e., CoNLL-
14 (Ng et al.,2014) and BEA-19 (Bryant et al.,
2019), and two Chinese GEC evaluation datasets,
i.e., NLPCC-18 (Zhao et al.,2018) and MuCGEC
(Zhang et al.,2022). Extensive experimental re-
sults and in-depth analyses show that our SynGEC
approach achieves consistent and substantial im-
provement on all datasets, even when the baseline
model is enhanced with large pre-trained language
models (PLMs) like BART (Lewis et al.,2020),
and outperforms previous SOTA systems under
comparable settings.
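As a concrete illustration of the label-aware GCN fusion described above, here is a minimal NumPy sketch. This is our own toy illustration of the idea, not the authors' implementation: token states from the Transformer encoder are updated along labeled dependency arcs, with one weight matrix per dependency label, and the result can then be fused back with the encoder outputs.

```python
import numpy as np

def label_aware_gcn_layer(H, arcs, W_label, W_self):
    """One propagation step: each token state is updated from its
    syntactic neighbours, with a separate weight matrix per dependency
    label, plus a self-loop transform, followed by a ReLU."""
    out = H @ W_self                          # self loop
    for head, dep, label in arcs:             # labeled dependency arcs
        W = W_label[label]
        out[dep] = out[dep] + H[head] @ W     # head -> dependent
        out[head] = out[head] + H[dep] @ W.T  # dependent -> head
    return np.maximum(out, 0.0)               # ReLU

# Toy usage: 3 tokens with 4-dim encoder states and two labeled arcs.
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))                   # Transformer encoder outputs
W_label = {lab: rng.normal(size=(4, 4)) for lab in ("nsubj", "S")}
fused = label_aware_gcn_layer(H, [(1, 0, "nsubj"), (1, 2, "S")], W_label, np.eye(4))
```

In practice such layers would be learned end-to-end and combined with the encoder via residual or gated fusion; the sketch only shows the label-conditioned message passing.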
2 Our GEC-Oriented Parser
This section describes our tailored GOPar, a de-
pendency parser that is more competent in parsing
ungrammatical sentences than off-the-shelf parsers.
2.1 Extended Syntax Representation Scheme
The standard scheme for representing dependency
syntax is originally designed for grammatical sen-
tences, and thus may not cover many non-canonical
structures in grammatically erroneous sentences.
Therefore, to obtain a tailored parser, our first task
is to extend the syntax representation scheme and,
more specifically, to design a complementary set
of rules to handle different grammatical mistakes.
With this scheme, we can directly use a unified tree
structure to represent both grammatical errors and
syntactic information.
As shown in Figure 1, we propose a light-weight extended scheme based on several straightforward rules, corresponding to the three types of grammatical errors, i.e., substituted, redundant and missing (Bryant et al., 2017).2 In this work, we use the Stanford Dependencies Scheme v3.3.0 (Dozat and Manning, 2017) as the basic scheme. The rules
are designed in such a way that we make as few
adjustments as possible to the syntactic tree of the
target correct sentence during tree projection. Cor-
respondingly, we add three labels into the original
syntactic label set, i.e., “S”, “R” and “M”, to cap-
ture three kinds of errors. Since such categorization
is also adopted in the grammatical error detection
(GED) task (Yuan et al.,2021), we refer to them as
GED labels.
2 We treat word-order errors as the combination of redundant and missing errors in this work.
[Figure 2 illustrates the workflow on an example pair from the parallel GEC data: the ungrammatical source sentence xi "Privicy protection belongs to human rights ." and the grammatical target sentence yi "Privacy belongs to human rights .". The off-the-shelf parser parses yi; aligning the pair yields the extracted errors "A 0 1|||S|||Privacy|||REQUIRED|||-NONE-|||1" and "A 1 2|||R|||||REQUIRED|||-NONE-|||1"; projecting the target-side tree yields the dependency tree of xi (with GED labels "S" and "R" on the erroneous words); the projected trees are then used to train the tailored GOPar.]

Figure 2: The workflow for obtaining our tailored GOPar.
Substituted errors (S) include spelling errors, tense errors, singular/plural inconsistency errors, etc. For simplicity, we do not consider such fine-grained categories, and directly use a single "S" label to indicate that the word should be replaced by another one, as shown in Figure 1(b).

Redundant errors (R) mean that some words should be deleted. For each redundant word, we let it depend on its right-side adjacent word, with a label "R",3 as shown in Figure 1(c). When the redundant word is at the end of the sentence, we instead let it depend on its left-side adjacent word.

Missing errors (M) mean that some words should be inserted. For each missing word, we assign a label "M" to the incoming arc of its right-side adjacent word, as shown in Figure 1(d). When the missing word is at the end of the sentence, we keep the original tree unchanged. If several consecutive words are missing, the structure remains the same as when a single word is missing. Moreover, since a missing word may have children in the tree of the correct sentence, we let them depend on the head word of the missing word, without changing their syntactic labels.
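The three rules can be sketched in a few lines of toy code. The representation (a list of (head, label) pairs, 0-indexed, with -1 for the pseudo-root) and the function names are ours, chosen purely for illustration:

```python
ROOT = -1  # pseudo-root head index

def mark_substituted(tree, i):
    # Rule S: keep the projected head, relabel the arc "S".
    head, _ = tree[i]
    tree[i] = (head, "S")

def mark_redundant(tree, i):
    # Rule R: attach the redundant word to its right neighbour
    # (or its left neighbour if it is sentence-final), label "R".
    head = i + 1 if i + 1 < len(tree) else i - 1
    tree[i] = (head, "R")

def mark_missing_right_neighbour(tree, i):
    # Rule M: the word just right of the gap keeps its head, but its
    # incoming arc is relabelled "M".
    head, _ = tree[i]
    tree[i] = (head, "M")

# Figure 1(b): "Bat there were no buyers ." -- token 0 is substituted.
tree = [(2, "cc"), (2, "expl"), (ROOT, "root"),
        (4, "det"), (2, "nsubj"), (2, "punct")]
mark_substituted(tree, 0)
# tree[0] is now (2, "S")
```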
Limitation discussion. Our extended scheme may encounter problems when different types of errors occur consecutively. Taking "But was no buyers" as an example, we need to replace "was" with "were" and insert "there" before "were" at the same time. Therefore, according to our rules, the label of the incoming arc of "was" can be either "S" or "M", leading to a label conflict. To decide a unique label, we simply define a priority order: "S" > "R" > "M". Overall, the current version of our scheme is imperfect, and there are still many points that can be improved. For example, when confronting substituted and redundant errors, some original labels will be overwritten by the GED labels "S" and "R", which may cause the loss of some valuable information. One possible solution is to combine GED and syntax labels and use joint labels like "S-Root", "M-Subj", etc. We leave such extensions of our scheme as future work.

3 Even if the right-side adjacent word is redundant.
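The conflict-resolution rule amounts to picking the highest-priority candidate label; a one-line sketch (names ours):

```python
# Priority order S > R > M for resolving GED-label conflicts on one arc.
PRIORITY = {"S": 3, "R": 2, "M": 1}

def resolve(candidate_labels):
    """Keep the highest-priority GED label among the candidates."""
    return max(candidate_labels, key=PRIORITY.__getitem__)

resolve(["M", "S"])  # -> "S", as in the "But was no buyers" example
```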
2.2 Training GOPar

With the extended syntax representation scheme, we propose to train our tailored GOPar by using the parallel GEC training data D = {(xi, yi)} as a pivot. The major goal is to automatically generate high-quality parse trees for large-scale sentences with realistic grammatical errors, and use them to train a parser suitable for parsing ungrammatical sentences. Figure 2 illustrates the workflow, consisting of the following four steps.

First, we use an off-the-shelf parser to parse the target correct sentences (i.e., yi) of the GEC training data. The off-the-shelf parser can produce reliable parse trees for target-side sentences since they are (ideally) free from grammatical errors.
Second, we employ ERRANT (Bryant et al., 2017)4 to extract all grammatical errors in the source incorrect sentence (i.e., xi) according to the alignments between xi and yi. The errors extracted by ERRANT mainly contain three parts: the start and end positions of errors in source sentences, the corresponding corrections, and the error types.
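These edits come in ERRANT's M2 format, where each "A" line packs the token span, error type, and correction into "|||"-separated fields. A minimal reader might look like the sketch below; note that ERRANT's native types are fine-grained (e.g. "R:SPELL"), and the coarse S/R/M labels used here are this paper's mapping:

```python
def parse_m2_edit(line):
    """Read one M2 annotation line, e.g.
    'A 0 1|||S|||Privacy|||REQUIRED|||-NONE-|||1'."""
    fields = line.split("|||")
    _, start, end = fields[0].split()          # "A <start> <end>"
    error_type, correction = fields[1], fields[2]
    return int(start), int(end), error_type, correction

parse_m2_edit("A 0 1|||S|||Privacy|||REQUIRED|||-NONE-|||1")
# -> (0, 1, 'S', 'Privacy')
```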
Third, we construct the tree of xi by projecting the target-side tree of yi to the source side. For words that are not related to any errors, dependencies and labels are directly copied; for those related to errors, dependencies and labels are assigned according to the rules introduced in Section 2.1.

Fourth, with constructed parse trees for all source-side sentences in D, we then use them as a
4 https://github.com/chrisjbryant/errant
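The four steps above can be summarized in a schematic sketch. All three callables are placeholders standing in for the off-the-shelf parser, ERRANT, and the rule-based projection of Section 2.1; this is not the paper's code:

```python
def build_gopar_treebank(pairs, parse, extract_errors, project):
    """Assemble projected trees for every source sentence in the
    parallel GEC data; the resulting treebank is GOPar's training set."""
    treebank = []
    for src, tgt in pairs:                     # parallel data D
        tgt_tree = parse(tgt)                  # step 1: parse correct side
        edits = extract_errors(src, tgt)       # step 2: ERRANT extraction
        treebank.append((src, project(tgt_tree, src, edits)))  # step 3
    return treebank                            # step 4: train GOPar on this
```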