
SynGEC: Syntax-Enhanced Grammatical Error Correction
with a Tailored GEC-Oriented Parser
Yue Zhang1, Bo Zhang2, Zhenghua Li1*, Zuyi Bao2, Chen Li2, Min Zhang1
1Institute of Artificial Intelligence, School of Computer Science and Technology,
Soochow University, China; 2DAMO Academy, Alibaba Group, China
1yzhang21@stu.suda.edu.cn, 1{zhli13,minzhang}@suda.edu.cn
2{klayzhang.zb,zuyi.bzy,puji.lc}@alibaba-inc.com
Abstract
This work proposes a syntax-enhanced gram-
matical error correction (GEC) approach
named SynGEC that effectively incorporates
dependency syntactic information into the encoder
part of GEC models.¹ The key challenge
for this idea is that off-the-shelf parsers are un-
reliable when processing ungrammatical sen-
tences. To confront this challenge, we pro-
pose to build a tailored GEC-oriented parser
(GOPar) using parallel GEC training data as a
pivot. First, we design an extended syntax rep-
resentation scheme that allows us to represent
both grammatical errors and syntax in a uni-
fied tree structure. Then, we obtain parse trees
of the source incorrect sentences by projecting
trees of the target correct sentences. Finally,
we train GOPar with such projected trees. For
GEC, we employ the graph convolution network
to encode source-side syntactic information
produced by GOPar, and fuse it with
the outputs of the Transformer encoder. Ex-
periments on mainstream English and Chinese
GEC datasets show that our proposed SynGEC
approach consistently and substantially outper-
forms strong baselines and achieves compet-
itive performance. Our code and data are
all publicly available at https://github.com/HillZhang1999/SynGEC.
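The encoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the mean aggregation, the identity weight matrix, the mixing weight `beta`, the toy sentence, and the hand-written head indices are all assumptions made for illustration, and the abstract does not specify how the fusion is computed.

```python
import random


def gcn_layer(hidden, heads, weight):
    """One graph-convolution step over an undirected dependency tree.

    hidden: list of n token vectors (each a list of d floats)
    heads:  heads[i] is the index of token i's head, or -1 for the root
    weight: d x d matrix (list of rows), a stand-in for learned parameters
    """
    n, d = len(hidden), len(hidden[0])
    # Each token sees itself, its head, and its dependents.
    neighbours = [{i} for i in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].add(head)
            neighbours[head].add(child)
    out = []
    for i in range(n):
        # Mean-aggregate the neighbour states (an assumed aggregation).
        agg = [sum(hidden[j][k] for j in neighbours[i]) / len(neighbours[i])
               for k in range(d)]
        # Linear map followed by ReLU.
        out.append([max(0.0, sum(agg[k] * weight[k][m] for k in range(d)))
                    for m in range(d)])
    return out


def fuse(encoder_out, syntax_out, beta=0.5):
    """Weighted sum of both representations; beta is a hypothetical weight."""
    return [[(1 - beta) * e + beta * s for e, s in zip(ev, sv)]
            for ev, sv in zip(encoder_out, syntax_out)]


# Toy ungrammatical input with hand-written (not parser-produced) arcs.
tokens = ["She", "go", "to", "school"]
heads = [1, -1, 3, 1]  # She -> go, go is root, to -> school, school -> go
d = 4
rng = random.Random(0)
# Random vectors stand in for Transformer encoder outputs.
encoder_out = [[rng.uniform(-1, 1) for _ in range(d)] for _ in tokens]
weight = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
syntax_out = gcn_layer(encoder_out, heads, weight)
fused = fuse(encoder_out, syntax_out)
```

In this sketch the decoder would attend to `fused` instead of `encoder_out`, so each token representation carries information from its syntactic neighbours as well as from the sequence.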
1 Introduction
Given an ungrammatical sentence, the grammat-
ical error correction (GEC) task aims to produce
a grammatical target sentence with the intended
meaning (Grundkiewicz et al., 2020; Wang et al.,
2021). Recent mainstream approaches treat GEC
as a monolingual machine translation (MT) task
(Yuan and Briscoe, 2016; Junczys-Dowmunt et al.,
2018). Standard encoder-decoder based MT models,
e.g., Transformer (Vaswani et al., 2017), have
emerged as a dominant paradigm and achieved
state-of-the-art (SOTA) results on various GEC
benchmarks (Rothe et al., 2021; Stahlberg and Kumar,
2021; Sun et al., 2022; Zhang et al., 2022).
Despite their impressive achievements, most work
treats the input sentence as a sequence of tokens,
without explicitly exploiting syntactic or semantic
information.
* Corresponding author.
¹ Although this work focuses on the dependency syntax
structure, SynGEC can also be extended to the constituency
syntax structure straightforwardly.
Compared with MT, GEC has two peculiarities
that directly motivate this work. First, the training
data for GEC models is much less abundant, which
may be alleviated by incorporating linguistic struc-
ture knowledge like syntax. As shown in Table 1
and Table 5, the English and Chinese GEC tasks
have only about 126K and 157K high-quality labelled
source/target sentence pairs for training, respectively,
excluding the highly noisy crowd-annotated
Lang8 data (Mita et al., 2020). Second, according
to our preliminary observation, many errors in un-
grammatical sentences are intrinsically correlated
with syntactic information. For example, errors like
inconsistency in tense or singular-vs-plural forms
can be better detected and corrected with the help
of long-range syntactic dependencies.
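A small sketch can make the second point concrete. The sentence, head indices, dependency labels, and number features below are hand-written assumptions, not parser output or data from this work. The sketch contrasts a check along the subject arc with a purely local check against the nearest preceding noun.

```python
# Hand-annotated toy example; every annotation here is hypothetical.
tokens = ["The", "students", "in", "the", "last", "row", "likes", "music"]
# heads[i] is the index of token i's head, -1 marks the root.
heads = [1, 6, 5, 5, 5, 1, -1, 6]
labels = ["det", "nsubj", "case", "det", "amod", "nmod", "root", "obj"]
# Grammatical number for the two nouns left of the verb and for the verb.
number = {1: "plural", 5: "singular", 6: "singular"}


def agreement_errors(heads, labels, number):
    """Return (subject, verb) index pairs whose number features disagree."""
    errors = []
    for dep, (head, label) in enumerate(zip(heads, labels)):
        if label == "nsubj" and dep in number and head in number:
            if number[dep] != number[head]:
                errors.append((dep, head))
    return errors


def nearest_left_agrees(verb, number):
    """Local heuristic: compare the verb with the closest annotated token to its left."""
    nearest = max(i for i in number if i < verb)
    return number[nearest] == number[verb]


errors = agreement_errors(heads, labels, number)
for subj, verb in errors:
    # The subject and verb are five tokens apart but one arc apart.
    print(tokens[subj], tokens[verb], "distance:", verb - subj)

# The local heuristic compares "row" with "likes" and misses the error.
print(nearest_left_agrees(6, number))
```

Under these assumed annotations, the dependency arc links "students" directly to "likes" across five tokens, while the adjacent noun "row" happens to agree with the verb and hides the error from a local check.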
In this paper, we propose SynGEC, an approach
that can effectively inject the syntactic structure of
the input sentence into the encoder part of GEC
models. The critical challenge here is that off-the-
shelf parsers are unreliable when handling ungram-
matical sentences. On the one hand, off-the-shelf
parsers are trained on clean treebanks that only
consist of grammatical sentences. When parsing
ungrammatical sentences, their performance may
sharply degrade due to the input mismatch. On
the other hand, mainstream syntax representation
schemes, adopted by existing treebanks, do not
cover the non-canonical structures arising from
grammatical errors. Consequently, under such
schemes, it is sometimes difficult to find a plausible
syntactic tree to properly parse an ungrammatical
sentence (e.g., the sentence in Figure 1(d)).
arXiv:2210.12484v1 [cs.CL] 22 Oct 2022