
Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Shirong Ma1∗, Yinghui Li1∗, Rongyi Sun1, Qingyu Zhou2, Shuling Huang1, Ding Zhang1,
Yangning Li1, Ruiyang Liu4, Zhongli Li2, Yunbo Cao2, Hai-Tao Zheng1,3†, Ying Shen5†
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Tencent Cloud Xiaowei, 3Peng Cheng Laboratory
4Department of Computer Science and Technology, Tsinghua University
5School of Intelligent Systems Engineering, Sun Yat-sen University
{masr21,liyinghu20}@mails.tsinghua.edu.cn
Abstract
Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, there are two major limitations in the CGEC field: First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from being significantly improved. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between CGEC models and real applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments1 and detailed analyses not only demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, but also show that our benchmark is an excellent resource for further development of the CGEC field.
1 Introduction
In the field of theoretical linguistics, the native
speaker is the authority of the grammar.
- Noam Chomsky
Chinese Grammatical Error Correction (CGEC) aims to automatically correct grammatical errors that violate language rules and convert noisy input texts into clean output texts (Wang et al., 2020c). In recent years, CGEC has attracted increasing attention from NLP researchers due to its broad applications in all kinds of daily scenarios and downstream tasks (Duan and Hsu, 2011; Kubis et al., 2019; Omelianchuk et al., 2020).

∗ indicates equal contribution. Work was done during Yinghui's internship at Tencent Cloud Xiaowei.
† Corresponding authors: Hai-Tao Zheng and Ying Shen. (E-mail: zheng.haitao@sz.tsinghua.edu.cn, sheny76@mail.sysu.edu.cn)
1 Our dataset and source codes are available at https://github.com/masr2000/CLG-CGEC.
With the progress of deep learning, data-driven methods based on neural networks, e.g., Transformer (Vaswani et al., 2017), have become the mainstream for CGEC (Zhao and Wang, 2020; Tang et al., 2021; Zhang et al., 2022; Li et al., 2022a). However, we argue that two problems remain in CGEC: (1) For model training, owing to the limited number of real sentences containing grammatical errors, the long-term lack of high-quality annotated training corpora hinders many data-driven models from exercising their capabilities on the CGEC task. (2) For model evaluation, the widely used benchmarks such as NLPCC (Zhao et al., 2018) and CGED (Rao et al., 2018, 2020) are all derived from grammatical errors made by foreign learners of Chinese (i.e., L2 learners) during their study of the language; the gap between the language usage habits of L2 learners and native Chinese speakers makes the performance of CGEC models in real scenarios unpredictable.
As illustrated in Table 1, the samples of NLPCC and CGED both come from L2 learners, so their sentence structures are relatively short and simple. More crucially, the grammatical errors in these samples are quite obvious and simplistic. In contrast, in the third example, the erroneous sentence is fluent on the whole, which shows that grammatical errors made by native speakers are more subtle: they are in line with the speaking habits of daily communication, yet they do not conform to linguistic norms. Therefore, in the broader scenarios of Chinese usage beyond foreigners learning Chinese, we believe that erroneous sentences produced by native speakers are more valuable and can better evaluate model performance than errors made by L2 learners.