
FCGEC: Fine-Grained Corpus for Chinese Grammatical Error
Correction
Lvxiaowei Xu1, Jianwang Wu1, Jiawei Peng1, Jiayu Fu2, Ming Cai1∗
1Department of Computer Science and Technology, Zhejiang University
2Base Station Platform Software Development Dept, Huawei Co., Ltd
1{xlxw, wujw, pengjw, cm}@zju.edu.cn
2fionafu0808@gmail.com
Abstract
Grammatical Error Correction (GEC) has recently been broadly applied in automatic correction and proofreading systems. However, Chinese GEC remains immature due to the limited scale and category coverage of high-quality data from native speakers. In this paper, we present FCGEC, a fine-grained corpus to detect, identify and correct grammatical errors. FCGEC is a human-annotated corpus with multiple references, consisting of 41,340 sentences collected mainly from multi-choice questions in public school Chinese examinations. Furthermore, we propose a Switch-Tagger-Generator (STG) baseline model to correct grammatical errors in low-resource settings. Experimental results illustrate that STG outperforms other GEC benchmark models on our FCGEC. However, a significant gap remains between benchmark models and humans, which we hope future models will bridge. Our annotation corpus and codes are available at https://github.com/xlxwalex/FCGEC†.
1 Introduction
Grammatical error correction (GEC) is a complex task, aiming at detecting, identifying and correcting various grammatical errors in a given sentence. GEC has recently attracted more attention due to its ability to correct and proofread text, which can serve a variety of industries such as education, media and publishing (Wang et al., 2021b).
However, Chinese GEC (CGEC) is still confronted with the following three obstacles: (1) Lack of data. The major obstacle in CGEC is that high-quality manually annotated data is limited compared to other languages (Dahlmeier et al., 2013; Napoles et al., 2017; Rozovskaya and Roth, 2019; Bryant et al., 2019; Flachs et al., 2020; Trinh and Rozovskaya, 2021).

∗Corresponding author.
†Online evaluation site: https://codalab.lisn.upsaclay.fr/competitions/8020.

There are only five
publicly accessible datasets in CGEC: NLPCC18 (Zhao et al., 2018), CGED (Rao et al., 2020), CTC-Qua, YACLC (Wang et al., 2021a) and MuCGEC (Zhang et al., 2022). (2) Data sources are non-native speakers. The sentences in NLPCC18, CGED, YACLC and MuCGEC are all collected from Chinese as a Foreign Language (CFL) learner sources. However, the errors made by native speakers rarely arise in these sources, and native speaker errors are more challenging because they include pragmatic errors. Though CTC-Qua covers grammatical errors made by native speakers, its scale is insufficient, with only 972 sentences. (3)
Limited multiple references. For an erroneous sentence, there are often several valid ways to correct it. A sentence revised by a model may be correct yet differ from the ground truth, which can cause unexpected performance degradation (Bryant and Ng, 2015). Besides, more references offer various correction schemas, enabling the model to accommodate more scenarios. Among CGEC datasets, only MuCGEC and YACLC provide rich references.
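The evaluation problem described above can be sketched with a toy exact-match scorer; the sentences and references below are illustrative examples, not drawn from FCGEC or any of the datasets mentioned:

```python
# Minimal sketch: scoring a model's correction against a single
# reference versus multiple references. Sentences are illustrative.
def exact_match(hypothesis, references):
    """Return 1 if the hypothesis matches any reference, else 0."""
    return int(any(hypothesis == ref for ref in references))

# A model output that is a valid correction but differs from the
# only reference in a single-reference setting.
hypothesis = "他昨天去了图书馆。"
single_ref = ["昨天他去了图书馆。"]
multi_refs = ["昨天他去了图书馆。", "他昨天去了图书馆。"]

assert exact_match(hypothesis, single_ref) == 0   # penalized unfairly
assert exact_match(hypothesis, multi_refs) == 1   # credited correctly
```

With only one reference, a perfectly valid correction is scored as wrong; adding references recovers the credit, which is why multi-reference corpora give a fairer picture of model quality.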
To tackle the aforementioned obstacles, we present FCGEC, a large-scale fine-grained GEC corpus with multiple references. The sentences in FCGEC are mainly collected from multi-choice questions in public school Chinese examinations. Therefore, our FCGEC is more challenging since it involves more pragmatic data from the examinations of native speakers. As for multiple references, we assign 2 to 4 annotators to each sentence, so that more references can be attained. Moreover, we generate additional references in the annotation process through synonym substitution techniques.
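As an illustrative sketch only (not the paper's actual annotation pipeline), the idea of expanding references by synonym substitution could look as follows; the synonym table entries and the example sentence are hypothetical:

```python
# Illustrative sketch: expand a gold reference into extra references
# by substituting near-synonyms from a small table. The table entries
# below are hypothetical placeholders, not the corpus's resource.
SYNONYMS = {"提高": ["提升"], "非常": ["十分"]}

def expand_references(reference):
    """Return the reference plus variants with one synonym substituted."""
    variants = {reference}
    for word, substitutes in SYNONYMS.items():
        if word in reference:
            for sub in substitutes:
                variants.add(reference.replace(word, sub))
    return sorted(variants)

refs = expand_references("这次改革非常成功。")
# "非常" is replaced by "十分", yielding one additional reference.
```

Each substitution yields a sentence with the same meaning, so a model output using either wording can be matched against the expanded reference set.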
In order to correct grammatical errors, recent works are mostly based on two categories of benchmark models. Sequence-to-sequence (Seq2Seq) approaches regard GEC as a generation task that directly converts an erroneous sentence into a correct one (Yuan and Briscoe, 2016; Zhao and
arXiv:2210.12364v1 [cs.CL] 22 Oct 2022