FCGEC: Fine-Grained Corpus for Chinese Grammatical Error
Correction
Lvxiaowei Xu1, Jianwang Wu1, Jiawei Peng1, Jiayu Fu2, Ming Cai1
1Department of Computer Science and Technology, Zhejiang University
2Base Station Platform Software Development Dept, Huawei Co., Ltd
1{xlxw, wujw, pengjw, cm}@zju.edu.cn
2fionafu0808@gmail.com
Abstract
Grammatical Error Correction (GEC) has recently been broadly applied in automatic correction and proofreading systems. However, Chinese GEC remains immature due to the limited category coverage and scale of high-quality data from native speakers. In this paper, we present FCGEC, a fine-grained corpus for detecting, identifying and correcting grammatical errors. FCGEC is a human-annotated corpus with multiple references, consisting of 41,340 sentences collected mainly from multi-choice questions in public school Chinese examinations. Furthermore, we propose a Switch-Tagger-Generator (STG) baseline model to correct grammatical errors in low-resource settings. Experimental results illustrate that STG outperforms other GEC benchmark models on our FCGEC. However, there remains a significant gap between the benchmark models and humans, which we hope future models will bridge. Our annotation corpus and code are available at https://github.com/xlxwalex/FCGEC.
1 Introduction
Grammatical error correction (GEC) is a complex task that aims at detecting, identifying and correcting various grammatical errors in a given sentence. GEC has recently attracted increasing attention due to its ability to correct and proofread text, which can serve a variety of industries such as education, media and publishing (Wang et al., 2021b).
Online evaluation site: https://codalab.lisn.upsaclay.fr/competitions/8020.

However, Chinese GEC (CGEC) is still confronted with the following three obstacles. (1) Lack of data. The major obstacle in CGEC is that high-quality manually annotated data is limited compared to other languages (Dahlmeier et al., 2013; Napoles et al., 2017; Rozovskaya and Roth, 2019; Bryant et al., 2019; Flachs et al., 2020; Trinh and Rozovskaya, 2021). There are only five publicly accessible datasets for CGEC: NLPCC18 (Zhao et al., 2018), CGED (Rao et al., 2020), CTC-Qua, YACLC (Wang et al., 2021a) and MuCGEC (Zhang et al., 2022). (2) Data sources are non-native speakers. The sentences in NLPCC18, CGED, YACLC and MuCGEC are all collected from Chinese as a Foreign Language (CFL) learner sources. However, the errors that native speakers typically make rarely arise in these sources; such native-speaker errors are more challenging because they include pragmatic errors. Although CTC-Qua covers grammatical errors made by native speakers, its scale is insufficient, with only 972 sentences. (3) Limited multiple references. For an erroneous sentence, there tend to be several possible corrections. A sentence revised by a model may be correct yet differ from the ground truth, which can cause unexpected performance degradation (Bryant and Ng, 2015). Besides, more references offer various correction schemas, enabling the model to accommodate more scenarios. Among CGEC datasets, only MuCGEC and YACLC provide rich references.
To tackle the aforementioned obstacles, we present FCGEC, a large-scale fine-grained GEC corpus with multiple references. The sentences in FCGEC are mainly collected from multi-choice questions in public school Chinese examinations. Therefore, FCGEC is more challenging, since it involves more pragmatic data from the examinations of native speakers. As for multiple references, we assign 2 to 4 annotators to each sentence, so that more references can be attained. Moreover, we generate additional references during the annotation process through synonym substitution techniques.
To correct grammatical errors, recent works are mostly based on two categories of benchmark models. Sequence-to-sequence (Seq2Seq) approaches regard GEC as a generation task that straightforwardly converts an erroneous sentence into a correct one (Yuan and Briscoe, 2016; Zhao and Wang, 2020; Fu et al., 2018). However, training such a generation model requires more computational resources due to the autoregressive decoder. Moreover, the output style of Seq2Seq models is relatively arbitrary, which is not well suited to the GEC task. More recently, sequence-to-edit (Seq2Edit) approaches have gained interest; they treat GEC as a token-level labeling task (Awasthi et al., 2019; Omelianchuk et al., 2020) over different edits, such as insert, delete, etc. Nevertheless, previous work falls short of altering the word order and correcting other errors simultaneously, instead requiring iterative refinement.
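To make the Seq2Edit formulation concrete, the toy snippet below is our own illustration with generic tag names, not any particular system's tag set: each source token receives an edit tag, and applying the tags yields the corrected sentence.

# Illustrative per-token edit tagging (generic tags, not a specific model's scheme).
# Applying the tags to the source tokens yields the corrected sentence.
src  = ["We", "should", "prevent", "accidents", "from", "not", "occurring", "."]
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "DELETE", "KEEP", "KEEP"]

corrected = [tok for tok, tag in zip(src, tags) if tag != "DELETE"]
print(" ".join(corrected))  # We should prevent accidents from occurring .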
To fill these gaps, we propose the Switch-Tagger-Generator (STG) model, inspired by Mallinson et al. (2020), as an effective baseline for correcting grammatical errors in low-resource settings. STG can be decomposed into three modules: the Switch module determines the permutation of characters, while the Tagger module identifies the operation tag of each character in the sequence. Notably, benefiting from carefully designed compound tags, we eliminate the necessity for iteration. As for the Generator module, we adopt a non-autoregressive approach to fill in the characters that do not appear in the source.
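As a rough picture of this decomposition, the minimal sketch below views the three modules as three heads on a shared encoder; layer sizes, the tag inventory and the pointer-style Switch scoring are illustrative assumptions, not the released implementation.

# Conceptual sketch of the STG decomposition; shapes and the tag set are assumptions.
import torch
import torch.nn as nn

class STGSketch(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 768, num_tags: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.switch_scorer = nn.Linear(hidden, hidden)  # scores character reordering (Switch)
        self.tagger = nn.Linear(hidden, num_tags)       # per-character operation tags (Tagger)
        self.generator = nn.Linear(hidden, vocab_size)  # non-autoregressive char filling (Generator)

    def forward(self, src_ids: torch.Tensor):
        h = self.encoder(self.embed(src_ids))                      # (B, L, H)
        switch_logits = self.switch_scorer(h) @ h.transpose(1, 2)  # (B, L, L) pairwise ordering scores
        tag_logits = self.tagger(h)                                # (B, L, num_tags)
        gen_logits = self.generator(h)                             # (B, L, vocab_size)
        return switch_logits, tag_logits, gen_logits

In this view, the Switch output reorders the characters, the Tagger's compound tags mark deletions and insertion slots, and the Generator fills those slots in a single non-autoregressive pass, which is what removes the need for iterative decoding.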
In summary, our contributions are as follows:

1. We present FCGEC, a large-scale fine-grained corpus with multiple references and more challenging errors for CGEC.

2. We propose the STG model and conduct experiments comparing it with two categories of benchmark models (Seq2Seq and Seq2Edit). Experimental results illustrate that our STG model outperforms these models on FCGEC.

3. We find a significant gap between human performance and the benchmark models, which encourages future models to bridge it.
2 Corpus Construction
2.1 Data Collection
We collect raw sentences mainly from two resources in order to obtain a varied corpus of Chinese grammatical errors from native speakers. (1) Public examination websites. We crawl multi-choice grammatical error problems (which contain more erroneous sentences than correct ones) from public websites that host exercises and exams designed by teachers and experts. These problems cover public school examinations for native students from elementary to high school. (2) News aggregator sites. To balance the quantity of erroneous and correct sentences, we obtain a diverse range of high-quality sentences without grammatical errors from news aggregator sites.

In total, we collect 54,026 raw sentences from these resources. After removing duplicated or incomplete sentences, there are 41,340 sentences in our FCGEC corpus. We describe the data sources and data structures in detail in Appendices A and B.
2.2 Fine-grained Data Format
To facilitate models for grammatical error detection and correction, we designate three hierarchical levels of gold labels in FCGEC as follows:

Detection Level. As a preliminary step toward correcting grammatical errors, this level requires a binary classification of a given sentence according to whether or not it contains grammatical errors.

Identification Level. The labels at this level constitute a multi-class categorization problem. As the examples in Table 1 show, we group grammatical errors into seven categories based on the error hierarchy. The definitions of the error types are as follows: Incorrect Word Collocation (IWC) is a word-level grammatical error in which related words are combined in an improper pattern. Component Missing (CM) and Component Redundancy (CR) are also word-level errors, in which some components (e.g., the subject or object) of the sentence are missing or redundant. Structure Confusion (SC) is a syntax-level grammatical error that blends two similar grammatical structures into a single incorrect one. Incorrect Word Order (IWO) covers grammatical errors at the word level and the pragmatic level. Compared to previous work (Zhang et al., 2022), we also take into account errors that require logic and common sense on top of syntax (e.g., recursive relationships). Illogical (ILL) and Ambiguity (AM) are pragmatic errors: the former comprises contradictory statements, while the latter includes expressions with indeterminate meanings.

IWC: You have smart hands to do everything. (Tips: "hands" cannot be combined with "smart")
CM: Plants have (the ability) to produce oxygen. (Tips: lack of the object "the ability")
CR: We had walked about 10 miles or so. (Tips: "about" and "or so" are redundant)
SC: Traffic accidents are caused by (because) looking at cell phones while driving. (Tips: the structures "because" and "caused by" cannot appear together in one sentence)
IWO: I corrected and realized my fault. (Tips: one should realize the fault first and correct it later)
ILL: We should prevent accidents from not occurring. (Tips: the double negation causes an illogical error)
AM: As the door opened, the doctor/patient came in. (Tips: there is ambiguity about who comes in)
Table 1: Examples of the different types of errors (English glosses of the original Chinese examples).
Correction Level. At the correction level, we propose an operation-oriented paradigm to construct GEC labels, instead of the error-coded or rewriting paradigms utilized in previous works (Ng et al., 2014; Sakaguchi et al., 2016). In the rewriting paradigm, annotators directly rewrite the raw sentences into correct sentences without grammatical errors. However, it is difficult for annotators to rewrite in a consistent style, which leads to a drop in annotation quality. As for the error-coded paradigm, annotators may diverge in determining the boundaries of the erroneous spans, which raises the complexity of the procedure.

Figure 1: An example of the operation-oriented paradigm, in which a sentence is corrected by [Delete] "not", [Modify] "emerging" → "happening", [Switch] "fortify"/"build", and [Insert] "shelters".
In contrast, the operation-oriented paradigm is based on four fundamental correction operations: Insert, Delete, Modify and Switch. As the example in Figure 1 shows, this paradigm is more compatible with annotators' conventions when correcting errors. Meanwhile, annotators only need to consider which operations are required to correct a sentence, instead of paying attention to the erroneous span (e.g., the selection of words is left to post-processing for unified optimization). In addition, we have a large number of correction prompts (explanations of the grammatical error problems) developed by teachers and experts based on these four operations, which can be utilized to accelerate the annotation process.
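To make the three-tier labels and the operation paradigm concrete, the snippet below is a hypothetical illustration; the field names, tag values and spans are our own assumptions for exposition, not the released FCGEC schema, which is documented in Appendix B and the repository.

# Hypothetical representation of one annotated instance (illustrative only).
example = {
    "sentence": "We should prevent accidents from not occurring.",
    "error_flag": True,          # Detection level: does the sentence contain an error?
    "error_type": "ILL",         # Identification level: IWC / CM / CR / SC / IWO / ILL / AM
    "references": [              # Correction level: each reference is a list of operations
        [{"op": "Delete", "span": "not"}],   # drop the redundant negation
    ],
}

Additional references would simply add further operation lists, each of which independently turns the source sentence into a valid correction.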
2.3 Annotation Procedure
The annotators are asked to follow the given prompts to complete the three levels of labeling. Notably, we allow annotators to add an unlimited number of references to sentences with grammatical errors, based on the four operations at the correction level.

To improve annotation efficiency, we developed a visual online tool to support the annotation procedure. In addition, we applied pattern matching and rule-based scripts to automatically convert a large proportion (72.3%) of the prompts into operation labels. We show the interface of our visual annotation tool in Appendix C.
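As an illustration of such rule-based conversion, templated prompt phrasings map almost directly onto operations; the two patterns below are our own guesses at typical templates, not the scripts actually used for FCGEC.

# Hedged sketch: map templated correction prompts to operation labels.
# 将“X”改为“Y” = change X to Y; 删去“X” = delete X. Real prompts need a larger
# rule set and a manual fallback when no pattern matches.
import re

RULES = [
    (re.compile(r'将“(?P<src>[^”]+)”改为“(?P<tgt>[^”]+)”'),
     lambda m: {"op": "Modify", "before": m["src"], "after": m["tgt"]}),
    (re.compile(r'删去“(?P<src>[^”]+)”'),
     lambda m: {"op": "Delete", "target": m["src"]}),
]

def prompt_to_operations(prompt: str):
    ops = []
    for pattern, build in RULES:
        for match in pattern.finditer(prompt):
            ops.append(build(match))
    return ops  # an empty list means the prompt falls back to manual annotation

print(prompt_to_operations('删去“不”，将“出现”改为“发生”。'))
# [{'op': 'Modify', 'before': '出现', 'after': '发生'}, {'op': 'Delete', 'target': '不'}]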
As for the annotation process, we hire 20 undergraduate students and 4 expert examiners to annotate and verify the GEC labels. Following the annotation procedure of SuperGLUE (Wang et al., 2019a), each annotator first works on test data.
After that, they can compare their labels with the gold ones. We encourage them to discuss their mistakes, questions and standards with other annotators and experts. To attain high-quality annotation with multiple references, we duplicate the sentences in our corpus 2 to 4 times. Furthermore, it is guaranteed that the duplicated sentences are annotated by different annotators. Experts are then asked to review data on which the annotators could not agree on the labels and to add reasonable references. It is worth mentioning that we search for possible synonyms of the characters generated by Insert and Modify operations during annotation. We believe that supplying more word choices to annotators can improve the multi-reference rate. Moreover, we set up a weekly communication meeting to discuss common issues in annotation and to adapt the labeling criteria. In total, the entire annotation procedure lasted for more than 4 months.
2.4 Quality Control
Corpus Source Paradigm #Sent #Error #Refs #Length
NLPCC (2018) CFL Error-coded 2000 1983 (99.15%) 1.1 29.7
CGED CFL Error-coded 30145 25837 (85.71%) 1.0 46.6
CTC-Qua (2021) Native Error-coded 972 482 (49.59%) 1.0 48.9
MuCGEC (2022) CFL Rewriting 7063 6544 (92.65%) 2.3 38.5
FCGEC (Ours) Native Operation 41340 22517 (54.47%) 1.7 53.1
Table 2: Comparison of different Chinese grammatical error correction corpora. The #Error column gives the number (and percentage) of erroneous sentences in each corpus. #Refs indicates the average number of references per sentence, and #Length the average number of characters per sentence. Note that CGED is a combined corpus covering 2016 to 2018 (Rao et al., 2018, 2020).

Subset Sent. Err. #S #D #I #M
Train 36340 19761 3930 10468 8705 7459
Valid 2000 1102 262 465 553 453
Test 3000 1654 421 1496 919 746
Table 3: Statistics of FCGEC: the number of sentences, the number of erroneous sentences, and the number of each of the four operations (#S, #D, #I, #M denote Switch, Delete, Insert and Modify, respectively).

To ensure the high quality of our FCGEC, we adopt the following five criteria. (1) Each sentence is inspected by two specialized annotators to correct spelling and punctuation errors before annotation; meanwhile, they eliminate incomplete sentences (caused by unexpected text truncation). (2) While checking spelling errors, the specialized annotators also tag sentences from the news aggregator source that might have grammatical problems; these potential sentences are then discussed in the weekly communication meeting. (3) We ask the annotators to read our guidelines and annotate twenty test instances, after which experts check the accuracy of their annotation. Annotators who meet the accuracy threshold (90%) may continue to label the official data. (4) We assign 2 to 4 annotators per sentence. In case their annotations differ, the experts determine the correct labels; annotators can then learn from these mistakes to achieve self-improvement. (5) After annotation, we unify the annotated labels under the minimal operation criteria inspired by Dahlmeier and Ng (2012), which favor applying as few operations as possible when correcting grammatical errors (a sketch of this idea follows below). More details about the minimal operation algorithm are described in Appendix E.
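The sketch below conveys the minimal-operation intuition under our own simplifying assumptions (character-level diffing, no handling of the Switch operation); it is not the algorithm described in Appendix E.

# Hedged sketch: recompute a character-level edit script between the source and a
# corrected reference, then prefer annotations that need the fewest operations.
# difflib is used for illustration only; Switch (reordering) is not modelled here.
from difflib import SequenceMatcher

def edit_operations(src: str, ref: str):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=src, b=ref).get_opcodes():
        if tag == "replace":
            ops.append(("Modify", src[i1:i2], ref[j1:j2]))
        elif tag == "delete":
            ops.append(("Delete", src[i1:i2], ""))
        elif tag == "insert":
            ops.append(("Insert", "", ref[j1:j2]))
    return ops

def minimal_reference(src: str, references: list) -> str:
    # Among several valid corrections, keep the one with the fewest operations.
    return min(references, key=lambda ref: len(edit_operations(src, ref)))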
2.5 Data Statistics and Comparison
Figure 2: Correlation between types and operations.

We compare our corpus with other Chinese grammatical error datasets in Table 2. Moreover, the concrete statistics of FCGEC are shown in Table 3 and Appendix D. We summarize the advantages of our FCGEC in the following three aspects:
Multiple References. As discussed in Bryant and Ng (2015) and Zhang et al. (2022), the training and evaluation of GEC models can benefit from multiple references. In order to obtain more references, we ask the annotators to submit different reasonable operations for correcting errors. Meanwhile, we specifically provide several choices of synonyms for the generated text during annotation. We search for synonyms at both fine and coarse granularity: the fine-grained approach obtains synonyms from electronic dictionaries, while the coarse-grained approach relies on the similarity of word vectors (a sketch follows below). This enhances the ratio of multiple references.
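As an illustration of the coarse-grained side, nearest neighbours in a word-vector space can serve as synonym candidates offered to annotators; the embedding file, library and threshold below are placeholders rather than the paper's actual setup.

# Hedged sketch: coarse-grained synonym candidates via word-vector similarity.
from gensim.models import KeyedVectors

# Hypothetical pretrained Chinese word vectors in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("zh_word_vectors.txt", binary=False)

def synonym_candidates(word: str, topn: int = 5, min_sim: float = 0.6):
    if word not in vectors:
        return []
    # Keep near neighbours above a similarity threshold as extra word choices.
    return [w for w, sim in vectors.most_similar(word, topn=topn) if sim >= min_sim]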
More Pragmatic Data. Pragmatic data involves errors of logic, common sense, ambiguity, etc. We increase the proportion of pragmatic data (Table 6) compared to other CGEC datasets, thus rendering the data more complex and challenging. Notably, we resolve ambiguity errors by providing references that correct the sentence toward each of its possible meanings.
Effective Error Types. We assign more refined error types to the grammatical errors, and these types are closely related to the correction operations. As shown in Figure 2, error types are always highly relevant to particular operations (e.g., CM