FCGEC: Fine-Grained Corpus for Chinese Grammatical Error
Correction
Lvxiaowei Xu1, Jianwang Wu1, Jiawei Peng1, Jiayu Fu2, Ming Cai1
1Department of Computer Science and Technology, Zhejiang University
2Base Station Platform Software Development Dept, Huawei Co., Ltd
1{xlxw, wujw, pengjw, cm}@zju.edu.cn
2fionafu0808@gmail.com
Abstract
Grammatical Error Correction (GEC) has recently been broadly applied in automatic correction and proofreading systems. However, Chinese GEC remains immature due to the limited category coverage and scale of high-quality data from native speakers. In this paper, we present FCGEC, a fine-grained corpus for detecting, identifying and correcting grammatical errors. FCGEC is a human-annotated corpus with multiple references, consisting of 41,340 sentences collected mainly from multi-choice questions in public school Chinese examinations. Furthermore, we propose a Switch-Tagger-Generator (STG) baseline model to correct grammatical errors in low-resource settings. Experimental results illustrate that STG outperforms other GEC benchmark models on our FCGEC. However, there remains a significant gap between the benchmark models and humans, which we hope future models will bridge. Our annotation corpus and code are available at https://github.com/xlxwalex/FCGEC.
1 Introduction
Grammatical error correction (GEC) is a complex task that aims at detecting, identifying and correcting various grammatical errors in a given sentence. GEC has recently attracted increasing attention due to its ability to correct and proofread text, which can serve a variety of industries such as education, media and publishing (Wang et al., 2021b).
Online evaluation site: https://codalab.lisn.upsaclay.fr/competitions/8020.

However, Chinese GEC (CGEC) is still confronted with the following three obstacles. (1) Lack of data. The major obstacle in CGEC is that high-quality manually annotated data is limited compared to other languages (Dahlmeier et al., 2013; Napoles et al., 2017; Rozovskaya and Roth, 2019; Bryant et al., 2019; Flachs et al., 2020; Trinh and Rozovskaya, 2021). There are only five publicly accessible datasets for CGEC: NLPCC18 (Zhao et al., 2018), CGED (Rao et al., 2020), CTC-Qua, YACLC (Wang et al., 2021a) and MuCGEC (Zhang et al., 2022). (2) Data sources are non-native speakers. The sentences in NLPCC18, CGED, YACLC and MuCGEC are all collected from Chinese as a Foreign Language (CFL) learner sources. However, the errors that native speakers typically make rarely arise in these sources; such native-speaker errors are more challenging because they include pragmatic errors. Although CTC-Qua covers grammatical errors made by native speakers, its scale is insufficient, with only 972 sentences. (3) Limited multiple references. For an erroneous sentence, there tend to be several possible corrections. A sentence revised by a model may be correct yet differ from the ground truth, which can cause unexpected performance degradation (Bryant and Ng, 2015). Besides, more references offer various correction schemas, enabling the model to accommodate more scenarios. Among CGEC datasets, only MuCGEC and YACLC provide rich references.
To tackle the aforementioned obstacles, we present FCGEC, a large-scale fine-grained GEC corpus with multiple references. The sentences in FCGEC are mainly collected from multi-choice questions in public school Chinese examinations. Therefore, FCGEC is more challenging, since it involves more pragmatic data from the examinations of native speakers. As for multiple references, we assign 2 to 4 annotators to each sentence, so that more references can be attained. Moreover, we generate additional references during the annotation process through synonym substitution techniques.
To correct grammatical errors, recent works are mostly based on two categories of benchmark models. Sequence-to-sequence (Seq2Seq) approaches regard GEC as a generation task that straightforwardly converts an erroneous sentence into a correct one (Yuan and Briscoe, 2016; Zhao and Wang, 2020; Fu et al., 2018). However, training such a generation model requires more computational resources due to the autoregressive decoder. Moreover, the output style of Seq2Seq models is relatively arbitrary, which is not well suited to the GEC task. More recently, sequence-to-edit (Seq2Edit) approaches have gained interest; they treat GEC as a token-level labeling task (Awasthi et al., 2019; Omelianchuk et al., 2020) over different edits, such as insert, delete, etc. Nevertheless, previous work falls short of altering the word order and correcting other errors simultaneously, instead requiring iterative refinement.
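To make the Seq2Edit formulation concrete, the toy snippet below is our own illustration with generic tag names, not any particular system's tag set: each source token receives an edit tag, and applying the tags yields the corrected sentence.

# Illustrative per-token edit tagging (generic tags, not a specific model's scheme).
# Applying the tags to the source tokens yields the corrected sentence.
src  = ["We", "should", "prevent", "accidents", "from", "not", "occurring", "."]
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "DELETE", "KEEP", "KEEP"]

corrected = [tok for tok, tag in zip(src, tags) if tag != "DELETE"]
print(" ".join(corrected))  # We should prevent accidents from occurring .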
To fill these gaps, we propose the Switch-Tagger-Generator (STG) model, inspired by Mallinson et al. (2020), as an effective baseline for correcting grammatical errors in low-resource settings. STG can be decomposed into three modules: the Switch module determines the permutation of characters, while the Tagger module identifies the operation tag of each character in the sequence. Notably, benefiting from carefully designed compound tags, we eliminate the necessity for iteration. As for the Generator module, we adopt a non-autoregressive approach to fill in the characters that do not appear in the source.
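As a rough picture of this decomposition, the minimal sketch below views the three modules as three heads on a shared encoder; layer sizes, the tag inventory and the pointer-style Switch scoring are illustrative assumptions, not the released implementation.

# Conceptual sketch of the STG decomposition; shapes and the tag set are assumptions.
import torch
import torch.nn as nn

class STGSketch(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 768, num_tags: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.switch_scorer = nn.Linear(hidden, hidden)  # scores character reordering (Switch)
        self.tagger = nn.Linear(hidden, num_tags)       # per-character operation tags (Tagger)
        self.generator = nn.Linear(hidden, vocab_size)  # non-autoregressive char filling (Generator)

    def forward(self, src_ids: torch.Tensor):
        h = self.encoder(self.embed(src_ids))                      # (B, L, H)
        switch_logits = self.switch_scorer(h) @ h.transpose(1, 2)  # (B, L, L) pairwise ordering scores
        tag_logits = self.tagger(h)                                # (B, L, num_tags)
        gen_logits = self.generator(h)                             # (B, L, vocab_size)
        return switch_logits, tag_logits, gen_logits

In this view, the Switch output reorders the characters, the Tagger's compound tags mark deletions and insertion slots, and the Generator fills those slots in a single non-autoregressive pass, which is what removes the need for iterative decoding.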
In summary, our contributions are as follows:

1. We present FCGEC, a large-scale fine-grained corpus with multiple references and more challenging errors for CGEC.

2. We propose the STG model and conduct experiments comparing it with two categories of benchmark models (Seq2Seq and Seq2Edit). Experimental results illustrate that our STG model outperforms these models on FCGEC.

3. We find a significant gap between human performance and the benchmark models, which encourages future models to bridge it.
2 Corpus Construction
2.1 Data Collection
We collect raw sentences mainly from two resources in order to obtain a varied corpus of Chinese grammatical errors from native speakers. (1) Public examination websites. We crawl multi-choice grammatical error problems (which contain more erroneous sentences than correct ones) from public websites that host exercises and exams designed by teachers and experts. These problems cover public school examinations for native students from elementary to high school. (2) News aggregator sites. To balance the quantity of erroneous and correct sentences, we obtain a diverse range of high-quality sentences without grammatical errors from news aggregator sites.

In total, we collect 54,026 raw sentences from these resources. After removing duplicated or incomplete sentences, there are 41,340 sentences in our FCGEC corpus. We describe the data sources and data structures in detail in Appendices A and B.
2.2 Fine-grained Data Format
To facilitate models for grammatical error detection and correction, we designate three hierarchical levels of gold labels in FCGEC as follows:

Detection Level. As a preliminary step toward correcting grammatical errors, this level requires a binary classification of a given sentence according to whether or not it contains grammatical errors.

Identification Level. The labels at this level constitute a multi-class categorization problem. As the examples in Table 1 show, we group grammatical errors into seven categories based on the error hierarchy. The definitions of the error types are as follows: Incorrect Word Collocation (IWC) is a word-level grammatical error in which related words are combined in an improper pattern. Component Missing (CM) and Component Redundancy (CR) are also word-level errors, in which some components (e.g., the subject or object) of the sentence are missing or redundant. Structure Confusion (SC) is a syntax-level grammatical error that blends two similar grammatical structures into a single incorrect one. Incorrect Word Order (IWO) covers grammatical errors at the word level and the pragmatic level. Compared to previous work (Zhang et al., 2022), we also take into account errors that require logic and common sense on top of syntax (e.g., recursive relationships). Illogical (ILL) and Ambiguity (AM) are pragmatic errors: the former comprises contradictory statements, while the latter includes expressions with indeterminate meanings.

IWC: You have smart hands to do everything. (Tips: "hands" cannot be combined with "smart")
CM: Plants have (the ability) to produce oxygen. (Tips: lack of the object "the ability")
CR: We had walked about 10 miles or so. (Tips: "about" and "or so" are redundant)
SC: Traffic accidents are caused by (because) looking at cell phones while driving. (Tips: the structures "because" and "caused by" cannot appear together in one sentence)
IWO: I corrected and realized my fault. (Tips: one should realize the fault first and correct it later)
ILL: We should prevent accidents from not occurring. (Tips: the double negation causes an illogical error)
AM: As the door opened, the doctor/patient came in. (Tips: there is ambiguity about who comes in)
Table 1: Examples of the different types of errors (English glosses of the original Chinese examples).
Correction Level. At the correction level, we propose an operation-oriented paradigm to construct GEC labels, instead of the error-coded or rewriting paradigms utilized in previous works (Ng et al., 2014; Sakaguchi et al., 2016). In the rewriting paradigm, annotators directly rewrite the raw sentences into correct sentences without grammatical errors. However, it is difficult for annotators to rewrite in a consistent style, which leads to a drop in annotation quality. As for the error-coded paradigm, annotators may diverge in determining the boundaries of the erroneous spans, which raises the complexity of the procedure.

Figure 1: An example of the operation-oriented paradigm, in which a sentence is corrected by [Delete] "not", [Modify] "emerging" → "happening", [Switch] "fortify"/"build", and [Insert] "shelters".
In contrast, the operation-oriented paradigm is based on four fundamental correction operations: Insert, Delete, Modify and Switch. As the example in Figure 1 shows, this paradigm is more compatible with annotators' conventions when correcting errors. Meanwhile, annotators only need to consider which operations are required to correct a sentence, instead of paying attention to the erroneous span (e.g., the selection of words is left to post-processing for unified optimization). In addition, we have a large number of correction prompts (explanations of the grammatical error problems) developed by teachers and experts based on these four operations, which can be utilized to accelerate the annotation process.
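To make the three-tier labels and the operation paradigm concrete, the snippet below is a hypothetical illustration; the field names, tag values and spans are our own assumptions for exposition, not the released FCGEC schema, which is documented in Appendix B and the repository.

# Hypothetical representation of one annotated instance (illustrative only).
example = {
    "sentence": "We should prevent accidents from not occurring.",
    "error_flag": True,          # Detection level: does the sentence contain an error?
    "error_type": "ILL",         # Identification level: IWC / CM / CR / SC / IWO / ILL / AM
    "references": [              # Correction level: each reference is a list of operations
        [{"op": "Delete", "span": "not"}],   # drop the redundant negation
    ],
}

Additional references would simply add further operation lists, each of which independently turns the source sentence into a valid correction.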
2.3 Annotation Procedure
The annotators are asked to follow the given prompts to complete the three levels of labeling. Notably, we allow annotators to add an unlimited number of references to sentences with grammatical errors, based on the four operations at the correction level.

To improve annotation efficiency, we developed a visual online tool to support the annotation procedure. In addition, we applied pattern matching and rule-based scripts to automatically convert a large proportion (72.3%) of the prompts into operation labels. We show the interface of our visual annotation tool in Appendix C.
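As an illustration of such rule-based conversion, templated prompt phrasings map almost directly onto operations; the two patterns below are our own guesses at typical templates, not the scripts actually used for FCGEC.

# Hedged sketch: map templated correction prompts to operation labels.
# 将“X”改为“Y” = change X to Y; 删去“X” = delete X. Real prompts need a larger
# rule set and a manual fallback when no pattern matches.
import re

RULES = [
    (re.compile(r'将“(?P<src>[^”]+)”改为“(?P<tgt>[^”]+)”'),
     lambda m: {"op": "Modify", "before": m["src"], "after": m["tgt"]}),
    (re.compile(r'删去“(?P<src>[^”]+)”'),
     lambda m: {"op": "Delete", "target": m["src"]}),
]

def prompt_to_operations(prompt: str):
    ops = []
    for pattern, build in RULES:
        for match in pattern.finditer(prompt):
            ops.append(build(match))
    return ops  # an empty list means the prompt falls back to manual annotation

print(prompt_to_operations('删去“不”，将“出现”改为“发生”。'))
# [{'op': 'Modify', 'before': '出现', 'after': '发生'}, {'op': 'Delete', 'target': '不'}]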
As for the annotation process, we hire 20 undergraduate students and 4 expert examiners to annotate and verify the GEC labels. Following the annotation procedure of SuperGLUE (Wang et al., 2019a), each annotator first works on test data.
After that, they can compare their labels with the gold ones. We encourage them to discuss their mistakes, questions and standards with other annotators and experts. To attain high-quality annotation with multiple references, we duplicate the sentences in our corpus 2 to 4 times. Furthermore, it is guaranteed that the duplicated sentences are annotated by different annotators. Experts are then asked to review data on which the annotators could not agree on the labels and to add reasonable references. It is worth mentioning that we search for possible synonyms of the characters generated by Insert and Modify operations during annotation. We believe that supplying more word choices to annotators can improve the multi-reference rate. Moreover, we set up a weekly communication meeting to discuss common issues in annotation and to adapt the labeling criteria. In total, the entire annotation procedure lasted for more than 4 months.
2.4 Quality Control
Corpus Source Paradigm #Sent #Error #Refs #Length
NLPCC (2018) CFL Error-coded 2000 1983 (99.15%) 1.1 29.7
CGED CFL Error-coded 30145 25837 (85.71%) 1.0 46.6
CTC-Qua (2021) Native Error-coded 972 482 (49.59%) 1.0 48.9
MuCGEC (2022) CFL Rewriting 7063 6544 (92.65%) 2.3 38.5
FCGEC (Ours) Native Operation 41340 22517 (54.47%) 1.7 53.1
Table 2: Comparison of different Chinese grammatical error correction corpora. The #Error column gives the number (and percentage) of erroneous sentences in each corpus. #Refs indicates the average number of references per sentence, and #Length the average number of characters per sentence. Note that CGED is a combined corpus covering 2016 to 2018 (Rao et al., 2018, 2020).

Subset Sent. Err. #S #D #I #M
Train 36340 19761 3930 10468 8705 7459
Valid 2000 1102 262 465 553 453
Test 3000 1654 421 1496 919 746
Table 3: Statistics of FCGEC: the number of sentences, the number of erroneous sentences, and the number of each of the four operations (#S, #D, #I, #M denote Switch, Delete, Insert and Modify, respectively).

To ensure the high quality of our FCGEC, we adopt the following five criteria. (1) Each sentence is inspected by two specialized annotators to correct spelling and punctuation errors before annotation; meanwhile, they eliminate incomplete sentences (caused by unexpected text truncation). (2) While checking spelling errors, the specialized annotators also tag sentences from the news aggregator source that might have grammatical problems; these potential sentences are then discussed in the weekly communication meeting. (3) We ask the annotators to read our guidelines and annotate twenty test instances, after which experts check the accuracy of their annotation. Annotators who meet the accuracy threshold (90%) may continue to label the official data. (4) We assign 2 to 4 annotators per sentence. In case their annotations differ, the experts determine the correct labels; annotators can then learn from these mistakes to achieve self-improvement. (5) After annotation, we unify the annotated labels under the minimal operation criteria inspired by Dahlmeier and Ng (2012), which favor applying as few operations as possible when correcting grammatical errors (a sketch of this idea follows below). More details about the minimal operation algorithm are described in Appendix E.
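The sketch below conveys the minimal-operation intuition under our own simplifying assumptions (character-level diffing, no handling of the Switch operation); it is not the algorithm described in Appendix E.

# Hedged sketch: recompute a character-level edit script between the source and a
# corrected reference, then prefer annotations that need the fewest operations.
# difflib is used for illustration only; Switch (reordering) is not modelled here.
from difflib import SequenceMatcher

def edit_operations(src: str, ref: str):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=src, b=ref).get_opcodes():
        if tag == "replace":
            ops.append(("Modify", src[i1:i2], ref[j1:j2]))
        elif tag == "delete":
            ops.append(("Delete", src[i1:i2], ""))
        elif tag == "insert":
            ops.append(("Insert", "", ref[j1:j2]))
    return ops

def minimal_reference(src: str, references: list) -> str:
    # Among several valid corrections, keep the one with the fewest operations.
    return min(references, key=lambda ref: len(edit_operations(src, ref)))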
2.5 Data Statistics and Comparison
Figure 2: Correlation between types and operations.

We compare our corpus with other Chinese grammatical error datasets in Table 2. Moreover, the concrete statistics of FCGEC are shown in Table 3 and Appendix D. We summarize the advantages of our FCGEC in the following three aspects:
Multiple References. As discussed in Bryant and Ng (2015) and Zhang et al. (2022), the training and evaluation of GEC models can benefit from multiple references. In order to obtain more references, we ask the annotators to submit different reasonable operations for correcting errors. Meanwhile, we specifically provide several choices of synonyms for the generated text during annotation. We search for synonyms at both fine and coarse granularity: the fine-grained approach obtains synonyms from electronic dictionaries, while the coarse-grained approach relies on the similarity of word vectors (a sketch follows below). This enhances the ratio of multiple references.
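As an illustration of the coarse-grained side, nearest neighbours in a word-vector space can serve as synonym candidates offered to annotators; the embedding file, library and threshold below are placeholders rather than the paper's actual setup.

# Hedged sketch: coarse-grained synonym candidates via word-vector similarity.
from gensim.models import KeyedVectors

# Hypothetical pretrained Chinese word vectors in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("zh_word_vectors.txt", binary=False)

def synonym_candidates(word: str, topn: int = 5, min_sim: float = 0.6):
    if word not in vectors:
        return []
    # Keep near neighbours above a similarity threshold as extra word choices.
    return [w for w, sim in vectors.most_similar(word, topn=topn) if sim >= min_sim]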
More Pragmatic Data. Pragmatic data involves errors of logic, common sense, ambiguity, etc. We increase the proportion of pragmatic data (Table 6) compared to other CGEC datasets, thus rendering the data more complex and challenging. Notably, we resolve ambiguity errors by providing references that correct the sentence toward each of its possible meanings.
Effective Error Types. We assign more refined error types to the grammatical errors, and these types are closely related to the correction operations. As shown in Figure 2, error types are always highly relevant to particular operations (e.g., CM