Linguistic Rules-Based Corpus Generation for Native Chinese
Grammatical Error Correction
Shirong Ma1, Yinghui Li1, Rongyi Sun1, Qingyu Zhou2, Shuling Huang1, Ding Zhang1,
Yangning Li1, Ruiyang Liu4, Zhongli Li2, Yunbo Cao2, Hai-Tao Zheng1,3, Ying Shen5
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Tencent Cloud Xiaowei, 3Peng Cheng Laboratory
4Department of Computer Science and Technology, Tsinghua University
5School of Intelligent Systems Engineering, Sun Yat-sen University
{masr21,liyinghu20}@mails.tsinghua.edu.cn
Abstract
Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the field faces two major limitations. First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between CGEC models and real applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses demonstrate not only that the training data constructed by our method effectively improves the performance of CGEC models, but also that our benchmark is an excellent resource for the further development of the CGEC field.
1 Introduction
In the field of theoretical linguistics, the native
speaker is the authority of the grammar.
- Noam Chomsky
Chinese Grammatical Error Correction (CGEC) aims to automatically correct grammatical errors that violate language rules and convert noisy input texts into clean output texts (Wang et al., 2020c). In recent years, CGEC has attracted increasing attention from NLP researchers due to its broad applications in all kinds of daily scenarios and downstream tasks (Duan and Hsu, 2011; Kubis et al., 2019; Omelianchuk et al., 2020).

* Equal contribution. Work done during Yinghui's internship at Tencent Cloud Xiaowei.
Corresponding authors: Hai-Tao Zheng and Ying Shen. (E-mail: zheng.haitao@sz.tsinghua.edu.cn, sheny76@mail.sysu.edu.cn)
1 Our dataset and source codes are available at https://github.com/masr2000/CLG-CGEC.
With the progress of deep learning, data-driven methods based on neural networks, e.g., the Transformer (Vaswani et al., 2017), have become the mainstream for CGEC (Zhao and Wang, 2020; Tang et al., 2021; Zhang et al., 2022; Li et al., 2022a). However, we argue that two problems remain in CGEC. (1) For model training, owing to the limited number of real sentences containing grammatical errors, the long-term lack of high-quality annotated training corpora hinders many data-driven models from exercising their capabilities on the CGEC task. (2) For model evaluation, the widely used benchmarks such as NLPCC (Zhao et al., 2018) and CGED (Rao et al., 2018, 2020) are all derived from grammatical errors made by foreign learners of Chinese (i.e., L2 learners) in their process of learning Chinese; the gap between the language usage habits of L2 learners and native Chinese speakers makes the performance of CGEC models in real scenarios unpredictable.
As illustrated in Table 1, the samples of NLPCC and CGED both come from L2 learners, so their sentence structures are relatively short and simple. More crucially, the grammatical errors in these samples are obvious and elementary. In the third example, by contrast, the erroneous sentence is fluent on the whole, which shows that grammatical errors made by native speakers are more subtle: they are in line with the speaking habits of people's daily communication, but they do not conform to linguistic norms. Therefore, in the broader scenarios of Chinese usage beyond foreigners learning Chinese, we believe that erroneous sentences produced by native speakers are more valuable and can better evaluate model performance than errors made by L2 learners.
arXiv:2210.10442v1 [cs.CL] 19 Oct 2022
NLPCC
Incorrect: 冲击
Translation: The news gave me a hit shock.
Correct: 冲击
Translation: The news gave me a big shock.
Source: L2 learners
CGED
Incorrect:
Translation: He was very attracted by the Japanese landscape.
Correct: 深深
Translation: He was deeply attracted by the Japanese landscape.
Source: L2 learners
Native Error
Incorrect: 2017起跑线
Translation: Standing at the starting line of upcoming 2017, we cannot have to help feeling proud and joyful.
Correct: 2017起跑线
Translation: Standing at the starting line of 2017, we cannot help feeling proud and joyful.
Source: Native speakers
Table 1: Examples of grammatical errors from NLPCC, CGED, and native Chinese speakers, respectively.
To alleviate the dilemma of missing large-scale training data, we propose CLG, a novel approach based on Chinese linguistic rules that automatically constructs high-quality ungrammatical sentences from grammatical corpora. Specifically, following authoritative linguistic books (Huang and Liao, 2011; Shao, 2016), we divide Chinese grammatical errors into 6 categories and design detailed grammatical rules to generate erroneous sentences according to the characteristics of each error type. The 6 error types are: Structural Confusion, Improper Logicality, Missing Component, Redundant Component, Improper Collocation, and Improper Word Order. Unlike traditional data augmentation, the ungrammatical sentences generated by CLG more closely match the actual errors that native Chinese speakers make. Benefiting from our proposed CLG, high-quality, large-scale training samples are automatically constructed with annotated error types. Moreover, to fill the gap between existing benchmarks and practical applications, we collect a test dataset containing grammatical errors made by native Chinese speakers in real scenarios, named NaCGEC, which serves as a more challenging benchmark and a meaningful resource to facilitate the further development of CGEC.
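As a concrete illustration, a linguistic rule in this spirit can be viewed as a transformation over tokenized correct sentences that yields an erroneous counterpart plus its error-type label. The rules and word lists below are invented toy examples for illustration only, not CLG's actual linguistic rules:

```python
# Toy sketch of rule-based error generation: each rule maps a grammatical
# token sequence to an ungrammatical one and records the error type.
# The word lists here are invented examples, not CLG's actual rules.

# "Redundant Component": insert a modifier whose meaning the target word
# already contains (e.g., "胜利凯旋" duplicates the sense of "凯旋").
REDUNDANT_MODIFIERS = {"凯旋": "胜利"}

# "Improper Word Order": swap an adverb with the word that follows it.
ADVERBS = {"已经", "非常"}

def inject_redundant_component(tokens):
    for i, tok in enumerate(tokens):
        if tok in REDUNDANT_MODIFIERS:
            noisy = tokens[:i] + [REDUNDANT_MODIFIERS[tok], tok] + tokens[i + 1:]
            return noisy, "Redundant Component"
    return None

def inject_word_order(tokens):
    for i in range(len(tokens) - 1):
        if tokens[i] in ADVERBS:
            noisy = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
            return noisy, "Improper Word Order"
    return None

def generate_pairs(tokens):
    """Apply every applicable rule to one correct sentence, yielding
    (correct, incorrect, error_type) training triples."""
    pairs = []
    for rule in (inject_redundant_component, inject_word_order):
        result = rule(tokens)
        if result is not None:
            noisy, error_type = result
            pairs.append((tokens, noisy, error_type))
    return pairs
```

Running generate_pairs on the tokenized sentence 他们 / 已经 / 凯旋 would yield one Redundant Component pair (inserting 胜利 before 凯旋) and one Improper Word Order pair (swapping 已经 with 凯旋), each carrying its error-type annotation.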
We conduct extensive experiments to demonstrate the effectiveness of CLG and the difficulty of NaCGEC. Quantitative experiments show that a model trained on our generated corpus performs better than one trained on traditional CGEC datasets, and that compared with general data augmentation methods, the training data obtained by CLG brings larger performance improvements. In addition, qualitative analyses illustrate that it is harder for well-educated native Chinese speakers to identify grammatical errors in NaCGEC than in previous benchmarks, which indicates that errors in NaCGEC are closer to the real mistakes native speakers make in daily life. We believe that our proposed corpus generation approach and benchmark can greatly contribute to the development of CGEC methods.
2 Related Work
2.1 CGEC Resources
Compared with English Grammatical Error Correction (EGEC), data resources for CGEC are still lacking. NLPCC (Zhao et al., 2018) provides a test set containing 2K sentences and a large-scale training dataset collected from the Lang-8 website. CGED (Rao et al., 2018, 2020) is an evaluation dataset focusing on error diagnosis, which contains 5K sentences from the HSK corpus (Cui and Zhang, 2011; Zhang and Cui, 2013); the entire HSK corpus can also be utilized for model training. YACLC (Wang et al., 2021) collects and annotates 32K sentences from Lang-8 to construct a CGEC dataset. The latest MuCGEC (Zhang et al., 2022) selects and re-annotates sentences from the NLPCC, CGED, and Lang-8 corpora to obtain a multi-reference evaluation dataset with 7K sentences. To the best of our knowledge, no existing resource focuses on grammatical errors made by native Chinese speakers; all of the above-mentioned datasets originate from errors made by L2 learners.
2.2 CGEC Methods
CGEC can be considered a seq2seq task. Some early works employ CNN-based (Ren et al., 2018) or RNN-based (Zhou et al., 2018) models to resolve the CGEC task. Most later work (Wang et al., 2020a; Tang et al., 2021; Zhao and Wang, 2020) employs the Transformer (Vaswani et al., 2017), which has been a great success in Machine Translation. These studies also propose data augmentation approaches that extend the training data to improve model performance. Recently, researchers have begun to treat CGEC as a seq2edit task that iteratively predicts a modification label for each position of the sentence. Similar to GECToR (Omelianchuk et al., 2020), Liang et al. (2020) utilize a seq2edit model for CGEC. Zhang et al. (2022) directly adopt GECToR for CGEC and enhance it with pretrained language models. TtT (Li and Shi, 2021) proposes a non-autoregressive CGEC approach that employs a BERT (Devlin et al., 2019) encoder with a CRF (Lafferty et al., 2001). Li et al. (2022a) propose a sequence-to-action model that combines the advantages of both seq2seq and seq2edit approaches. Unlike previous works, we focus on the linguistic rules of Chinese grammar and exploit them to automatically obtain high-quality training corpora that improve the performance of CGEC models.
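To make the seq2seq/seq2edit distinction concrete: a seq2edit model predicts one edit tag per source token, and the corrected sentence is reconstructed by applying the tags. The sketch below uses GECToR-style tag names (KEEP, DELETE, APPEND_w, REPLACE_w) as an illustration of the formulation, not any particular system's implementation:

```python
def apply_edits(tokens, tags):
    """Apply one GECToR-style edit tag per source token to obtain the
    corrected token sequence (the tagging model itself is omitted)."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(tok)                    # copy the token unchanged
        elif tag == "DELETE":
            continue                           # drop the token
        elif tag.startswith("APPEND_"):
            out.append(tok)                    # keep the token...
            out.append(tag[len("APPEND_"):])   # ...and insert a word after it
        elif tag.startswith("REPLACE_"):
            out.append(tag[len("REPLACE_"):])  # substitute a word
    return out
```

For instance, apply_edits(["He", "go", "home"], ["KEEP", "REPLACE_goes", "KEEP"]) reconstructs ["He", "goes", "home"]. Because most tags are KEEP, this formulation allows parallel, iterative decoding rather than full autoregressive generation.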
3 Automatic Corpus Generation and
Benchmark Construction
3.1 Schema Definition
According to authoritative linguistic books (Huang and Liao, 2011; Shao, 2016), Chinese grammatical errors are categorized into 7 types: Structural Confusion, Improper Logicality, Missing Component, Redundant Component, Improper Collocation, Improper Word Order, and Ambiguity. It is worth noting that ambiguity errors are often caused by a lack of contextual information, so correcting them requires substantial knowledge beyond grammar, which falls outside the essence of the CGEC task. Therefore, we do not consider this type of error. In addition, there is a class of common errors, i.e., spelling errors, which are mainly caused by confusing characters with similar strokes or pronunciations (Li et al., 2022b). Wang et al. have already comprehensively studied how to automatically generate large-scale training data containing spelling errors, so we likewise do not focus on spelling errors in this study.
From the linguistic point of view, the schema of the remaining 6 error types is explained as follows:
(1) Structural Confusion means mixing two or more different syntactic structures in one sentence, which results in a confusing sentence structure.
(2) Improper Logicality means that the meaning of a sentence is inconsistent or does not conform to objective reasoning.
(3) Missing Component means that the sentence structure is incomplete because some grammatical components are missing.
(4) Redundant Component refers to the addition of unnecessary words or phrases to a well-structured sentence.
(5) Improper Collocation means that the collocation between some components of a sentence does not conform to the structural rules or grammatical conventions of Chinese.
(6) Improper Word Order mainly refers to an ungrammatical order of words or clauses in a sentence.
Example sentences containing various grammatical errors and their corresponding corrections are presented in Table 2. These examples are all selected from our proposed benchmark, which is introduced in Section 3.4.
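Since every generated sample carries an error-type annotation, the schema amounts to a fixed label set. The encoding below is our own illustration (only the six category names come from the schema above; the validity check is a hypothetical helper):

```python
from enum import Enum

class ErrorType(Enum):
    """The six CGEC error categories defined in the schema."""
    STRUCTURAL_CONFUSION = "Structural Confusion"
    IMPROPER_LOGICALITY = "Improper Logicality"
    MISSING_COMPONENT = "Missing Component"
    REDUNDANT_COMPONENT = "Redundant Component"
    IMPROPER_COLLOCATION = "Improper Collocation"
    IMPROPER_WORD_ORDER = "Improper Word Order"

def is_valid_label(label):
    """Reject annotations outside the schema (e.g., the excluded Ambiguity type)."""
    return any(label == t.value for t in ErrorType)
```

A fixed label set like this makes it straightforward to validate annotations and to report per-category statistics over the generated corpus.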
Next, we will further introduce the process of
our proposed CLG and the details of NaCGEC.
3.2 Correct Sentences Collection
To achieve perfect correct/ungrammatical sentence
pairs for model training, we must make sure that
the correct sentences are free of any grammatical
errors as much as possible. However, collecting
and annotating large-scale correct sentences are ex-
tremely time-consuming and expensive processes.
To address this issue, we first accumulate massive
amounts of high-quality Chinese sentences from
public datasets (Xu,2019) such as the Chinese Peo-
ple’s Daily corpus, Chinese machine translation
dataset, and Chinese wiki corpus as the raw cor-
pora. Then we randomly selected 1,000 sentences