Linguistic Rules-Based Corpus Generation for Native Chinese
Grammatical Error Correction
Shirong Ma1, Yinghui Li1, Rongyi Sun1, Qingyu Zhou2, Shuling Huang1, Ding Zhang1,
Yangning Li1, Ruiyang Liu4, Zhongli Li2, Yunbo Cao2, Hai-Tao Zheng1,3, Ying Shen5
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Tencent Cloud Xiaowei, 3Peng Cheng Laboratory
4Department of Computer Science and Technology, Tsinghua University
5School of Intelligent Systems Engineering, Sun Yat-sen University
{masr21,liyinghu20}@mails.tsinghua.edu.cn
Abstract
Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the field faces two major limitations. First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between CGEC models and real applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses demonstrate not only that the training data constructed by our method effectively improves the performance of CGEC models, but also that our benchmark is an excellent resource for the further development of the CGEC field.
1 Introduction
In the field of theoretical linguistics, the native
speaker is the authority of the grammar.
- Noam Chomsky
Chinese Grammatical Error Correction (CGEC) aims to automatically correct grammatical errors that violate language rules and convert noisy input texts into clean output texts (Wang et al., 2020c). In recent years, CGEC has attracted increasing attention from NLP researchers due to its broad applications in all kinds of daily scenarios and downstream tasks (Duan and Hsu, 2011; Kubis et al., 2019; Omelianchuk et al., 2020).

* Equal contribution. Work done during Yinghui's internship at Tencent Cloud Xiaowei.
Corresponding authors: Hai-Tao Zheng and Ying Shen. (E-mail: zheng.haitao@sz.tsinghua.edu.cn, sheny76@mail.sysu.edu.cn)
1 Our dataset and source codes are available at https://github.com/masr2000/CLG-CGEC.
With the progress of deep learning, data-driven methods based on neural networks, e.g., the Transformer (Vaswani et al., 2017), have become the mainstream for CGEC (Zhao and Wang, 2020; Tang et al., 2021; Zhang et al., 2022; Li et al., 2022a). However, we argue that two problems remain in CGEC. (1) For model training, owing to the limited number of real sentences containing grammatical errors, the long-term lack of high-quality annotated training corpora hinders many data-driven models from exercising their capabilities on the CGEC task. (2) For model evaluation, the widely used benchmarks such as NLPCC (Zhao et al., 2018) and CGED (Rao et al., 2018, 2020) are all derived from grammatical errors made by foreign learners of Chinese (i.e., L2 learners) in their process of learning Chinese; the gap between the language usage habits of L2 learners and native Chinese speakers makes the performance of CGEC models in real scenarios unpredictable.
As illustrated in Table 1, the samples of NLPCC and CGED both come from L2 learners, so their sentence structures are relatively short and simple. More crucially, the grammatical errors in these samples are obvious and elementary. In the third example, by contrast, the erroneous sentence is fluent on the whole, which shows that grammatical errors made by native speakers are more subtle: they are in line with the speaking habits of people's daily communication, but they do not conform to linguistic norms. Therefore, in the broader scenarios of Chinese usage beyond foreigners learning Chinese, we believe that erroneous sentences produced by native speakers are more valuable and can better evaluate model performance than errors made by L2 learners.
arXiv:2210.10442v1 [cs.CL] 19 Oct 2022
NLPCC
Incorrect: 冲击
Translation: The news gave me a hit shock.
Correct: 冲击
Translation: The news gave me a big shock.
Source: L2 learners
CGED
Incorrect:
Translation: He was very attracted by the Japanese landscape.
Correct: 深深
Translation: He was deeply attracted by the Japanese landscape.
Source: L2 learners
Native Error
Incorrect: 2017起跑线
Translation: Standing at the starting line of upcoming 2017, we cannot have to help feeling proud and joyful.
Correct: 2017起跑线
Translation: Standing at the starting line of 2017, we cannot help feeling proud and joyful.
Source: Native speakers
Table 1: Examples of grammatical errors from NLPCC, CGED, and native Chinese speakers, respectively.
To alleviate the dilemma of missing large-scale training data, we propose CLG, a novel approach based on Chinese linguistic rules that automatically constructs high-quality ungrammatical sentences from grammatical corpora. Specifically, following authoritative linguistic books (Huang and Liao, 2011; Shao, 2016), we divide Chinese grammatical errors into 6 categories and design detailed grammatical rules to generate erroneous sentences according to the characteristics of each error type. The 6 error types are: Structural Confusion, Improper Logicality, Missing Component, Redundant Component, Improper Collocation, and Improper Word Order. Unlike traditional data augmentation, the ungrammatical sentences generated by CLG more closely match the actual errors that native Chinese speakers make. Benefiting from our proposed CLG, high-quality, large-scale training samples are automatically constructed with annotated error types. Moreover, to fill the gap between existing benchmarks and practical applications, we collect a test dataset containing grammatical errors made by native Chinese speakers in real scenarios, named NaCGEC, which serves as a more challenging benchmark and a meaningful resource to facilitate the further development of CGEC.
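As a concrete illustration, a linguistic rule in this spirit can be viewed as a transformation over tokenized correct sentences that yields an erroneous counterpart plus its error-type label. The rules and word lists below are invented toy examples for illustration only, not CLG's actual linguistic rules:

```python
# Toy sketch of rule-based error generation: each rule maps a grammatical
# token sequence to an ungrammatical one and records the error type.
# The word lists here are invented examples, not CLG's actual rules.

# "Redundant Component": insert a modifier whose meaning the target word
# already contains (e.g., "胜利凯旋" duplicates the sense of "凯旋").
REDUNDANT_MODIFIERS = {"凯旋": "胜利"}

# "Improper Word Order": swap an adverb with the word that follows it.
ADVERBS = {"已经", "非常"}

def inject_redundant_component(tokens):
    for i, tok in enumerate(tokens):
        if tok in REDUNDANT_MODIFIERS:
            noisy = tokens[:i] + [REDUNDANT_MODIFIERS[tok], tok] + tokens[i + 1:]
            return noisy, "Redundant Component"
    return None

def inject_word_order(tokens):
    for i in range(len(tokens) - 1):
        if tokens[i] in ADVERBS:
            noisy = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
            return noisy, "Improper Word Order"
    return None

def generate_pairs(tokens):
    """Apply every applicable rule to one correct sentence, yielding
    (correct, incorrect, error_type) training triples."""
    pairs = []
    for rule in (inject_redundant_component, inject_word_order):
        result = rule(tokens)
        if result is not None:
            noisy, error_type = result
            pairs.append((tokens, noisy, error_type))
    return pairs
```

Running generate_pairs on the tokenized sentence 他们 / 已经 / 凯旋 would yield one Redundant Component pair (inserting 胜利 before 凯旋) and one Improper Word Order pair (swapping 已经 with 凯旋), each carrying its error-type annotation.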
We conduct extensive experiments to demonstrate the effectiveness of CLG and the difficulty of NaCGEC. Quantitative experiments show that a model trained on our generated corpus performs better than one trained on traditional CGEC datasets, and that compared with general data augmentation methods, the training data obtained by CLG brings larger performance improvements. In addition, qualitative analyses illustrate that it is harder for well-educated native Chinese speakers to identify grammatical errors in NaCGEC than in previous benchmarks, which indicates that errors in NaCGEC are closer to the real mistakes native speakers make in daily life. We believe that our proposed corpus generation approach and benchmark can greatly contribute to the development of CGEC methods.
2 Related Work
2.1 CGEC Resources
Compared with English Grammatical Error Correction (EGEC), data resources for CGEC are still lacking. NLPCC (Zhao et al., 2018) provides a test set containing 2K sentences and a large-scale training dataset collected from the Lang-8 website. CGED (Rao et al., 2018, 2020) is an evaluation dataset focusing on error diagnosis, which contains 5K sentences from the HSK corpus (Cui and Zhang, 2011; Zhang and Cui, 2013); the entire HSK corpus can also be utilized for model training. YACLC (Wang et al., 2021) collects and annotates 32K sentences from Lang-8 to construct a CGEC dataset. The latest MuCGEC (Zhang et al., 2022) selects and re-annotates sentences from the NLPCC, CGED, and Lang-8 corpora to obtain a multi-reference evaluation dataset with 7K sentences. To the best of our knowledge, no existing resource focuses on grammatical errors made by native Chinese speakers; all of the above-mentioned datasets originate from errors made by L2 learners.
2.2 CGEC Methods
CGEC can be considered a seq2seq task. Some early works employ CNN-based (Ren et al., 2018) or RNN-based (Zhou et al., 2018) models to resolve the CGEC task. Most later work (Wang et al., 2020a; Tang et al., 2021; Zhao and Wang, 2020) employs the Transformer (Vaswani et al., 2017), which has been a great success in Machine Translation. These studies also propose data augmentation approaches that extend the training data to improve model performance. Recently, researchers have begun to treat CGEC as a seq2edit task that iteratively predicts a modification label for each position of the sentence. Similar to GECToR (Omelianchuk et al., 2020), Liang et al. (2020) utilize a seq2edit model for CGEC. Zhang et al. (2022) directly adopt GECToR for CGEC and enhance it with pretrained language models. TtT (Li and Shi, 2021) proposes a non-autoregressive CGEC approach that employs a BERT (Devlin et al., 2019) encoder with a CRF (Lafferty et al., 2001). Li et al. (2022a) propose a sequence-to-action model that combines the advantages of both seq2seq and seq2edit approaches. Unlike previous works, we focus on the linguistic rules of Chinese grammar and exploit them to automatically obtain high-quality training corpora that improve the performance of CGEC models.
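To make the seq2seq/seq2edit distinction concrete: a seq2edit model predicts one edit tag per source token, and the corrected sentence is reconstructed by applying the tags. The sketch below uses GECToR-style tag names (KEEP, DELETE, APPEND_w, REPLACE_w) as an illustration of the formulation, not any particular system's implementation:

```python
def apply_edits(tokens, tags):
    """Apply one GECToR-style edit tag per source token to obtain the
    corrected token sequence (the tagging model itself is omitted)."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(tok)                    # copy the token unchanged
        elif tag == "DELETE":
            continue                           # drop the token
        elif tag.startswith("APPEND_"):
            out.append(tok)                    # keep the token...
            out.append(tag[len("APPEND_"):])   # ...and insert a word after it
        elif tag.startswith("REPLACE_"):
            out.append(tag[len("REPLACE_"):])  # substitute a word
    return out
```

For instance, apply_edits(["He", "go", "home"], ["KEEP", "REPLACE_goes", "KEEP"]) reconstructs ["He", "goes", "home"]. Because most tags are KEEP, this formulation allows parallel, iterative decoding rather than full autoregressive generation.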
3 Automatic Corpus Generation and
Benchmark Construction
3.1 Schema Definition
According to authoritative linguistic books (Huang and Liao, 2011; Shao, 2016), Chinese grammatical errors are categorized into 7 types: Structural Confusion, Improper Logicality, Missing Component, Redundant Component, Improper Collocation, Improper Word Order, and Ambiguity. It is worth noting that ambiguity errors are often caused by a lack of contextual information, so correcting them requires substantial knowledge beyond grammar, which falls outside the essence of the CGEC task. Therefore, we do not consider this type of error. In addition, there is a class of common errors, i.e., spelling errors, which are mainly caused by confusing characters with similar strokes or pronunciations (Li et al., 2022b). Wang et al. have already comprehensively studied how to automatically generate large-scale training data containing spelling errors, so we likewise do not focus on spelling errors in this study.
From the linguistic point of view, the schema of the remaining 6 error types is explained as follows:
(1) Structural Confusion means mixing two or more different syntactic structures in one sentence, which results in a confusing sentence structure.
(2) Improper Logicality means that the meaning of a sentence is inconsistent or does not conform to objective reasoning.
(3) Missing Component means that the sentence structure is incomplete because some grammatical components are missing.
(4) Redundant Component refers to the addition of unnecessary words or phrases to a well-structured sentence.
(5) Improper Collocation means that the collocation between some components of a sentence does not conform to the structural rules or grammatical conventions of Chinese.
(6) Improper Word Order mainly refers to an ungrammatical order of words or clauses in a sentence.
Example sentences containing various grammatical errors and their corresponding corrections are presented in Table 2. These examples are all selected from our proposed benchmark, which is introduced in Section 3.4.
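Since every generated sample carries an error-type annotation, the schema amounts to a fixed label set. The encoding below is our own illustration (only the six category names come from the schema above; the validity check is a hypothetical helper):

```python
from enum import Enum

class ErrorType(Enum):
    """The six CGEC error categories defined in the schema."""
    STRUCTURAL_CONFUSION = "Structural Confusion"
    IMPROPER_LOGICALITY = "Improper Logicality"
    MISSING_COMPONENT = "Missing Component"
    REDUNDANT_COMPONENT = "Redundant Component"
    IMPROPER_COLLOCATION = "Improper Collocation"
    IMPROPER_WORD_ORDER = "Improper Word Order"

def is_valid_label(label):
    """Reject annotations outside the schema (e.g., the excluded Ambiguity type)."""
    return any(label == t.value for t in ErrorType)
```

A fixed label set like this makes it straightforward to validate annotations and to report per-category statistics over the generated corpus.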
Next, we will further introduce the process of
our proposed CLG and the details of NaCGEC.
3.2 Correct Sentences Collection
To achieve perfect correct/ungrammatical sentence
pairs for model training, we must make sure that
the correct sentences are free of any grammatical
errors as much as possible. However, collecting
and annotating large-scale correct sentences are ex-
tremely time-consuming and expensive processes.
To address this issue, we first accumulate massive
amounts of high-quality Chinese sentences from
public datasets (Xu,2019) such as the Chinese Peo-
ple’s Daily corpus, Chinese machine translation
dataset, and Chinese wiki corpus as the raw cor-
pora. Then we randomly selected 1,000 sentences