
FOCUS IS WHAT YOU NEED FOR CHINESE GRAMMATICAL ERROR CORRECTION
Jingheng Ye1, Yinghui Li1, Shirong Ma1, Rui Xie2, Wei Wu2, Hai-Tao Zheng1,3∗
1Shenzhen International Graduate School, Tsinghua University
2Meituan, 3Peng Cheng Laboratory
ABSTRACT
Chinese Grammatical Error Correction (CGEC) aims to automatically detect and correct grammatical errors in Chinese text. For a long time, researchers have regarded CGEC as a task with a certain degree of uncertainty, that is, an ungrammatical sentence may often have multiple valid references. However, we argue that even though this is a very reasonable hypothesis, it is too harsh for the intelligence of mainstream models in this era. In this paper, we first discover that multiple references do not actually bring positive gains to model training. On the contrary, it is beneficial to a CGEC model if it can pay attention to small but essential data during the training process. Furthermore, we propose a simple yet effective training strategy called ONETARGET to improve the focus ability of CGEC models and thus improve CGEC performance. Extensive experiments and detailed analyses demonstrate the correctness of our discovery and the effectiveness of our proposed method.
Index Terms—Natural Language Processing, Chinese Grammatical Error Correction, Multi-Reference, ONETARGET
1. INTRODUCTION
Grammatical Error Correction (GEC) aims to correct all grammatical errors in a given ungrammatical text, under the constraint of preserving the original semantics as much as possible [1]. The GEC task has attracted more and more attention due to its practical value in daily life, and it has been widely applied in machine translation [2], spoken language [3], and user-centric applications [4]. In the Chinese natural language processing community, Chinese Grammatical Error Correction (CGEC) also plays an important role [5, 6].
In the long-term development of CGEC, multi-reference has always been an important setting that cannot be ignored, owing to the uncertainty and subjectivity of the CGEC task [7]. Taking Table 1 as an example, for an ungrammatical source sentence, there often exist two or more grammatically correct and semantically unchanged reference sentences. We show that Lang8 1, a widely used CGEC training dataset, contains many multi-reference samples, i.e., multiple different target sentences for the same ungrammatical source sentence. In addition, the multi-reference setting is also an important aspect of the recently proposed CGEC benchmark MuCGEC [8]. But is this setting always perfect and necessary?
Dialectically, we argue that the multi-reference setting is reasonable for better evaluation quality, but it introduces more uncertainty into the training process of the model. In particular, during CGEC dataset annotation, it is difficult to decide which reference is the best among multiple equally acceptable corrections. Therefore, the compatibility brought by the multi-reference setting can guarantee a more realistic evaluation of model performance [9].
* Corresponding author. (E-mail: zheng.haitao@sz.tsinghua.edu.cn)
1http://tcci.ccf.org.cn/conference/2018/taskdata.php
Source 我能胜任这此职务
       (I am competent for this the position.)
Ref. 1 我能胜任这职务。
       (I am competent for this position.)
Ref. 2 我能胜任此职务。
       (I am competent for this position.)
Table 1. An example of CGEC with multiple references. Ref. 1 deletes the redundant "此" and Ref. 2 deletes the redundant "这"; both insert the missing sentence-final "。".
But every coin has two sides: we find that more correction uncertainty exists in multi-reference training samples, which could confuse models during training. We believe the main reason a model is confused by multiple references is that the intelligence of mainstream models in this era is not enough to let them effectively distinguish between different references, whereas human language learners can easily establish complex connections between these references and thereby improve their writing skills.
Based on the above observations and intuitions, we investigate how multi-reference samples influence the performance of CGEC models. Specifically, training models with multi-reference samples can be treated as a multi-label classification problem (for Seq2Edit-based models) or a multi-target generation problem (for Seq2Seq-based models), which dilutes the prediction probability mass of the models. Consequently, the models become confused and get stuck in a dilemma unless additional mechanisms are introduced.
To remedy this problem, we propose a simple yet effective and model-agnostic training strategy called ONETARGET. For a source sentence with multiple candidate references, ONETARGET picks out only one reference according to different strategies, and it keeps one-reference samples as they are. Though it fine-tunes CGEC models with fewer references, ONETARGET gives the models a better focus ability and improves their performance; this is the meaning of our title, "focus is what you need for CGEC".
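As a rough illustration of this filtering step, the following Python sketch (our own simplification: the function names and the similarity-based selection criterion are hypothetical, standing in for the selection strategies described later) groups training pairs by source sentence and keeps exactly one reference per source:

from collections import defaultdict
import difflib

def similarity(src: str, ref: str) -> float:
    # Hypothetical selection criterion: prefer the reference that stays
    # closest to the source, i.e., the most conservative correction.
    return difflib.SequenceMatcher(None, src, ref).ratio()

def one_target(pairs):
    # Keep exactly one reference per source sentence. `pairs` is a list
    # of (source, reference) samples; one-reference samples pass through
    # unchanged, and multi-reference sources keep the highest-scoring
    # reference under the criterion above.
    by_source = defaultdict(list)
    for src, ref in pairs:
        by_source[src].append(ref)
    return [(src, max(refs, key=lambda r: similarity(src, r)))
            for src, refs in by_source.items()]

# Example with the two references from Table 1:
pairs = [("我能胜任这此职务", "我能胜任这职务。"),
         ("我能胜任这此职务", "我能胜任此职务。")]
print(one_target(pairs))  # exactly one (source, reference) pair remains

Which reference survives depends entirely on the chosen criterion; the point is only that each source sentence contributes a single, unambiguous training target.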
We identify distinct advantages of training with our proposed ONETARGET: (1) Training models using only one-reference samples is more efficient, since the cleaned datasets contain only 50-60% of the paired samples of the original datasets. (2) Filtering out unnecessary references achieves better performance, and this holds for both Seq2Edit-based and Seq2Seq-based models.
To summarize, the contributions of our work are three-fold: (1) We are the first to observe and focus on the negative impact of the multi-reference setting on the training of CGEC models. (2) We propose ONETARGET to construct a smaller but more refined dataset, which not only speeds up training but also improves the performance of CGEC models. (3) We conduct extensive experiments and detailed analyses on MuCGEC and achieve state-of-the-art performance.