FOCUS IS WHAT YOU NEED FOR CHINESE GRAMMATICAL ERROR CORRECTION
Jingheng Ye1, Yinghui Li1, Shirong Ma1, Rui Xie2, Wei Wu2, Hai-Tao Zheng1,3
1Shenzhen International Graduate School, Tsinghua University
2Meituan, 3Peng Cheng Laboratory
ABSTRACT
Chinese Grammatical Error Correction (CGEC) aims to automatically detect and correct grammatical errors in Chinese text. Researchers have long regarded CGEC as a task with a certain degree of uncertainty; that is, an ungrammatical sentence often has multiple valid references. However, we argue that although this hypothesis is reasonable, it is too demanding for the capabilities of today's mainstream models. In this paper, we first show that multiple references do not actually bring positive gains to model training. On the contrary, a CGEC model benefits from attending to a small but essential subset of the data during training. Furthermore, we propose a simple yet effective training strategy called ONETARGET to improve the focus ability of CGEC models and thus their CGEC performance. Extensive experiments and detailed analyses support our finding and demonstrate the effectiveness of the proposed method.
Index Terms: Natural Language Processing, Chinese Grammatical Error Correction, Multi-Reference, ONETARGET
1. INTRODUCTION
Grammatical Error Correction (GEC) aims to correct all grammatical errors in a given ungrammatical text while preserving the original meaning as far as possible [1]. The GEC task has attracted growing attention because of its practical value in daily life, and it has been widely applied in machine translation [2], spoken language [3], and user-centric applications [4]. In the Chinese natural language processing community, Chinese Grammatical Error Correction (CGEC) also plays an important role [5, 6].
Throughout the development of CGEC, the multi-reference setting has been important because of the uncertainty and subjectivity of the task [7]. As the example in Table 1 shows, an ungrammatical source sentence often has two or more reference sentences that are grammatically correct and preserve its meaning. We show that Lang8¹, a widely used CGEC training dataset, contains many multi-reference samples, i.e., multiple different target sentences for the same ungrammatical source sentence. In addition, the multi-reference setting is an important aspect of the recently proposed CGEC benchmark MuCGEC [8]. But is this setting always ideal and necessary?
We argue that the multi-reference setting is reasonable for evaluation, where it improves quality, but that it introduces more uncertainty into model training. In particular, during CGEC dataset annotation it is difficult to decide which reference is best among multiple equally acceptable corrections. The tolerance provided by the multi-reference setting therefore ensures a more realistic evaluation of model performance [9]. However, every coin has two sides: we find that multi-reference training samples carry more correction uncertainty, which can confuse models during training. We believe the main reason is that today's mainstream models are not yet capable enough to distinguish effectively between different references, whereas human language learners can easily establish complex connections between these references and use them to improve their writing skills.

Source  能胜
        I am competent for this the position.
Ref. 1  能胜
        I am competent for this the position.
Ref. 2  能胜
        I am competent for this the position.
Table 1. An example of CGEC. The corrected part is marked.

* Corresponding author. (E-mail: zheng.haitao@sz.tsinghua.edu.cn)
¹ http://tcci.ccf.org.cn/conference/2018/taskdata.php
Based on these observations and intuitions, we investigate how multi-reference samples influence the performance of CGEC models. Specifically, training on multi-reference samples can be treated as a multi-label classification problem (for Seq2Edit-based models) or a multi-target generation problem (for Seq2Seq-based models), either of which dilutes the probability mass of the model's predictions. Consequently, without additional mechanisms, the model is confused and faces a dilemma over which reference to predict. To remedy this problem, we propose a simple yet effective, model-agnostic training strategy called ONETARGET. For a source sentence with multiple candidate references, ONETARGET selects only one reference according to a chosen strategy; one-reference samples are kept as they are. Although the CGEC models are fine-tuned with fewer references, ONETARGET gives them better focus ability and improves their performance. This is what our title means by "focus".
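The dilution of probability mass can be made concrete with a short bound. This is an illustrative sketch rather than a derivation taken from the paper: it assumes a source sentence $x$ with $k$ distinct references $y^{(1)}, \dots, y^{(k)}$, weighted equally under a sequence-level cross-entropy loss. The symbols $k$, $p_\theta$, and $\mathcal{L}$ are introduced here for illustration only.

```latex
% Assumption: the k references are distinct, so their probabilities sum to at most 1.
\mathcal{L}(x)
  = -\frac{1}{k}\sum_{i=1}^{k} \log p_\theta\bigl(y^{(i)} \mid x\bigr)
  \;\ge\; -\log\Bigl(\frac{1}{k}\sum_{i=1}^{k} p_\theta\bigl(y^{(i)} \mid x\bigr)\Bigr)
  \;\ge\; \log k ,
\qquad \text{since } \sum_{i=1}^{k} p_\theta\bigl(y^{(i)} \mid x\bigr) \le 1 .
```

The first step is Jensen's inequality for the concave logarithm. Under these assumptions the loss on $x$ cannot fall below $\log k$, and the bound is met only when every reference receives probability $1/k$. With $k = 2$, each reference receives only 0.5 at the optimum, whereas a one-reference sample ($k = 1$) allows the loss to approach zero.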
We identify two distinct advantages of training with ONETARGET: (1) Training with only one-reference samples is more efficient, since the cleaned datasets contain only 50-60% of the paired samples in the original datasets. (2) Filtering out unnecessary references yields better performance, and this holds for both the Seq2Edit-based model and the Seq2Seq-based model.
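The filtering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: this section does not specify the selection strategies, so the three used here ("first", "random", and "most_similar", the last based on `difflib.SequenceMatcher`) are hypothetical stand-ins, and the function name `one_target` and the English toy sentences are invented for the example.

```python
import random
from collections import OrderedDict
from difflib import SequenceMatcher


def one_target(pairs, strategy="most_similar", seed=0):
    """Keep exactly one target per source sentence.

    `pairs` is a list of (source, target) tuples. The strategies are
    hypothetical examples, not the ones defined in the paper.
    """
    rng = random.Random(seed)

    # Group distinct targets by source, preserving first-seen order.
    grouped = OrderedDict()
    for source, target in pairs:
        refs = grouped.setdefault(source, [])
        if target not in refs:
            refs.append(target)

    kept = []
    for source, refs in grouped.items():
        if len(refs) == 1 or strategy == "first":
            # One-reference samples are kept as they are.
            choice = refs[0]
        elif strategy == "random":
            choice = rng.choice(refs)
        elif strategy == "most_similar":
            # Assumption: prefer the reference closest to the source string.
            choice = max(
                refs, key=lambda r: SequenceMatcher(None, source, r).ratio()
            )
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        kept.append((source, choice))
    return kept


# Toy data invented for illustration (English stand-ins for CGEC pairs).
toy_pairs = [
    ("I has a apple", "I have an apple"),
    ("I has a apple", "I have got one apple"),
    ("He go home", "He goes home"),
]
cleaned = one_target(toy_pairs, strategy="most_similar")
kept_fraction = len(cleaned) / len(toy_pairs)
print(cleaned)
print(f"kept {kept_fraction:.0%} of the paired samples")
```

Because the function only rewrites the training pairs, it can be applied before fine-tuning either a Seq2Edit-based or a Seq2Seq-based model, which reflects the model-agnostic nature of the strategy.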
In summary, the contributions of our work are three-fold: (1) We are the first to observe and focus on the negative impact of the multi-reference setting on the training of CGEC models. (2) We propose ONETARGET to construct a smaller but more refined dataset, which not only speeds up training but also improves the performance of CGEC models. (3) We conduct extensive experiments and detailed analyses on MuCGEC and achieve state-of-the-art performance.
arXiv:2210.12692v3 [cs.CL] 27 Oct 2022