SynGEC: Syntax-Enhanced Grammatical Error Correction
with a Tailored GEC-Oriented Parser
Yue Zhang1, Bo Zhang2, Zhenghua Li1*, Zuyi Bao2, Chen Li2, Min Zhang1
1Institute of Artificial Intelligence, School of Computer Science and Technology,
Soochow University, China; 2DAMO Academy, Alibaba Group, China
1yzhang21@stu.suda.edu.cn,1{zhli13,minzhang}@suda.edu.cn
2{klayzhang.zb,zuyi.bzy,puji.lc}@alibaba-inc.com
Abstract
This work proposes a syntax-enhanced grammatical error correction (GEC) approach named SynGEC that effectively incorporates dependency syntactic information into the encoder part of GEC models.1 The key challenge for this idea is that off-the-shelf parsers are unreliable when processing ungrammatical sentences. To confront this challenge, we propose to build a tailored GEC-oriented parser (GOPar) using parallel GEC training data as a pivot. First, we design an extended syntax representation scheme that allows us to represent both grammatical errors and syntax in a unified tree structure. Then, we obtain parse trees of the source incorrect sentences by projecting trees of the target correct sentences. Finally, we train GOPar on such projected trees. For GEC, we employ a graph convolutional network to encode source-side syntactic information produced by GOPar, and fuse it with the outputs of the Transformer encoder. Experiments on mainstream English and Chinese GEC datasets show that our proposed SynGEC approach consistently and substantially outperforms strong baselines and achieves competitive performance. Our code and data are all publicly available at https://github.com/HillZhang1999/SynGEC.
1 Introduction
Given an ungrammatical sentence, the grammat-
ical error correction (GEC) task aims to produce
a grammatical target sentence with the intended
meaning (Grundkiewicz et al.,2020;Wang et al.,
2021). Recent mainstream approaches treat GEC
as a monolingual machine translation (MT) task
(Yuan and Briscoe,2016;Junczys-Dowmunt et al.,
2018). Standard encoder-decoder based MT mod-
els, e.g., Transformer (Vaswani et al.,2017), have
*Corresponding author.
1 Although this work focuses on the dependency syntax structure, SynGEC can also be extended to the constituency syntax structure straightforwardly.
emerged as a dominant paradigm and achieved
state-of-the-art (SOTA) results on various GEC
benchmarks (Rothe et al.,2021;Stahlberg and Ku-
mar,2021;Sun et al.,2022;Zhang et al.,2022).
Despite their impressive achievements, most work
treats the input sentence as a sequence of tokens,
without explicitly exploiting syntactic or semantic
information.
Compared with MT, GEC has two peculiarities
that directly motivate this work. First, the training
data for GEC models is much less abundant, which
may be alleviated by incorporating linguistic struc-
ture knowledge like syntax. As shown in Table 1
and Table 5, the English and Chinese GEC tasks
only have about 126K and 157K high-quality la-
belled source/target sentence pairs for training, if
not considering the highly noisy crowd-annotated
Lang8 data (Mita et al.,2020). Second, according
to our preliminary observation, many errors in un-
grammatical sentences are intrinsically correlated
with syntactic information. For example, errors like
inconsistency in tense or singular-vs-plural forms
can be better detected and corrected with the help
of long-range syntactic dependencies.
In this paper, we propose SynGEC, an approach
that can effectively inject the syntactic structure of
the input sentence into the encoder part of GEC
models. The critical challenge here is that off-the-
shelf parsers are unreliable when handling ungram-
matical sentences. On the one hand, off-the-shelf
parsers are trained on clean treebanks that only
consist of grammatical sentences. When parsing
ungrammatical sentences, their performance may
sharply degrade due to the input mismatch. On
the other hand, mainstream syntax representation
schemes, adopted by existing treebanks, do not
cover the non-canonical structures arising from
grammatical errors. In consequence, under such
schemes, it is sometimes difficult to find a plausible
syntactic tree to properly parse an ungrammatical
sentence (e.g., the sentence in Figure 1(d)).
arXiv:2210.12484v1 [cs.CL] 22 Oct 2022
[Figure 1 shows four dependency trees over the sentence "But there were no buyers ." and erroneous variants of it: (a) the correct sentence; (b) a substituted error ("Bat"), whose incoming arc is labelled "S"; (c) a redundant error ("any"), whose arc is labelled "R"; (d) a missing error (the word "buyers" is dropped), marked by an "M" arc.]

Figure 1: Illustration of our extended syntax representation scheme. ∅ denotes the missing word.
Indeed, several prior works have tried to improve syntactic parsing for ungrammatical texts by annotating data (Dickinson and Ragheb, 2009; Berzak et al., 2016; Nagata and Sakaguchi, 2016). However, these works do not extend existing syntax representation schemes to accommodate errors, meaning they make few changes to the original syntactic label sets. Besides, manual annotation is expensive and time-consuming, so their annotated treebanks for ungrammatical sentences are of a relatively small scale.
To confront the challenge of unreliable perfor-
mance of off-the-shelf parsers on ungrammatical
sentences, we propose to train a tailored GEC-
oriented parser (GOPar). The basic idea is to utilize
parallel source/target sentence pairs in the GEC
training data. First, we parse the target correct sen-
tences using a vanilla off-the-shelf parser. Then, we
construct the tree for the source incorrect sentences
via tree projection. To accommodate grammatical
errors, we propose an extended syntax representa-
tion scheme based on several straightforward rules,
which allows us to represent both grammatical er-
rors and syntax in a unified tree structure. Finally,
we train GOPar directly on the automatically con-
structed trees of the source incorrect sentences in
the GEC training data. During both GEC training
and evaluation procedures, GOPar is used to gener-
ate syntactic information for the input sentences.
To incorporate syntactic information provided
by GOPar, we cascade several label-aware graph
convolutional network (GCN) layers (Kipf and
Welling,2017;Zhang et al.,2020a) above the
encoder of our baseline Transformer-based GEC
model. We conduct experiments on two widely-
used English GEC evaluation datasets, i.e., CoNLL-
14 (Ng et al.,2014) and BEA-19 (Bryant et al.,
2019), and two Chinese GEC evaluation datasets,
i.e., NLPCC-18 (Zhao et al.,2018) and MuCGEC
(Zhang et al.,2022). Extensive experimental re-
sults and in-depth analyses show that our SynGEC
approach achieves consistent and substantial im-
provement on all datasets, even when the baseline
model is enhanced with large pre-trained language
models (PLMs) like BART (Lewis et al.,2020),
and outperforms previous SOTA systems under
comparable settings.
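As a concrete illustration of the label-aware GCN fusion described above, here is a minimal NumPy sketch. This is our own toy illustration of the idea, not the authors' implementation: token states from the Transformer encoder are updated along labeled dependency arcs, with one weight matrix per dependency label, and the result can then be fused back with the encoder outputs.

```python
import numpy as np

def label_aware_gcn_layer(H, arcs, W_label, W_self):
    """One propagation step: each token state is updated from its
    syntactic neighbours, with a separate weight matrix per dependency
    label, plus a self-loop transform, followed by a ReLU."""
    out = H @ W_self                          # self loop
    for head, dep, label in arcs:             # labeled dependency arcs
        W = W_label[label]
        out[dep] = out[dep] + H[head] @ W     # head -> dependent
        out[head] = out[head] + H[dep] @ W.T  # dependent -> head
    return np.maximum(out, 0.0)               # ReLU

# Toy usage: 3 tokens with 4-dim encoder states and two labeled arcs.
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))                   # Transformer encoder outputs
W_label = {lab: rng.normal(size=(4, 4)) for lab in ("nsubj", "S")}
fused = label_aware_gcn_layer(H, [(1, 0, "nsubj"), (1, 2, "S")], W_label, np.eye(4))
```

In practice such layers would be learned end-to-end and combined with the encoder via residual or gated fusion; the sketch only shows the label-conditioned message passing.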
2 Our GEC-Oriented Parser
This section describes our tailored GOPar, a de-
pendency parser that is more competent in parsing
ungrammatical sentences than off-the-shelf parsers.
2.1 Extended Syntax Representation Scheme
The standard scheme for representing dependency
syntax is originally designed for grammatical sen-
tences, and thus may not cover many non-canonical
structures in grammatically erroneous sentences.
Therefore, to obtain a tailored parser, our first task
is to extend the syntax representation scheme and,
more specifically, to design a complementary set
of rules to handle different grammatical mistakes.
With this scheme, we can directly use a unified tree
structure to represent both grammatical errors and
syntactic information.
As shown in Figure 1, we propose a light-weight extended scheme based on several straightforward rules, corresponding to the three types of grammatical errors, i.e., substituted, redundant and missing (Bryant et al., 2017).2 In this work, we use the Stanford Dependencies Scheme v3.3.0 (Dozat and Manning, 2017) as the basic scheme. The rules
are designed in such a way that we make as few
adjustments as possible to the syntactic tree of the
target correct sentence during tree projection. Cor-
respondingly, we add three labels into the original
syntactic label set, i.e., “S”, “R” and “M”, to cap-
ture three kinds of errors. Since such categorization
is also adopted in the grammatical error detection
(GED) task (Yuan et al.,2021), we refer to them as
GED labels.
2 We treat word-order errors as the combination of redundant and missing errors in this work.
[Figure 2 illustrates the workflow on an example pair from the parallel GEC data: the ungrammatical source sentence xi "Privicy protection belongs to human rights ." and the grammatical target sentence yi "Privacy belongs to human rights .". The off-the-shelf parser parses yi; aligning the pair yields the extracted errors "A 0 1|||S|||Privacy|||REQUIRED|||-NONE-|||1" and "A 1 2|||R|||||REQUIRED|||-NONE-|||1"; projecting the target-side tree yields the dependency tree of xi (with GED labels "S" and "R" on the erroneous words); the projected trees are then used to train the tailored GOPar.]

Figure 2: The workflow for obtaining our tailored GOPar.
Substituted errors (S) include spelling errors, tense errors, singular/plural inconsistency errors, etc. For simplicity, we do not consider such fine-grained categories, and directly use a single "S" label to indicate that the word should be replaced by another one, as shown in Figure 1(b).

Redundant errors (R) mean that some words should be deleted. For each redundant word, we let it depend on its right-side adjacent word, with a label "R",3 as shown in Figure 1(c). When the redundant word is at the end of the sentence, we instead let it depend on its left-side adjacent word.

Missing errors (M) mean that some words should be inserted. For each missing word, we assign a label "M" to the incoming arc of its right-side adjacent word, as shown in Figure 1(d). When the missing word is at the end of the sentence, we keep the original tree unchanged. If several consecutive words are missing, the structure remains the same as when a single word is missing. Moreover, since a missing word may have children in the tree of the correct sentence, we let them depend on the head word of the missing word, without changing their syntactic labels.
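The three rules can be sketched in a few lines of toy code. The representation (a list of (head, label) pairs, 0-indexed, with -1 for the pseudo-root) and the function names are ours, chosen purely for illustration:

```python
ROOT = -1  # pseudo-root head index

def mark_substituted(tree, i):
    # Rule S: keep the projected head, relabel the arc "S".
    head, _ = tree[i]
    tree[i] = (head, "S")

def mark_redundant(tree, i):
    # Rule R: attach the redundant word to its right neighbour
    # (or its left neighbour if it is sentence-final), label "R".
    head = i + 1 if i + 1 < len(tree) else i - 1
    tree[i] = (head, "R")

def mark_missing_right_neighbour(tree, i):
    # Rule M: the word just right of the gap keeps its head, but its
    # incoming arc is relabelled "M".
    head, _ = tree[i]
    tree[i] = (head, "M")

# Figure 1(b): "Bat there were no buyers ." -- token 0 is substituted.
tree = [(2, "cc"), (2, "expl"), (ROOT, "root"),
        (4, "det"), (2, "nsubj"), (2, "punct")]
mark_substituted(tree, 0)
# tree[0] is now (2, "S")
```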
Limitation discussion. Our extended scheme may encounter problems when different types of errors occur consecutively. Taking "But was no buyers" as an example, we need to replace "was" with "were" and insert "there" before "were" at the same time. Therefore, according to our rules, the label of the incoming arc of "was" can be either "S" or "M", leading to a label conflict. To decide a unique label, we simply define a priority order: "S" > "R" > "M". Overall, the current version of our scheme is imperfect, and there are still many points that can be improved. For example, when confronting substituted and redundant errors, some original labels will be overwritten by the GED labels "S" and "R", which may cause the loss of some valuable information. One possible solution is to combine GED and syntax labels and use joint labels like "S-Root", "M-Subj", etc. We leave such extensions of our scheme as future work.

3 Even if the right-side adjacent word is redundant.
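The conflict-resolution rule amounts to picking the highest-priority candidate label; a one-line sketch (names ours):

```python
# Priority order S > R > M for resolving GED-label conflicts on one arc.
PRIORITY = {"S": 3, "R": 2, "M": 1}

def resolve(candidate_labels):
    """Keep the highest-priority GED label among the candidates."""
    return max(candidate_labels, key=PRIORITY.__getitem__)

resolve(["M", "S"])  # -> "S", as in the "But was no buyers" example
```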
2.2 Training GOPar

With the extended syntax representation scheme, we propose to train our tailored GOPar by using the parallel GEC training data D = {(xi, yi)} as a pivot. The major goal is to automatically generate high-quality parse trees for large-scale sentences with realistic grammatical errors, and use them to train a parser suitable for parsing ungrammatical sentences. Figure 2 illustrates the workflow, consisting of the following four steps.

First, we use an off-the-shelf parser to parse the target correct sentences (i.e., yi) of the GEC training data. The off-the-shelf parser can produce reliable parse trees for target-side sentences since they are (ideally) free from grammatical errors.
Second, we employ ERRANT (Bryant et al., 2017)4 to extract all grammatical errors in the source incorrect sentence (i.e., xi) according to the alignments between xi and yi. The errors extracted by ERRANT mainly contain three parts: the start and end positions of errors in source sentences, the corresponding corrections, and the error types.
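These edits come in ERRANT's M2 format, where each "A" line packs the token span, error type, and correction into "|||"-separated fields. A minimal reader might look like the sketch below; note that ERRANT's native types are fine-grained (e.g. "R:SPELL"), and the coarse S/R/M labels used here are this paper's mapping:

```python
def parse_m2_edit(line):
    """Read one M2 annotation line, e.g.
    'A 0 1|||S|||Privacy|||REQUIRED|||-NONE-|||1'."""
    fields = line.split("|||")
    _, start, end = fields[0].split()          # "A <start> <end>"
    error_type, correction = fields[1], fields[2]
    return int(start), int(end), error_type, correction

parse_m2_edit("A 0 1|||S|||Privacy|||REQUIRED|||-NONE-|||1")
# -> (0, 1, 'S', 'Privacy')
```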
Third, we construct the tree of xi by projecting the target-side tree of yi to the source side. For words that are not related to any errors, dependencies and labels are directly copied; for those related to errors, dependencies and labels are assigned according to the rules introduced in Section 2.1.

Fourth, with constructed parse trees for all source-side sentences in D, we then use them as a
4 https://github.com/chrisjbryant/errant
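The four steps above can be summarized in a schematic sketch. All three callables are placeholders standing in for the off-the-shelf parser, ERRANT, and the rule-based projection of Section 2.1; this is not the paper's code:

```python
def build_gopar_treebank(pairs, parse, extract_errors, project):
    """Assemble projected trees for every source sentence in the
    parallel GEC data; the resulting treebank is GOPar's training set."""
    treebank = []
    for src, tgt in pairs:                     # parallel data D
        tgt_tree = parse(tgt)                  # step 1: parse correct side
        edits = extract_errors(src, tgt)       # step 2: ERRANT extraction
        treebank.append((src, project(tgt_tree, src, edits)))  # step 3
    return treebank                            # step 4: train GOPar on this
```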