Towards standardizing Korean Grammatical Error Correction:
Datasets and Annotation
Soyoung Yoon
KAIST AI
soyoungyoon@kaist.ac.kr
Sungjoon Park
SoftlyAI Research, SoftlyAI
sungjoon.park@softly.ai
Gyuwan Kim
UCSB
gyuwankim@ucsb.edu
Junhee Cho
Google
junheecho@google.com
Kihyo Park
Cornell University
kp467@cornell.edu
Gyutae Kim
SoftlyAI
gt.kim@softly.ai
Minjoon Seo
KAIST AI
minjoon@kaist.ac.kr
Alice Oh
KAIST SC
alice.oh@kaist.edu
Abstract
Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English. We attribute this problematic circumstance to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to build an evaluation benchmark for Korean, and present baseline models trained on our datasets. We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.1
1 Introduction
Writing grammatically correct Korean sentences is difficult for learners studying Korean as a Foreign Language (KFL), and even for native Korean speakers, due to morphological and orthographical complexity involving particles, spelling, and collocation. Korean's word spacing rules are complex, with many domain-dependent exceptions that only around 20% of native speakers understand thoroughly (Lee, 2014). Since Korean is an agglutinative language (Sohn, 2001; Song, 2006), getting used to Korean grammar is time-consuming for KFL learners whose mother tongue is non-
agglutinative (Haupt et al., 2017; Kim, 2020). However, despite the growing number of KFL learners (Lee, 2018), little research has been conducted on Korean Grammatical Error Correction (GEC) because of the previously described difficulties of the Korean language. Another major obstacle to developing a Korean GEC system is the lack of resources for training a machine learning model.

University of California, Santa Barbara
School of Computing, KAIST
1Code for model checkpoints and KAGAS: https://github.com/soyoung97/Standard_Korean_GEC. Dataset request form: https://forms.gle/kF9pvJbLGvnh8ZnQ6
In this paper, we propose three datasets that cover various grammatical errors from different types of annotators and learners. The first dataset, Kor-Native, is crowd-sourced from native Korean speakers. The second, Kor-Learner, comes from KFL learners and consists of essays with detailed corrections and annotations by Korean tutors. The third, Kor-Lang8, is similar to Kor-Learner, except that its sentences, produced by KFL learners, are corrected by native Koreans on social platforms who are not necessarily linguistic experts. We also analyze our datasets in terms of error type distributions.
While our proposed parallel corpora can serve as a valuable resource for training a machine learning model, another concern is the annotation of the datasets. Most existing datasets have no annotation, which makes them hard to use for evaluation. The major weaknesses of human annotation are that (1) experts specialized in Korean grammar are expensive to hire, (2) having them annotate a large number of parallel corpora is not scalable, and (3) the error types and schema differ across datasets and annotators, which is counterproductive. Another way to analyze and evaluate the datasets is automatic annotation from parallel corpora. While such a system exists for English, called ERRANT (Bryant et al., 2017), there is no automatic error type detection system for Korean. We cannot fully demonstrate and classify error types and edits by using ERRANT, because Korean has many characteristics different from English (Section 4.5). This motivates us to develop an automated error annotation system for Korean (KAGAS), along with annotated error types of refined corpora using KAGAS.

arXiv:2210.14389v3 [cs.CL] 24 May 2023
Lastly, we build a simple yet effective baseline model based on BART (Lewis et al., 2019) trained on our datasets. We further analyze the generated outputs of BART, examining how the accuracy of each system differs by error type compared with a statistical method called Hanspell,2 and provide use cases and insights gained from the analysis. To summarize, the contributions of this paper are as follows: (1) the collection of three different types of parallel corpora for Korean GEC, (2) a novel grammatical error annotation toolkit for Korean called KAGAS, and (3) simple yet effective open-sourced baseline Korean GEC models trained on our datasets, with detailed analysis by KAGAS.
2 Related Work
Datasets Well-curated datasets in each language
are crucial to build a GEC system that can capture
language-specific characteristics (Bender,2011).
In addition to several shared tasks on English GEC
(Ng et al.,2014;Bryant et al.,2019;Rao et al.,
2018), resources for GEC in other languages are
also available (Wu et al.,2018;Li et al.,2018;Ro-
zovskaya and Roth,2019;Koyama et al.,2020;
Boyd, 2018). Existing works on Korean GEC (Min et al., 2020; Lee et al., 2021; Park et al., 2020) are challenging to replicate because they use internal datasets, or existing datasets without providing pre-processing details and scripts. Therefore, it is urgent to provide publicly available datasets in a unified and easily accessible form, with fully reproducible pre-processing pipelines, for GEC research on Korean.
Evaluation The M2 scorer (Dahlmeier and Ng, 2012), which measures precision, recall, and F0.5 scores based on edits, is the standard evaluation metric for English GEC models. It requires an M2 file with annotations of the edit paths from an erroneous sentence to a corrected sentence. However, it is expensive to collect such annotations from human workers, as they are often required to have expert linguistic knowledge. When these annotations are not available, GLEU (Napoles et al., 2015), a simple variant of BLEU (Papineni et al., 2002) based on simple n-gram matching, is used instead. Another way of generating an M2 file for English in a rule-based manner is to use the error annotation toolkit ERRANT (Bryant et al., 2017). We extend ERRANT
to make KAGAS and utilize it to align and annotate edits on our datasets, producing an M2 file for evaluating Korean GEC models.

2https://speller.cs.pusan.ac.kr/

                      Kor-Learner  Kor-Native  Kor-Lang8
# Sentence pairs           28,426      17,559    109,559
Avg. token length           14.86       15.22      13.07
# Edits                    59,419      29,975    262,833
# Edits / sentence           2.09        1.71       2.40
Avg. tokens per edit         0.97        1.40       0.92
Prop. tokens changed       28.01%      29.37%     39.42%
Table 1: Data statistics for Kor-Learner, Kor-Native, and Kor-Lang8.
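To make the evaluation format concrete, the sketch below parses the standard M2 layout (an `S` line with the tokenized source sentence followed by `A` edit lines). The `Edit` dataclass and the field handling are our own illustrative simplification, not code from the official M2 scorer.

```python
from dataclasses import dataclass

@dataclass
class Edit:
    start: int       # index of the first source token covered by the edit
    end: int         # one past the last source token covered
    etype: str       # error type label, e.g. "VERB" or "WS"
    correction: str  # replacement text ("" for a pure deletion)

def parse_m2(text):
    """Parse M2 blocks of the form:
    S <tokenized source sentence>
    A <start> <end>|||<type>|||<correction>|||<required>|||<comment>|||<annotator>
    Returns a list of (source_tokens, edits) pairs, one per block."""
    entries = []
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                span, etype, cor = line[2:].split("|||")[:3]
                start, end = map(int, span.split())
                if etype != "noop":  # "A -1 -1|||noop|||..." marks an unchanged sentence
                    edits.append(Edit(start, end, etype, cor))
        entries.append((tokens, edits))
    return entries
```

An M2 scorer then compares the hypothesis edits against these gold edits to compute precision, recall, and F0.5.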
Models Early works on Korean GEC focus on de-
tecting particle errors with statistical methods (Lee
et al.,2012;Israel et al.,2013;Dickinson et al.,
2011). A copy-augmented transformer (Zhao et al.,
2019) by pre-training to denoise and fine-tuning
with paired data demonstrates remarkable perfor-
mance and is widely used in GEC. Recent studies
(Min et al.,2020;Lee et al.,2021;Park et al.,2020)
apply this method for Korean GEC. On the other
hand, Katsumata and Komachi (2020) show that
BART (Lewis et al.,2020), known to be effective
on conditioned generation tasks, can be used to
build a strong baseline for GEC systems. Following this work, we load the pre-trained weights from KoBART,3 a Korean version of BART, and fine-tune it using our GEC datasets.
3 Data Collection
We build three corpora for Korean GEC: Kor-
Learner (§3.1), Kor-Native (§3.2), and Kor-
Lang8 (§3.3). The statistics of each dataset are described in Table 1. We describe the main characteristics and source of each dataset, and how it is preprocessed, in the following subsections. We expect that
different characteristics of these diverse datasets
in terms of quantity, quality, and error type distri-
butions (Figure 1) allow us to train and evaluate a
robust GEC model.
3.1 Korean Learner Corpus (Kor-Learner)
The Korean Learner Corpus is made from the NIKL learner corpus.4 The NIKL learner corpus contains essays written by Korean learners and their grammatical error correction annotations by their tutors in a morpheme-level XML file format.

3https://github.com/SKT-AI/KoBART
4The NIKL learner corpus is created by the National Institute of Korean Language (NIKL). Usage is allowed only for research purposes, and the origin (NIKL) must be cited when using it.

Figure 1: The distribution of error types for our proposed datasets (upper three). Percentages are shown for error types larger than 5%. The bottom row (Lang8) is for comparison with Kor-Lang8. Kor-Lang8 has fewer outliers and fewer INS and DEL edits than the original Lang8 (bottom). The distributions for KFL learners show that NOUN, CONJ, END, and PART errors are more frequent than for native speakers, while word spacing errors (purple) are the most frequent for native speakers (Kor-Native), similar to previous corpus studies (Kim, 2020; Lee, 2020).

The original format is described in Appendix A.4.1. Even
though the NIKL learner corpus contains annotations by professional Korean tutors, it cannot be used directly as a corpus for training and evaluation, for two reasons. First, we cannot recover the corrected sentence from the original file, nor convert the dataset into an M2 file format (Section 2), since the dataset provides morpheme-level (syllable-level) correction annotations rather than word-level edits. A simple concatenation of morpheme-level edits does not form a complete word, since Korean is an agglutinative language. Therefore, we refer to the current Korean orthography guidelines5 to merge morpheme-level syllables into Korean words (Appendix A.4.3).6 Second, some XML files had empty edits, missing tags, and edit correction tags that were inconsistent across annotators, so additional refinement and proofreading were required. The authors therefore manually inspected the output parallel corpora and discarded sentences with insufficient annotations (Appendix A.4.2). After applying these modifications to the NIKL corpus, we obtained Kor-Learner, which contains high-quality word-level parallel sentences.
3.2 Native Korean Corpus (Kor-Native)
The purpose of this corpus is to build a parallel cor-
pus representing grammatical errors native Korean
speakers make. Because the Korean orthography
guidelines are complicated, consisting of 57 rules with numerous exceptions,5 only a few native Korean speakers fully internalize and correctly apply all of them. Thus, the standard approach depends on the manpower of Korean language experts, which is not scalable and is very costly. We therefore introduce a novel method to create a large parallel GEC corpus from correct sentences, which depends not on the manpower of experts but on the general public of native Korean speakers. Our method is characterized as a backward approach. We collect grammatically correct sentences from two sources7 and read the correct sentences using the Google Text-to-Speech (TTS) system. We asked the general public to listen to the grammatically correct sentences and transcribe them. The transcribed sentences may be incorrect, containing the grammatical errors that listeners often make. Figure 1 shows that most of the collected error types concern word spacing. While the distributions of transcribed and written language cannot be exactly identical, we observe that the error type distribution of Kor-Native aligns with that of native Korean (Shin et al., 2015) in that both are dominated by word spacing errors, which means that the error types of Kor-Native can serve as a reasonable representative of real-world writing errors made by native Koreans. After the filtering process described in Appendix A.2, we have 17,559 sentence pairs containing grammatical errors.

5http://kornorms.korean.go.kr/regltn/regltnView.do
6The merging codes are also open-sourced in our repository.
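The pairing step of this backward approach can be sketched as follows. The function name and its filtering condition are illustrative only; the released pipeline additionally applies the filtering steps described in Appendix A.2.

```python
def build_pairs(references, transcriptions):
    """Pair each crowd-sourced transcription (possibly erroneous) with the
    grammatically correct reference sentence it was dictated from, keeping
    only pairs where the transcription actually differs from the reference."""
    pairs = []
    for ref, hyp in zip(references, transcriptions):
        hyp = hyp.strip()
        if hyp and hyp != ref:  # keep only transcriptions that contain an error
            pairs.append({"original": hyp, "corrected": ref})
    return pairs
```

Because the reference sentence is known to be correct by construction, no expert post-editing is needed to obtain the "corrected" side of each pair.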
3.3 Lang-8 Korean Corpus (Kor-Lang8)
Lang-8 is one of the largest social platforms for language learners (Mizumoto et al., 2011).8 We extract Korean data from the NAIST Lang-8 Learner Corpora9 by the language label, resulting in 21,779 Korean sentence pairs. However, some texts are answers to language-related questions rather than corrections. The texts inside the raw Lang-8 corpus are noisy, and not all of them form pairs, as previous work building a Japanese corpus from Lang-8 (Koyama et al., 2020) also pointed out. To build a GEC dataset with a high proportion of grammatical edits, we filtered out sentence pairs with a set of cleanup rules based on Korean linguistics, described in Appendix A.3.

7(1) The Center for Teaching and Learning for Korean, and (2) the National Institute of Korean Language
8https://lang-8.com

                 Valid loss  Self-GLEU  GLEU on KoBART  Dataset size
Lang8 (Bef.)          1.53      15.01           19.69       204,130
Kor-Lang8 (Aft.)      0.83      19.38           28.57       109,559
Table 2: Evaluation scores on the validation set for Lang8 (Mizumoto et al., 2011), the original Lang8 dataset filtered to unique Korean pairs, and Kor-Lang8, after the refinement of §3.3.
Comparison with original Lang8. To demonstrate the increased quality of Kor-Lang8, we compare model training results and error type distributions between the original Korean portion of Lang8 and Kor-Lang8. We apply minimal pre-processing to the original Korean Lang8 data, discarding texts that do not form pairs and keeping unique original-corrected sentence pairs, to enable training and make a fair comparison with Kor-Lang8, leaving 204,130 pairs. Table 2 shows that a model trained with Kor-Lang8 achieves better results, with lower validation loss, higher self-GLEU scores (§5.1), and higher scores when trained with KoBART, showing that there are fewer outliers in Kor-Lang8.10 Figure 1 shows the difference in error type distributions before and after the Lang8 refinement.
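Since GLEU-based scores recur in Table 2, a simplified sentence-level variant is sketched below for intuition. Note this is only an approximation: the actual GLEU of Napoles et al. (2015) additionally penalizes n-grams that the hypothesis shares with the erroneous source but not with the reference, and the paper's self-GLEU (§5.1) is computed with the authors' own setup.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_gleu(hyp, ref, max_n=4):
    """Simplified GLEU-style score: geometric mean of clipped n-gram
    precisions against the reference, with add-one smoothing and a
    BLEU-like brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        total = sum(h.values())
        if total == 0:  # hypothesis too short for this n-gram order
            break
        match = sum(min(c, r[g]) for g, c in h.items())
        precisions.append((match + 1) / (total + 1))
    if not precisions:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(sum(map(math.log, precisions)) / len(precisions))
```

A hypothesis identical to the reference scores 1.0, and the score degrades as the n-gram overlap shrinks, which is the behavior the Table 2 comparison relies on.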
4 KAGAS
We propose the Korean Automatic Grammatical error Annotation System (KAGAS), which automatically aligns edits and annotates error types on parallel corpora, overcoming many disadvantages of hand-written annotation by humans (Appendix B.2). Figure 2 shows an overview of KAGAS. As the scope of the system is to extract edits and annotate each edit with an error type, our system assumes that the given corrected sentence is grammatically correct. It then takes a pair of the original sentence and the corrected sentence as input and outputs aligned edits with error types. We further extend the usage of KAGAS to analyze the generated text of our baseline models by error type in Table 7 of Section 6. In this section, we describe in detail the construction and contributions of KAGAS, along with human evaluation results.

9https://sites.google.com/site/naistlang8corpora/
10Since redistribution of the NAIST Lang-8 Learner Corpora is not allowed, we provide the full script used to automatically create Kor-Lang8, with permission to use the corpora for research purposes only.

Figure 2: An example of an M2 file output by KAGAS, translated into English. Note that "to school" is treated as one word in the translated example.
4.1 Automatic Error Annotation for Other Languages
Creating a sufficient amount of human-annotated data for GEC in languages other than English is not trivial. To address this problem, there have been attempts to adapt ERRANT (Bryant et al., 2017) to other languages for error type annotation, such as Czech (Náplava et al., 2022), Hindi (Sonawane et al., 2020), Russian (Katinskaia et al., 2022), German (Boyd, 2018), and Arabic (Belkebir and Habash, 2021), but no existing work has previously extended ERRANT to Korean. When extending ERRANT to other languages, necessary changes to the error types were made, such as discarding error types that do not apply and adding language-specific error types.11
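One language-specific type of the kind described, the Korean WS (word spacing) type, can be sketched as a toy rule. This is a hypothetical simplification for illustration, not the actual KAGAS decision logic, which also inspects POS tags, lemmas, and morphology.

```python
def classify_edit(orig_span: str, cor_span: str) -> str:
    """Toy error-type rules in the spirit of KAGAS (illustrative only).

    WS fires when the two spans differ only in spacing; INS/DEL fire on
    pure insertions/deletions; everything else falls through to UNK here."""
    if orig_span != cor_span and orig_span.replace(" ", "") == cor_span.replace(" ", ""):
        return "WS"  # same characters, different word spacing
    if orig_span == "":
        return "INS"
    if cor_span == "":
        return "DEL"
    return "UNK"
```

A spacing-only rule like this has no counterpart in English ERRANT, which is one reason a direct port of ERRANT cannot cover Korean error types.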
4.2 Alignment Strategy
Before classifying error types, we need to find where the edits are in the parallel text. We first conduct sentence-level alignment to define a "single edit". We use the Damerau-Levenshtein distance (Felice et al., 2016) via the edit extraction repository12

11VERB:SVA was discarded and DIACR was added for Czech (Náplava et al., 2022). In a similar way, KAGAS made necessary changes to the error types regarding the linguistic features of Korean, such as discarding VERB:SVA and adding WS.
12https://github.com/chrisjbryant/edit-extraction
Error Code  Description (Acceptance Rate)                                Example (English translation)
INS         A word is inserted. (100.00% ± 0.00%)                        Experience to break a rule in high school
DEL         A word is deleted. (100.00% ± 0.00%)                         After the war, the generals are sentenced to death.
WS          Spacing between words is changed. (100.00% ± 0.00%)          This cloth is dirty.
WO          The order of words is changed. (97.44% ± 3.51%)              I want to learn Korean further.
SPELL       Spelling error (97.44% ± 3.51%)                              We dance at the party.
PUNCT       Punctuation error (98.72% ± 2.50%)                           It was 1993, a happening in winter.
SHORT       An edit that does not change the structure of morphemes.     Korean Language was too difficult to me.
            (73.08% ± 9.84%)
VERB        An error on a verb (79.49% ± 8.96%)                          I wrote a letter to my friend yesterday.
ADJ         An error on an adjective (73.08% ± 9.84%)                    A close friend.
NOUN        An error on a noun (75.64% ± 9.53%)                          I want to study abroad in Korea in the future.
PART        An error on a particle (97.44% ± 3.51%)                      My cousin living in Hawaii
END         An error on an ending (87.18% ± 7.42%)                       I waited for a long time.
MOD         An error on a modifier (89.74% ± 6.73%)                      I was hungry because I had such a small lunch.
CONJ        An error on conjugation (43.59% ± 11.00%)                    I went to a barber to get my hair cut today.
Table 3: Full category of error types used in KAGAS. The middle column shows each type's description and its acceptance rate from human evaluation; the rightmost column gives the English translation of an example for each type. Other edits are classified as UNK.
to get edit pairs. Note that we apply a different alignment strategy from ERRANT regarding the scope of a "single" edit. We use a Korean-specific linguistic cost,13 so that word pairs with lower POS cost and lower lemma cost are more likely to be aligned together. We also use custom merging rules that merge single word-level edits into WO and WS edits. Therefore, the number of total edits, the average token length per edit, and the output M2 file made by KAGAS differ from those of ERRANT, since an M2 file consists of edit alignments and error types (Fig. 2). This results in different M2 scores when applied to model output evaluation.

13Kkma POS tagger, Korean lemmatizer
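The alignment step above can be approximated as follows. KAGAS itself uses Damerau-Levenshtein alignment with Korean POS and lemma costs plus the merging rules described; this sketch substitutes Python's difflib subsequence matcher over raw tokens, so it illustrates only the shape of the extracted edits, not the KAGAS cost function.

```python
from difflib import SequenceMatcher

def extract_edits(orig_tokens, cor_tokens):
    """Return word-level edits as (start, end, replacement) triples, where
    [start, end) indexes the original tokens and replacement is the corrected
    text ("" for a deletion), mirroring the edit spans written to an M2 file."""
    sm = SequenceMatcher(a=orig_tokens, b=cor_tokens, autojunk=False)
    return [(i1, i2, " ".join(cor_tokens[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]  # keep only replace/insert/delete spans
```

Each extracted span would then be passed to the error type classifier to produce one annotated line of the M2 file.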
4.3 Error types for Korean
We describe how we consider the unique linguis-
tic characteristics of Korean (Appendix B.1), and
define 14 error types (Table 3).
Classifying error types at the morpheme level As Korean is an agglutinative language, the difference between an original and a corrected word is naturally defined at the morpheme level. For example, 학교에 ('to school') in Figure 2 is divided into two parts, 학교 ('school') + 에 ('to'), based on their roles in the word. If this word is corrected to 집에 ('to home'), we should treat this edit as NOUN (학교 -> 집), and if it is corrected to 학교에서 ('at school'), we should classify the edit as PART (Particle, since -서 is added). We need to break