Towards standardizing Korean Grammatical Error Correction:
Datasets and Annotation
Soyoung Yoon
KAIST AI
soyoungyoon@kaist.ac.kr
Sungjoon Park
SoftlyAI Research, SoftlyAI
sungjoon.park@softly.ai
Gyuwan Kim
UCSB
gyuwankim@ucsb.edu
Junhee Cho
Google
junheecho@google.com
Kihyo Park
Cornell University
kp467@cornell.edu
Gyutae Kim
SoftlyAI
gt.kim@softly.ai
Minjoon Seo
KAIST AI
minjoon@kaist.ac.kr
Alice Oh
KAIST SC
alice.oh@kaist.edu
Abstract
Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English. We attribute this problematic circumstance to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to build an evaluation benchmark for Korean, and present baseline models trained on our datasets. We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.1
1 Introduction
Writing grammatically correct Korean sentences is difficult for learners studying Korean as a Foreign Language (KFL), and even for native Korean speakers, due to morphological and orthographical complexity involving particles, spelling, and collocation. Korean's word spacing rules are complex, with many domain-dependent exceptions that only around 20% of native speakers understand thoroughly (Lee, 2014). Since Korean is an agglutinative language (Sohn, 2001; Song, 2006), getting used to Korean grammar is time-consuming for KFL learners whose mother tongue is non-
agglutinative (Haupt et al., 2017; Kim, 2020). However, despite the growing number of KFL learners (Lee, 2018), little research has been conducted on Korean Grammatical Error Correction (GEC) because of the previously described difficulties of the Korean language. Another major obstacle to developing a Korean GEC system is the lack of resources for training a machine learning model.

University of California, Santa Barbara
School of Computing, KAIST
1Code for model checkpoints and KAGAS: https://github.com/soyoung97/Standard_Korean_GEC. Dataset request form: https://forms.gle/kF9pvJbLGvnh8ZnQ6
In this paper, we propose three datasets that cover various grammatical errors from different types of annotators and learners. The first dataset, Kor-Native, is crowd-sourced from native Korean speakers. The second, Kor-Learner, comes from KFL learners and consists of essays with detailed corrections and annotations by Korean tutors. The third, Kor-Lang8, is similar to Kor-Learner, except that its sentences, produced by KFL learners, are corrected by native Koreans on social platforms who are not necessarily linguistic experts. We also analyze our datasets in terms of error type distributions.
While our proposed parallel corpora can serve as a valuable resource for training a machine learning model, another concern is the annotation of the datasets. Most existing datasets have no annotation, which makes them hard to use for evaluation. The major weaknesses of human annotation are that (1) experts specialized in Korean grammar are expensive to hire, (2) having them annotate a large number of parallel corpora is not scalable, and (3) the error types and schema differ across datasets and annotators, which is counterproductive. Another way to analyze and evaluate the datasets is automatic annotation from parallel corpora. While such a system exists for English, called ERRANT (Bryant et al., 2017), there is no automatic error type detection system for Korean. We cannot fully demonstrate and classify error types and edits by using ERRANT, because Korean has many characteristics different from English (Section 4.5). This motivates us to develop an automated error annotation system for Korean (KAGAS), along with annotated error types of refined corpora using KAGAS.

arXiv:2210.14389v3 [cs.CL] 24 May 2023
Lastly, we build a simple yet effective baseline model based on BART (Lewis et al., 2019) trained on our datasets. We further analyze the generated outputs of BART, examining how the accuracy of each system differs by error type compared with a statistical method called Hanspell,2 and provide use cases and insights gained from the analysis. To summarize, the contributions of this paper are as follows: (1) the collection of three different types of parallel corpora for Korean GEC, (2) a novel grammatical error annotation toolkit for Korean called KAGAS, and (3) simple yet effective open-sourced baseline Korean GEC models trained on our datasets, with detailed analysis by KAGAS.
2 Related Work
Datasets Well-curated datasets in each language
are crucial to build a GEC system that can capture
language-specific characteristics (Bender,2011).
In addition to several shared tasks on English GEC
(Ng et al.,2014;Bryant et al.,2019;Rao et al.,
2018), resources for GEC in other languages are
also available (Wu et al.,2018;Li et al.,2018;Ro-
zovskaya and Roth,2019;Koyama et al.,2020;
Boyd, 2018). Existing works on Korean GEC (Min et al., 2020; Lee et al., 2021; Park et al., 2020) are challenging to replicate because they use internal datasets, or existing datasets without providing pre-processing details and scripts. Therefore, it is urgent to provide publicly available datasets in a unified and easily accessible form, with fully reproducible pre-processing pipelines, for GEC research on Korean.
Evaluation The M2 scorer (Dahlmeier and Ng, 2012), which measures precision, recall, and F0.5 scores based on edits, is the standard evaluation metric for English GEC models. It requires an M2 file with annotations of the edit paths from an erroneous sentence to a corrected sentence. However, it is expensive to collect such annotations from human workers, as they are often required to have expert linguistic knowledge. When these annotations are not available, GLEU (Napoles et al., 2015), a simple variant of BLEU (Papineni et al., 2002) based on simple n-gram matching, is used instead. Another way of generating an M2 file for English in a rule-based manner is to use the error annotation toolkit ERRANT (Bryant et al., 2017). We extend ERRANT
to make KAGAS and utilize it to align and annotate edits on our datasets, producing an M2 file for evaluating Korean GEC models.

2https://speller.cs.pusan.ac.kr/

                      Kor-Learner  Kor-Native  Kor-Lang8
# Sentence pairs           28,426      17,559    109,559
Avg. token length           14.86       15.22      13.07
# Edits                    59,419      29,975    262,833
# Edits / sentence           2.09        1.71       2.40
Avg. tokens per edit         0.97        1.40       0.92
Prop. tokens changed       28.01%      29.37%     39.42%
Table 1: Data statistics for Kor-Learner, Kor-Native, and Kor-Lang8.
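To make the evaluation format concrete, the sketch below parses the standard M2 layout (an `S` line with the tokenized source sentence followed by `A` edit lines). The `Edit` dataclass and the field handling are our own illustrative simplification, not code from the official M2 scorer.

```python
from dataclasses import dataclass

@dataclass
class Edit:
    start: int       # index of the first source token covered by the edit
    end: int         # one past the last source token covered
    etype: str       # error type label, e.g. "VERB" or "WS"
    correction: str  # replacement text ("" for a pure deletion)

def parse_m2(text):
    """Parse M2 blocks of the form:
    S <tokenized source sentence>
    A <start> <end>|||<type>|||<correction>|||<required>|||<comment>|||<annotator>
    Returns a list of (source_tokens, edits) pairs, one per block."""
    entries = []
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                span, etype, cor = line[2:].split("|||")[:3]
                start, end = map(int, span.split())
                if etype != "noop":  # "A -1 -1|||noop|||..." marks an unchanged sentence
                    edits.append(Edit(start, end, etype, cor))
        entries.append((tokens, edits))
    return entries
```

An M2 scorer then compares the hypothesis edits against these gold edits to compute precision, recall, and F0.5.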
Models Early works on Korean GEC focus on de-
tecting particle errors with statistical methods (Lee
et al.,2012;Israel et al.,2013;Dickinson et al.,
2011). A copy-augmented transformer (Zhao et al.,
2019) by pre-training to denoise and fine-tuning
with paired data demonstrates remarkable perfor-
mance and is widely used in GEC. Recent studies
(Min et al.,2020;Lee et al.,2021;Park et al.,2020)
apply this method for Korean GEC. On the other
hand, Katsumata and Komachi (2020) show that
BART (Lewis et al.,2020), known to be effective
on conditioned generation tasks, can be used to
build a strong baseline for GEC systems. Following this work, we load the pre-trained weights from KoBART,3 a Korean version of BART, and fine-tune it using our GEC datasets.
3 Data Collection
We build three corpora for Korean GEC: Kor-
Learner (§3.1), Kor-Native (§3.2), and Kor-
Lang8 (§3.3). The statistics of each dataset are described in Table 1. We describe the main characteristics and source of each dataset, and how it is preprocessed, in the following subsections. We expect that
different characteristics of these diverse datasets
in terms of quantity, quality, and error type distri-
butions (Figure 1) allow us to train and evaluate a
robust GEC model.
3.1 Korean Learner Corpus (Kor-Learner)
The Korean Learner Corpus is made from the NIKL learner corpus.4 The NIKL learner corpus contains essays written by Korean learners and their grammatical error correction annotations by their tutors in a morpheme-level XML file format.

3https://github.com/SKT-AI/KoBART
4The NIKL learner corpus is created by the National Institute of Korean Language (NIKL). Usage is allowed only for research purposes, and the origin (NIKL) must be cited when using it.

Figure 1: The distribution of error types for our proposed datasets (upper three). Percentages are shown for error types larger than 5%. The bottom row (Lang8) is for comparison with Kor-Lang8. Kor-Lang8 has fewer outliers and fewer INS and DEL edits than the original Lang8 (bottom). The distributions for KFL learners show that NOUN, CONJ, END, and PART errors are more frequent than for native speakers, while word spacing errors (purple) are the most frequent for native speakers (Kor-Native), similar to previous corpus studies (Kim, 2020; Lee, 2020).

The original format is described in Appendix A.4.1. Even
though the NIKL learner corpus contains annotations by professional Korean tutors, it cannot be used directly as a corpus for training and evaluation, for two reasons. First, we cannot recover the corrected sentence from the original file, nor convert the dataset into an M2 file format (Section 2), since the dataset provides morpheme-level (syllable-level) correction annotations rather than word-level edits. A simple concatenation of morpheme-level edits does not form a complete word, since Korean is an agglutinative language. Therefore, we refer to the current Korean orthography guidelines5 to merge morpheme-level syllables into Korean words (Appendix A.4.3).6 Second, some XML files had empty edits, missing tags, and edit correction tags that were inconsistent across annotators, so additional refinement and proofreading were required. The authors therefore manually inspected the output parallel corpora and discarded sentences with insufficient annotations (Appendix A.4.2). After applying these modifications to the NIKL corpus, we obtained Kor-Learner, which contains high-quality word-level parallel sentences.
3.2 Native Korean Corpus (Kor-Native)
The purpose of this corpus is to build a parallel cor-
pus representing grammatical errors native Korean
speakers make. Because the Korean orthography
guidelines are complicated, consisting of 57 rules with numerous exceptions,5 only a few native Korean speakers fully internalize and correctly apply all of them. Thus, the standard approach depends on the manpower of Korean language experts, which is not scalable and is very costly. We therefore introduce a novel method to create a large parallel GEC corpus from correct sentences, which depends not on the manpower of experts but on the general public of native Korean speakers. Our method is characterized as a backward approach. We collect grammatically correct sentences from two sources7 and read the correct sentences using the Google Text-to-Speech (TTS) system. We asked the general public to listen to the grammatically correct sentences and transcribe them. The transcribed sentences may be incorrect, containing the grammatical errors that listeners often make. Figure 1 shows that most of the collected error types concern word spacing. While the distributions of transcribed and written language cannot be exactly identical, we observe that the error type distribution of Kor-Native aligns with that of native Korean (Shin et al., 2015) in that both are dominated by word spacing errors, which means that the error types of Kor-Native can serve as a reasonable representative of real-world writing errors made by native Koreans. After the filtering process described in Appendix A.2, we have 17,559 sentence pairs containing grammatical errors.

5http://kornorms.korean.go.kr/regltn/regltnView.do
6The merging codes are also open-sourced in our repository.
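The pairing step of this backward approach can be sketched as follows. The function name and its filtering condition are illustrative only; the released pipeline additionally applies the filtering steps described in Appendix A.2.

```python
def build_pairs(references, transcriptions):
    """Pair each crowd-sourced transcription (possibly erroneous) with the
    grammatically correct reference sentence it was dictated from, keeping
    only pairs where the transcription actually differs from the reference."""
    pairs = []
    for ref, hyp in zip(references, transcriptions):
        hyp = hyp.strip()
        if hyp and hyp != ref:  # keep only transcriptions that contain an error
            pairs.append({"original": hyp, "corrected": ref})
    return pairs
```

Because the reference sentence is known to be correct by construction, no expert post-editing is needed to obtain the "corrected" side of each pair.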
3.3 Lang-8 Korean Corpus (Kor-Lang8)
Lang-8 is one of the largest social platforms for language learners (Mizumoto et al., 2011).8 We extract Korean data from the NAIST Lang-8 Learner Corpora9 by the language label, resulting in 21,779 Korean sentence pairs. However, some texts are answers to language-related questions rather than corrections. The texts inside the raw Lang-8 corpus are noisy, and not all of them form pairs, as previous work building a Japanese corpus from Lang-8 (Koyama et al., 2020) also pointed out. To build a GEC dataset with a high proportion of grammatical edits, we filtered out sentence pairs with a set of cleanup rules based on Korean linguistics, described in Appendix A.3.

7(1) The Center for Teaching and Learning for Korean, and (2) the National Institute of Korean Language
8https://lang-8.com

                 Valid loss  Self-GLEU  GLEU on KoBART  Dataset size
Lang8 (Bef.)          1.53      15.01           19.69       204,130
Kor-Lang8 (Aft.)      0.83      19.38           28.57       109,559
Table 2: Evaluation scores on the validation set for Lang8 (Mizumoto et al., 2011), the original Lang8 dataset filtered to unique Korean pairs, and Kor-Lang8, after the refinement of §3.3.
Comparison with original Lang8. To demonstrate the increased quality of Kor-Lang8, we compare model training results and error type distributions between the original Korean portion of Lang8 and Kor-Lang8. We apply minimal pre-processing to the original Korean Lang8 data, discarding texts that do not form pairs and keeping unique original-corrected sentence pairs, to enable training and make a fair comparison with Kor-Lang8, leaving 204,130 pairs. Table 2 shows that a model trained with Kor-Lang8 achieves better results, with lower validation loss, higher self-GLEU scores (§5.1), and higher scores when trained with KoBART, showing that there are fewer outliers in Kor-Lang8.10 Figure 1 shows the difference in error type distributions before and after the Lang8 refinement.
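Since GLEU-based scores recur in Table 2, a simplified sentence-level variant is sketched below for intuition. Note this is only an approximation: the actual GLEU of Napoles et al. (2015) additionally penalizes n-grams that the hypothesis shares with the erroneous source but not with the reference, and the paper's self-GLEU (§5.1) is computed with the authors' own setup.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_gleu(hyp, ref, max_n=4):
    """Simplified GLEU-style score: geometric mean of clipped n-gram
    precisions against the reference, with add-one smoothing and a
    BLEU-like brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        total = sum(h.values())
        if total == 0:  # hypothesis too short for this n-gram order
            break
        match = sum(min(c, r[g]) for g, c in h.items())
        precisions.append((match + 1) / (total + 1))
    if not precisions:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(sum(map(math.log, precisions)) / len(precisions))
```

A hypothesis identical to the reference scores 1.0, and the score degrades as the n-gram overlap shrinks, which is the behavior the Table 2 comparison relies on.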
4 KAGAS
We propose the Korean Automatic Grammatical error Annotation System (KAGAS), which automatically aligns edits and annotates error types on parallel corpora, overcoming many disadvantages of hand-written annotation by humans (Appendix B.2). Figure 2 shows an overview of KAGAS. As the scope of the system is to extract edits and annotate each edit with an error type, our system assumes that the given corrected sentence is grammatically correct. It then takes a pair of the original sentence and the corrected sentence as input and outputs aligned edits with error types. We further extend the usage of KAGAS to analyze the generated text of our baseline models by error type in Table 7 of Section 6. In this section, we describe in detail the construction and contributions of KAGAS, along with human evaluation results.

9https://sites.google.com/site/naistlang8corpora/
10Since redistribution of the NAIST Lang-8 Learner Corpora is not allowed, we provide the full script used to automatically create Kor-Lang8, with permission to use the corpora for research purposes only.

Figure 2: An example of an M2 file output by KAGAS, translated into English. Note that "to school" is treated as one word in the translated example.
4.1 Automatic Error Annotation for Other Languages
Creating a sufficient amount of human-annotated data for GEC in languages other than English is not trivial. To address this problem, there have been attempts to adapt ERRANT (Bryant et al., 2017) to other languages for error type annotation, such as Czech (Náplava et al., 2022), Hindi (Sonawane et al., 2020), Russian (Katinskaia et al., 2022), German (Boyd, 2018), and Arabic (Belkebir and Habash, 2021), but no existing work has previously extended ERRANT to Korean. When extending ERRANT to other languages, necessary changes to the error types were made, such as discarding error types that do not apply and adding language-specific error types.11
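One language-specific type of the kind described, the Korean WS (word spacing) type, can be sketched as a toy rule. This is a hypothetical simplification for illustration, not the actual KAGAS decision logic, which also inspects POS tags, lemmas, and morphology.

```python
def classify_edit(orig_span: str, cor_span: str) -> str:
    """Toy error-type rules in the spirit of KAGAS (illustrative only).

    WS fires when the two spans differ only in spacing; INS/DEL fire on
    pure insertions/deletions; everything else falls through to UNK here."""
    if orig_span != cor_span and orig_span.replace(" ", "") == cor_span.replace(" ", ""):
        return "WS"  # same characters, different word spacing
    if orig_span == "":
        return "INS"
    if cor_span == "":
        return "DEL"
    return "UNK"
```

A spacing-only rule like this has no counterpart in English ERRANT, which is one reason a direct port of ERRANT cannot cover Korean error types.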
4.2 Alignment Strategy
Before classifying error types, we need to find where the edits are in the parallel text. We first conduct sentence-level alignment to define a "single edit". We use the Damerau-Levenshtein distance (Felice et al., 2016) via the edit extraction repository12

11VERB:SVA was discarded and DIACR was added for Czech (Náplava et al., 2022). In a similar way, KAGAS made necessary changes to the error types regarding the linguistic features of Korean, such as discarding VERB:SVA and adding WS.
12https://github.com/chrisjbryant/edit-extraction
Error Code  Description (Acceptance Rate)                                Example (English translation)
INS         A word is inserted. (100.00% ± 0.00%)                        Experience to break a rule in high school
DEL         A word is deleted. (100.00% ± 0.00%)                         After the war, the generals are sentenced to death.
WS          Spacing between words is changed. (100.00% ± 0.00%)          This cloth is dirty.
WO          The order of words is changed. (97.44% ± 3.51%)              I want to learn Korean further.
SPELL       Spelling error (97.44% ± 3.51%)                              We dance at the party.
PUNCT       Punctuation error (98.72% ± 2.50%)                           It was 1993, a happening in winter.
SHORT       An edit that does not change the structure of morphemes.     Korean Language was too difficult to me.
            (73.08% ± 9.84%)
VERB        An error on a verb (79.49% ± 8.96%)                          I wrote a letter to my friend yesterday.
ADJ         An error on an adjective (73.08% ± 9.84%)                    A close friend.
NOUN        An error on a noun (75.64% ± 9.53%)                          I want to study abroad in Korea in the future.
PART        An error on a particle (97.44% ± 3.51%)                      My cousin living in Hawaii
END         An error on an ending (87.18% ± 7.42%)                       I waited for a long time.
MOD         An error on a modifier (89.74% ± 6.73%)                      I was hungry because I had such a small lunch.
CONJ        An error on conjugation (43.59% ± 11.00%)                    I went to a barber to get my hair cut today.
Table 3: Full category of error types used in KAGAS. The middle column shows each type's description and its acceptance rate from human evaluation; the rightmost column gives the English translation of an example for each type. Other edits are classified as UNK.
to get edit pairs. Note that we apply a different alignment strategy from ERRANT regarding the scope of a "single" edit. We use a Korean-specific linguistic cost,13 so that word pairs with lower POS cost and lower lemma cost are more likely to be aligned together. We also use custom merging rules that merge single word-level edits into WO and WS edits. Therefore, the number of total edits, the average token length per edit, and the output M2 file made by KAGAS differ from those of ERRANT, since an M2 file consists of edit alignments and error types (Fig. 2). This results in different M2 scores when applied to model output evaluation.

13Kkma POS tagger, Korean lemmatizer
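The alignment step above can be approximated as follows. KAGAS itself uses Damerau-Levenshtein alignment with Korean POS and lemma costs plus the merging rules described; this sketch substitutes Python's difflib subsequence matcher over raw tokens, so it illustrates only the shape of the extracted edits, not the KAGAS cost function.

```python
from difflib import SequenceMatcher

def extract_edits(orig_tokens, cor_tokens):
    """Return word-level edits as (start, end, replacement) triples, where
    [start, end) indexes the original tokens and replacement is the corrected
    text ("" for a deletion), mirroring the edit spans written to an M2 file."""
    sm = SequenceMatcher(a=orig_tokens, b=cor_tokens, autojunk=False)
    return [(i1, i2, " ".join(cor_tokens[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]  # keep only replace/insert/delete spans
```

Each extracted span would then be passed to the error type classifier to produce one annotated line of the M2 file.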
4.3 Error types for Korean
We describe how we consider the unique linguis-
tic characteristics of Korean (Appendix B.1), and
define 14 error types (Table 3).
Classifying error types at the morpheme level As Korean is an agglutinative language, the difference between an original and a corrected word is naturally defined at the morpheme level. For example, 학교에 ('to school') in Figure 2 is divided into two parts, 학교 ('school') + 에 ('to'), based on their roles in the word. If this word is corrected to 집에 ('to home'), we should treat this edit as NOUN (학교 -> 집), and if it is corrected to 학교에서 ('at school'), we should classify the edit as PART (Particle, since -서 is added). We need to break