
Korean (KAGAS), along with annotated error types
of refined corpora using KAGAS.
Lastly, we build a simple yet effective baseline model based on BART (Lewis et al., 2020) trained on our datasets. We further analyze the generated outputs of BART, examining how the accuracy of each system differs by error type when compared with a statistical method called Hanspell,² and provide use cases and insights gained from this analysis. To summarize, the contributions of this paper are as follows: (1) collection of three different types of parallel corpora for Korean GEC, (2) a novel grammatical error annotation toolkit for Korean called KAGAS, and (3) simple yet effective open-sourced baseline Korean GEC models trained on our datasets, with detailed analysis by KAGAS.
2 Related Work
Datasets Well-curated datasets in each language are crucial to build a GEC system that can capture language-specific characteristics (Bender, 2011). In addition to several shared tasks on English GEC (Ng et al., 2014; Bryant et al., 2019; Rao et al., 2018), resources for GEC in other languages are also available (Wu et al., 2018; Li et al., 2018; Rozovskaya and Roth, 2019; Koyama et al., 2020; Boyd, 2018). Existing works on Korean GEC (Min et al., 2020; Lee et al., 2021; Park et al., 2020) are difficult to replicate because they use internal datasets, or existing datasets without providing pre-processing details and scripts. It is therefore urgent to provide publicly available datasets in a unified and easily accessible form, with fully reproducible pre-processing pipelines, for Korean GEC research.
Evaluation The M2 scorer (Dahlmeier and Ng, 2012), which measures precision, recall, and F0.5 scores based on edits, is the standard evaluation metric for English GEC models. It requires an M2 file with annotations of edit paths from an erroneous sentence to a corrected sentence. However, collecting these annotations from human workers is expensive, as they are often required to have expert linguistic knowledge. When such annotations are not available, GLEU (Napoles et al., 2015), a simple variant of BLEU (Papineni et al., 2002) based on simple n-gram matching, is used instead. Another way of generating an M2 file for English in a rule-based manner is with the error annotation toolkit ERRANT (Bryant et al., 2017). We extend ERRANT to make KAGAS and utilize it to align and annotate edits on our datasets, producing M2 files to evaluate Korean GEC models.

                      Kor-Learner  Kor-Native  Kor-Lang8
# Sentence pairs           28,426      17,559    109,559
Avg. token length           14.86       15.22      13.07
# Edits                    59,419      29,975    262,833
# Edits / sentence           2.09        1.71       2.40
Avg. tokens per edit         0.97        1.40       0.92
Prop. tokens changed       28.01%      29.37%     39.42%

Table 1: Data statistics for Kor-Learner, Kor-Lang8, and Kor-Native.

² https://speller.cs.pusan.ac.kr/
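For concreteness: an M2 file pairs each source sentence (an "S" line) with edit annotations ("A" lines of the form start end|||type|||correction|||...), and the scorer turns matched edits into precision, recall, and F0.5. The sketch below is our own illustration of the edit-based scoring step, not the official M2 scorer, which additionally aligns system edits to gold edits.

```python
# Minimal sketch of M2-style edit scoring (illustrative; the official
# scorer also performs edit alignment, which is omitted here).

def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Precision, recall, and F_beta over matched edit counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

# Hypothetical counts: 40 system edits match gold (TP), 10 do not (FP),
# and 25 gold edits are missed (FN).
print(f_beta(40, 10, 25))  # (0.8, ~0.615, F0.5 ~0.755)
```

Setting beta to 0.5 weights precision twice as heavily as recall, reflecting the common view in GEC that proposing a wrong correction is more harmful than missing one.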
Models Early works on Korean GEC focus on detecting particle errors with statistical methods (Lee et al., 2012; Israel et al., 2013; Dickinson et al., 2011). A copy-augmented transformer (Zhao et al., 2019), pre-trained with a denoising objective and fine-tuned on paired data, demonstrates remarkable performance and is widely used in GEC. Recent studies (Min et al., 2020; Lee et al., 2021; Park et al., 2020) apply this method to Korean GEC. On the other hand, Katsumata and Komachi (2020) show that BART (Lewis et al., 2020), known to be effective on conditioned generation tasks, can be used to build a strong baseline for GEC systems. Following this work, we load the pre-trained weights from KoBART,³ a Korean version of BART, and fine-tune it on our GEC datasets.
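As a concrete illustration of this setup, the sketch below loads KoBART through the HuggingFace transformers API and runs one training step on a toy (erroneous, corrected) pair. The checkpoint name gogamza/kobart-base-v2 and all hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal fine-tuning sketch. Assumptions (not from the paper):
# HuggingFace `transformers` is installed and `gogamza/kobart-base-v2`
# serves as the KoBART checkpoint; real training uses full datasets,
# batching, and tuned hyperparameters.
import torch
from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

ckpt = "gogamza/kobart-base-v2"  # assumed public KoBART checkpoint
tokenizer = PreTrainedTokenizerFast.from_pretrained(ckpt)
model = BartForConditionalGeneration.from_pretrained(ckpt)

# One toy (erroneous, corrected) pair standing in for a GEC dataset.
src = tokenizer("나는 어제 학교에 갓다.", return_tensors="pt")
labels = tokenizer("나는 어제 학교에 갔다.", return_tensors="pt")["input_ids"]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
loss = model(input_ids=src["input_ids"],
             attention_mask=src["attention_mask"],
             labels=labels).loss  # cross-entropy against the correction
loss.backward()
optimizer.step()

# Inference: beam-search decode a correction for the source sentence.
model.eval()
out = model.generate(src["input_ids"], num_beams=4, max_length=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```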
3 Data Collection
We build three corpora for Korean GEC: Kor-Learner (§3.1), Kor-Native (§3.2), and Kor-Lang8 (§3.3). The statistics of each dataset are presented in Table 1. We describe the main characteristics and source of each dataset, and how it is preprocessed, in the following subsections. We expect that the different characteristics of these diverse datasets in terms of quantity, quality, and error type distributions (Figure 1) allow us to train and evaluate a robust GEC model.
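To make the quantities in Table 1 concrete, the following sketch (our own illustration, assuming whitespace tokenization rather than KAGAS's morpheme-level alignment) computes edits per sentence and the proportion of changed tokens for a parallel corpus.

```python
# Illustrative computation of Table 1-style statistics from
# (erroneous, corrected) sentence pairs. Assumes whitespace
# tokenization; KAGAS's actual morpheme-level alignment differs.
from difflib import SequenceMatcher

pairs = [
    ("나는 어제 학교에 갓다 .", "나는 어제 학교에 갔다 ."),  # toy pair
]

n_edits = n_src_tokens = n_changed = 0
for src, tgt in pairs:
    s, t = src.split(), tgt.split()
    n_src_tokens += len(s)
    for op, i1, i2, j1, j2 in SequenceMatcher(a=s, b=t).get_opcodes():
        if op != "equal":            # 'replace', 'delete', or 'insert'
            n_edits += 1             # one contiguous edit span
            n_changed += max(i2 - i1, j2 - j1)

print("edits / sentence:", n_edits / len(pairs))          # 1.0
print("prop. tokens changed:", n_changed / n_src_tokens)  # 0.2
```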
3.1 Korean Learner Corpus (Kor-Learner)
The Korean Learner Corpus is made from the NIKL learner corpus.⁴ The NIKL learner corpus contains essays written by Korean learners and their grammatical error correction annotations by their tutors in a morpheme-level XML file format. The orig-
³ https://github.com/SKT-AI/KoBART
⁴ The NIKL learner corpus is created by the National Institute of Korean Language (NIKL). Usage is allowed only for research purposes, and citation of the origin (NIKL) is needed when using it.
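Since the NIKL schema is not reproduced here, the following is a purely hypothetical sketch of extracting (erroneous, corrected) pairs from a morpheme-level annotated XML file; every tag name below is invented for illustration.

```python
# Purely hypothetical sketch: the tag names <sentence>, <form>, and
# <corrected> are invented for illustration; the real NIKL learner
# corpus schema differs.
import xml.etree.ElementTree as ET

def extract_pairs(path: str):
    """Collect (erroneous, corrected) sentence pairs from an XML file."""
    pairs = []
    root = ET.parse(path).getroot()
    for sent in root.iter("sentence"):        # hypothetical tag
        src = sent.findtext("form")           # learner's original text
        tgt = sent.findtext("corrected")      # tutor's correction
        if src and tgt:
            pairs.append((src.strip(), tgt.strip()))
    return pairs
```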