MCSCSet: A Specialist-annotated Dataset for Medical-domain
Chinese Spelling Correction
Wangjie Jiang†, Zhihao Ye‡, Zijing Ou♠, Ruihui Zhao‡, Jianguang Zheng‡, Yi Liu‡,
Siheng Li†, Bang Liu♣, Yujiu Yang† and Yefeng Zheng‡
†Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
‡Tencent Jarvis Lab, Shenzhen, China
♠Sun Yat-sen University, Guangzhou, China
♣Université de Montréal Mila & CIFAR, Québec, Canada
{jwj20,lisiheng21}@mails.tsinghua.edu.cn,{evanzhye,zacharyzhao,jaxzheng,yefengzheng}@tencent.com
{zijingou.mail,97liuyi}@gmail.com,bang.liu@umontreal.ca,yang.yujiu@sz.tsinghua.edu.cn
ABSTRACT
Chinese Spelling Correction (CSC) is gaining increasing attention
due to its promise of automatically detecting and correcting spelling
errors in Chinese texts. Although CSC is widely used in applications
such as search engines and optical character recognition systems,
it has been little explored in medical scenarios, in which complex
and uncommon medical entities are easily misspelled. Correcting
the misspellings of medical entities is arguably more difficult
than correcting those in the open domain because it requires specific
domain knowledge. In this work, we define the task of Medical-domain
Chinese Spelling Correction and propose MCSCSet, a large-scale
specialist-annotated dataset that contains about 200k samples.
In contrast to the existing open-domain CSC datasets, MCSCSet
involves: i) extensive real-world medical queries collected from
Tencent Yidian, ii) corresponding misspelled sentences manually
annotated by medical specialists. To support automated dataset curation,
MCSCSet further offers a medical confusion set consisting
of the commonly misspelled characters of given Chinese medical
terms. This enables one to create the medical misspelling dataset
automatically. Extensive empirical studies have shown significant
performance gaps between the open-domain and medical-domain
spelling correction, highlighting the need to develop high-quality
datasets that allow for Chinese spelling correction in specific domains.
Moreover, our work benchmarks several representative Chinese
spelling correction models, establishing baselines for future
work.
1 INTRODUCTION
Misspelled characters frequently occur in manually written Chinese sentences
and can easily cause these sentences to be misunderstood.
To this end, we need a corrector to automatically detect and correct
spelling mistakes in the text. The task of Chinese Spelling Correction
(CSC) is to design such a corrector to correct spelling errors,
which plays a vital role in various Natural Language Processing
(NLP) applications such as search engines [19] and optical character
recognition systems [1]. To achieve the goal of efficient error correction,
previous work has mainly focused on designing advanced
error correction models [3, 11, 36, 40] and establishing canonical
benchmark spelling correction corpora [17, 25, 30, 37]. For example,
the well-known open-domain spelling correction corpus SIGHAN-15 [25]
was collected from a
computer-based Test of Chinese as a Foreign Language (TOCFL).
Although these models and benchmark datasets provide people
with high-quality spelling error correction services in the open
domain, their effectiveness is reduced significantly in some specific
domains, such as the medical domain. The reason is that open-
domain corpora do not contain complex medical terms, and the
spelling of medical terms requires specialized domain knowledge
that ordinary people usually lack [21, 42].
Additionally, Chinese spelling correction for medical terms plays
a crucial role in promoting the standardization and healthy development
of the medical field [43]. Indeed, Chinese spelling correctors
may improve the quality of medical application services,
especially medical entity search systems, by automatically correcting
misspelled medical terms. Specifically, querying a medical entity
query system with a misspelled medical term can return incorrect
answers, leading to user misunderstandings
and even severe medical malpractice. For example, when a doctor’s
manually written electronic medical record contains a misspelled name of a
malignant disease, a patient who queries the term may get results
that lead them to believe they have another illness, delaying patient
care and harming the doctor-patient relationship. This
indicates that spelling errors in the medical field, especially in
the medical entity query scenario, need to be corrected and resolved
urgently [35]. Therefore, we need to find an effective way to correct
spelling mistakes in the medical domain.
To achieve this goal, a straightforward method is to directly
apply advanced open-domain CSC methods [10, 12, 17, 33, 39] to
medical-domain CSC. However, such a method is likely to fail on
the medical CSC task due to the mismatch in the corresponding domain
knowledge. To verify this, we choose an advanced BERT-based CSC
model [15], which is first pre-trained on large-scale automatically
generated CSC data and then fine-tuned on SIGHAN-15. We then
evaluate the model on the test sets of SIGHAN-15 and the
medical-domain dataset proposed in this paper. The experimental results are
shown in Table 1, and it can be seen that such a naive method
shows a significant performance gap between in-domain and out-of-domain
experiments. We conjecture that this is because the
distribution of spelling errors differs significantly between the open
domain and a specific domain. For instance, in Chinese medical
texts, the vast majority of spelling errors occur in complex
and uncommon medical entities, which rarely occur in open-domain
Chinese texts, e.g., SIGHAN-15, which is collected from
TOCFL. In particular, we summarize the errors of medical terms
arXiv:2210.11720v1 [cs.CL] 21 Oct 2022