MCSCSet: A Specialist-annotated Dataset for Medical-domain
Chinese Spelling Correction
Wangjie Jiang†, Zhihao Ye‡, Zijing Ou♠, Ruihui Zhao‡, Jianguang Zheng‡, Yi Liu‡,
Siheng Li†, Bang Liu♣, Yujiu Yang† and Yefeng Zheng‡
†Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
‡Tencent Jarvis Lab, Shenzhen, China
♠Sun Yat-sen University, Guangzhou, China
♣Université de Montréal Mila & CIFAR, Québec, Canada
{jwj20,lisiheng21}@mails.tsinghua.edu.cn,{evanzhye,zacharyzhao,jaxzheng,yefengzheng}@tencent.com
{zijingou.mail,97liuyi}@gmail.com,bang.liu@umontreal.ca,yang.yujiu@sz.tsinghua.edu.cn
ABSTRACT
Chinese Spelling Correction (CSC) is gaining increasing attention
due to its promise of automatically detecting and correcting spelling
errors in Chinese texts. Despite its extensive use in many applications,
such as search engines and optical character recognition systems,
little has been explored in medical scenarios, in which complex
and uncommon medical entities are easily misspelled. Correcting
the misspellings of medical entities is arguably more difficult
than correcting those in the open domain because it requires specific
domain knowledge. In this work, we define the task of Medical-domain
Chinese Spelling Correction and propose MCSCSet, a large-scale
specialist-annotated dataset that contains about 200k samples.
In contrast to the existing open-domain CSC datasets, MCSCSet
involves: i) extensive real-world medical queries collected from
Tencent Yidian, and ii) corresponding misspelled sentences manually
annotated by medical specialists. To support automated dataset
curation, MCSCSet further offers a medical confusion set consisting
of the commonly misspelled characters of given Chinese medical
terms. This enables one to create a medical misspelling dataset
automatically. Extensive empirical studies show significant
performance gaps between open-domain and medical-domain
spelling correction, highlighting the need to develop high-quality
datasets that allow for Chinese spelling correction in specific
domains. Moreover, our work benchmarks several representative
Chinese spelling correction models, establishing baselines for future
work.
1 INTRODUCTION
Misspelled characters frequently occur in manually written Chinese
sentences and can easily lead to a wrong understanding of these
sentences. We therefore need a corrector that automatically detects
and corrects spelling mistakes in text. The task of Chinese Spelling
Correction (CSC) is to design such a corrector, which plays a vital
role in various Natural Language Processing (NLP) applications such as
search engines [19] and optical character recognition systems [1]. To
achieve efficient error correction, previous work has mainly focused
on designing advanced error correction models [3, 11, 36, 40] and
establishing canonical benchmark spelling correction corpora
[17, 25, 30, 37]. For example, SIGHAN-15 [25], a well-known
open-domain spelling correction corpus, was collected from a
computer-based Test of Chinese as a Foreign Language (TOCFL).
Although these models and benchmark datasets provide people
with high-quality spelling error correction services in the open
domain, their effectiveness is reduced significantly in some specific
domains, such as the medical domain. The reason is that open-domain
corpora do not contain complex medical terms, and the
spelling of medical terms requires specialized domain knowledge
that ordinary people usually lack [21, 42].
Additionally, Chinese spelling correction for medical terms plays
a crucial role in promoting the standardization and healthy
development of the medical field [43]. Indeed, Chinese spelling
correctors may improve the quality of medical application services,
especially medical entity search systems, by automatically correcting
misspelled medical terms. Specifically, querying a medical entity
query system with misspelled medical terms yields incorrect answers,
leading to user misunderstandings and even severe medical malpractice.
For example, when a doctor's manually written electronic medical
record misspells the name of a malignant disease, a patient who
queries the term may get results that lead them to believe they have
another illness, which delays patient care and harms the
doctor-patient relationship. This indicates that spelling errors in
the medical field, especially in the medical entity query scenario,
need to be corrected and resolved urgently [35]. Therefore, we need to
find an effective way to correct spelling mistakes in the medical
domain.
To achieve this goal, a straightforward method is to directly
apply advanced open-domain CSC methods [10, 12, 17, 33, 39] to
medical-domain CSC. However, such a method is likely to fail on
the medical CSC task due to the offset in the corresponding domain
knowledge. To verify this, we choose an advanced BERT-based CSC
model [15], which is first pre-trained on large-scale automatically
generated CSC data and then fine-tuned on SIGHAN-15. We then
evaluate the model on the test sets of SIGHAN-15 and of the
medical-domain dataset proposed in this paper. The experimental
results are shown in Table 1: this naive method shows a significant
performance gap between in-domain and out-of-domain experiments, with
the correction-level F1 dropping from 79.53% on SIGHAN-15 to 26.89%
on MCSCSet. We conjecture that this is because the
distribution of spelling errors differs significantly between the open
domain and a specific domain. For instance, in Chinese medical
texts, the vast majority of spelling errors occur in complex
and uncommon medical entities, which rarely occur in open-domain
Chinese texts such as SIGHAN-15, which is collected from
TOCFL. In particular, we summarize the errors of medical terms
arXiv:2210.11720v1 [cs.CL] 21 Oct 2022
Conference’17, July 2017, Washington, DC, USA Wangjie Jiang et al.
Table 1: Performance of a well-trained open-domain BERT-based CSC model on detection-level and correction-level tasks.
Specifically, the model is first pre-trained on large-scale automatically generated data [15] and then fine-tuned on SIGHAN-15
[25]. We report the model's performance on the test sets of SIGHAN-15 and the proposed MCSCSet, respectively.

Test Set     Detection-level                  Correction-level
             Prec. (%)  Rec. (%)  F1 (%)      Prec. (%)  Rec. (%)  F1 (%)
SIGHAN-15    79.06      83.73     81.33       77.31      81.89     79.53
MCSCSet      43.83      38.94     41.24       28.58      25.38     26.89
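
As a quick consistency check on Table 1, the F1 columns can be recomputed from the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function and variable names are illustrative and are not part of the MCSCSet release.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 1.
TABLE_1 = {
    ("SIGHAN-15", "detection"): (79.06, 83.73, 81.33),
    ("SIGHAN-15", "correction"): (77.31, 81.89, 79.53),
    ("MCSCSet", "detection"): (43.83, 38.94, 41.24),
    ("MCSCSet", "correction"): (28.58, 25.38, 26.89),
}


def recomputed_f1() -> dict:
    """Recompute F1 for every row and level of Table 1, rounded to 2 decimals."""
    return {key: round(f1_score(p, r), 2) for key, (p, r, _) in TABLE_1.items()}


for (test_set, level), value in recomputed_f1().items():
    reported = TABLE_1[(test_set, level)][2]
    print(f"{test_set:10s} {level:10s} recomputed={value:.2f} reported={reported:.2f}")
```

Under this definition the recomputed values agree with the reported F1 scores to two decimals, and the correction-level gap between the two test sets is 79.53 - 26.89 = 52.64 F1 points.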
Table 2: Examples of typical Chinese medical entity errors,
which can be mainly divided into five categories: i) phonological
errors, ii) visual errors, iii) order-confused errors,
iv) repeated characters, and v) missing characters. Among
the five categories, phonological and visual errors belong to
spelling errors, which are the focus of our study. Erroneous
characters are marked in red, and the corresponding phonics
are given in brackets.

Type            Sentence                                    Correction
Phonological    如何闭(bi)孕                                 避(bi)孕
                how to close pregnancy                      contraception
Visual          胰岛素应该用水箱储存吗                        冰箱
                should insulin be stored in a water tank    refrigerator
Order-confused  如何处理蜂蜜蛰伤                              蜜蜂
                how to deal with honey stings               bee
Redundant       天花粉的症状                                  天花
                symptoms of trichosanthin                   smallpox
Missing         糖尿病患者能服用葡萄吗                        葡萄糖
                can diabetics take grapes                   glucose
into five categories, of which the phonological errors and the visual
errors belong to spelling errors, and show their corresponding
examples in Table 2. We can observe from the table that the errors
in the medical domain are not common in the open domain, which
highlights the need to develop high-quality datasets that allow for
medical-domain Chinese spelling correction.
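
To make the five categories concrete, the following sketch assigns a coarse error type to a pair of erroneous and correct entities using only string length and character comparisons. It is an illustrative heuristic and not the annotation procedure used for MCSCSet. Phonological and visual errors are merged into a single "spelling" label because telling them apart would require pinyin or glyph-similarity resources that are not assumed here. The function names are hypothetical.

```python
from collections import Counter


def is_subsequence(short: str, long: str) -> bool:
    """Return True if all characters of `short` appear in `long` in order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def coarse_error_type(wrong: str, correct: str) -> str:
    """Heuristically label the error that turns `correct` into `wrong`."""
    if wrong == correct:
        return "none"
    if len(wrong) == len(correct):
        # Same characters in a different order: order-confused error.
        if Counter(wrong) == Counter(correct):
            return "order-confused"
        # Same length with substituted characters: phonological or visual.
        return "spelling"
    if len(wrong) > len(correct) and is_subsequence(correct, wrong):
        return "redundant"
    if len(wrong) < len(correct) and is_subsequence(wrong, correct):
        return "missing"
    return "other"


# Entity pairs (wrong, correct) taken from Table 2.
EXAMPLES = [
    ("闭孕", "避孕"),
    ("水箱", "冰箱"),
    ("蜂蜜", "蜜蜂"),
    ("天花粉", "天花"),
    ("葡萄", "葡萄糖"),
]

for wrong, correct in EXAMPLES:
    print(wrong, "->", correct, ":", coarse_error_type(wrong, correct))
```

Only pairs labeled "spelling" keep the sentence length unchanged and replace individual characters, which matches the paper's focus on phonological and visual errors.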
Here, we highlight the challenges of building a large-scale Chinese
spelling correction benchmark dataset in the medical domain
as follows:
(C1) Difficulty of Collecting Real Data: To provide the
service of medical entity error correction in real application
scenarios, annotated datasets must come from real medical scenarios and
contain common error-prone medical entities among the hundreds
of millions of queries generated by real-world applications.
(C2) High Demand for Medical Knowledge: To produce a high-quality
medical term (or entity) spelling correction corpus, annotators
are required to master specific medical knowledge and maintain
high correction quality, which is a challenging and time-consuming
task.
To address the above challenges, in this paper, we present the Medical
Chinese Spelling Correction Dataset (MCSCSet), a large-scale and
specialist-annotated dataset for Chinese spelling correction in the
medical domain. Notably, we collect a large-scale query log dataset
from a real-world medical application named Tencent Yidian
(https://baike.qq.com/) and
construct a manually annotated dataset with about 200k samples,
in which each sample consists of a correct medical query and its
corresponding wrong medical query with spelling errors. MCSCSet
also provides a medical confusion set, consisting of a large number
of error-prone characters from Chinese medical terminologies, each
with its corresponding erroneous characters. This enables potential
researchers or practitioners to generate new medical-domain CSC
datasets based on their specific needs by simply replacing characters
in medical entities with the misspelled characters defined in the
confusion set. To distinguish it from open-domain CSC, we further
provide a formal definition of the medical-domain Chinese spelling
correction task, mainly focusing on spelling error correction for
medical entities. Moreover, our work benchmarks several Chinese spelling
correction models for future comparisons. Overall, the following
components summarize our major contributions:
• Practical Task Definition of Medical-domain Chinese
Spelling Correction: We formally define the Chinese spelling
correction task in the medical domain for the first time,
which applies to all tasks involving user input such as search,
question answering, and translation.
• First CSC Dataset for Medical Domain: We provide the
first Chinese medical spelling correction dataset from the
large-scale healthcare encyclopedia software Tencent Yidian,
based on the annotation of medical specialists.
• Rich Medical Confusion Set: We present a corresponding
medical confusion set, which consists of abundant error-prone
medical entities. This allows great flexibility for future
usage since one could exploit it to construct a new dataset.
• Rigorous Medical-domain CSC Benchmarking: We benchmark
four representative Chinese spelling correction models,
which verify the quality of the proposed MCSCSet dataset
and provide reproducible comparisons for future studies.
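
The confusion-set-based dataset construction mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the confusion set is modeled as a mapping from a correct character to a list of commonly confused characters, the two entries are toy examples taken from Table 2, and `make_sample` is a hypothetical helper. The actual file format and contents of the MCSCSet confusion set may differ.

```python
import random
from typing import Dict, List, Optional, Tuple

# Toy confusion set (assumption): correct character -> confusable characters.
TOY_CONFUSION_SET: Dict[str, List[str]] = {
    "避": ["闭"],  # phonological confusion, both read "bi"
    "冰": ["水"],  # visual confusion
}


def make_sample(
    sentence: str,
    entity: str,
    confusion_set: Dict[str, List[str]],
    rng: random.Random,
) -> Optional[Tuple[str, str]]:
    """Corrupt one character of `entity` inside `sentence`.

    Returns a (wrong_sentence, correct_sentence) pair, or None when the
    entity is absent or none of its characters is in the confusion set.
    """
    start = sentence.find(entity)
    if start == -1:
        return None
    # Positions inside the entity whose character has known confusions.
    positions = [
        start + offset
        for offset, ch in enumerate(entity)
        if ch in confusion_set
    ]
    if not positions:
        return None
    pos = rng.choice(positions)
    wrong_char = rng.choice(confusion_set[sentence[pos]])
    # Substitution keeps the sentence length unchanged.
    wrong_sentence = sentence[:pos] + wrong_char + sentence[pos + 1:]
    return wrong_sentence, sentence


rng = random.Random(0)
print(make_sample("如何避孕", "避孕", TOY_CONFUSION_SET, rng))
print(make_sample("胰岛素应该用冰箱储存吗", "冰箱", TOY_CONFUSION_SET, rng))
```

Because only characters inside the given medical entity are replaced, the generated errors stay concentrated on medical terms, which is where the paper reports most medical-domain misspellings occur.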
Paper Organization. Section 2 presents background and related
work on Chinese spelling correction, including previous CSC
algorithms, datasets, and benchmarks. In Section 3, we present the
definition of the problem of medical-domain Chinese spelling
correction. In Section 4, we provide details on the construction
process of the MCSCSet dataset and present some statistical analysis.
Section 5 provides specifics of benchmarking representative CSC
algorithms, implementation details, and experimental results. Lastly,
Section 6 discusses and concludes the paper.
2 RELATED WORK
Chinese Spelling Correction. Chinese Spelling Correction (CSC)
is a challenging task in Natural Language Processing (NLP) and
plays an important part in various real-world applications, such