MCSCSet A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction

MCSCSet: A Specialist-annotated Dataset for Medical-domain
Chinese Spelling Correction
Wangjie Jiang, Zhihao Ye, Zijing Ou, Ruihui Zhao, Jianguang Zheng, Yi Liu,
Siheng Li, Bang Liu, Yujiu Yang and Yefeng Zheng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Tencent Jarvis Lab, Shenzhen, China
Sun Yat-sen University, Guangzhou, China
Université de Montréal Mila & CIFAR, Québec, Canada
{jwj20,lisiheng21}@mails.tsinghua.edu.cn,{evanzhye,zacharyzhao,jaxzheng,yefengzheng}@tencent.com
{zijingou.mail,97liuyi}@gmail.com,bang.liu@umontreal.ca,yang.yujiu@sz.tsinghua.edu.cn
ABSTRACT
Chinese Spelling Correction (CSC) is gaining increasing attention
due to its promise of automatically detecting and correcting spelling
errors in Chinese texts. Despite its extensive use in many applications,
such as search engines and optical character recognition systems,
little has been explored in medical scenarios, in which complex
and uncommon medical entities are easily misspelled. Correcting
the misspellings of medical entities is arguably more difficult
than correcting those in the open domain because it requires specific
domain knowledge. In this work, we define the task of Medical-domain
Chinese Spelling Correction and propose MCSCSet, a large-scale
specialist-annotated dataset that contains about 200k samples.
In contrast to the existing open-domain CSC datasets, MCSCSet
involves: i) extensive real-world medical queries collected from
Tencent Yidian, and ii) corresponding misspelled sentences manually
annotated by medical specialists. To support automated dataset
curation, MCSCSet further offers a medical confusion set consisting
of the commonly misspelled characters of given Chinese medical
terms. This enables one to create medical misspelling datasets
automatically. Extensive empirical studies show significant
performance gaps between open-domain and medical-domain
spelling correction, highlighting the need to develop high-quality
datasets that allow for Chinese spelling correction in specific
domains. Moreover, our work benchmarks several representative
Chinese spelling correction models, establishing baselines for future
work.
1 INTRODUCTION
Misspelled characters frequently occur in manually written Chinese
sentences and easily lead to a wrong understanding of these sentences.
We therefore need a corrector that automatically detects and corrects
spelling mistakes in text. The task of Chinese Spelling Correction
(CSC) is to design such a corrector, which plays a vital role in
various Natural Language Processing (NLP) applications such as search
engines [19] and optical character recognition systems [1]. To achieve
efficient error correction, previous work has mainly focused on
designing advanced error correction models [3, 11, 36, 40] and
establishing canonical benchmark spelling correction corpora
[17, 25, 30, 37]. For example, SIGHAN-15 [25], a well-known
open-domain spelling correction corpus, was collected from a
computer-based Test of Chinese as a Foreign Language (TOCFL).
Although these models and benchmark datasets provide people
with high-quality spelling error correction services in the open
domain, their effectiveness is reduced significantly in some specific
domains, such as the medical domain. The reason is that open-domain
corpora do not contain complex medical terms, and the
spelling of medical terms requires specialized domain knowledge
that ordinary people usually lack [21, 42].
Additionally, Chinese spelling correction for medical terms plays
a crucial role in promoting the standardization and healthy
development of the medical field [43]. Indeed, Chinese spelling
correctors may improve the quality of medical application services,
especially medical entity search systems, by automatically correcting
misspelled medical terms. Specifically, querying a medical entity
search system with a misspelled medical term returns incorrect
answers, which can lead to user misunderstandings and even severe
medical malpractice. For example, when a doctor's manually written
electronic medical record misspells the name of a malignant disease,
a patient who queries the term may get results that suggest a
different illness, which delays patient care and harms the
doctor-patient relationship. This indicates that spelling errors in
the medical field, especially in the medical entity query scenario,
need to be corrected urgently [35]. Therefore, we need to find an
effective way to correct spelling mistakes in the medical domain.
To achieve this goal, a straightforward method is to directly
apply advanced methods [10, 12, 17, 33, 39] from open-domain CSC to
medical-domain CSC. However, such a method is likely to fail on
the medical CSC task due to the mismatch in domain knowledge.
To verify this, we choose an advanced BERT-based CSC model [15],
which is first pre-trained on large-scale automatically-generated
CSC data and then fine-tuned on SIGHAN-15. We then evaluate the
model on the test sets of SIGHAN-15 and of the medical-domain
dataset proposed in this paper. The experimental results are
shown in Table 1: this naive method shows a significant performance
gap between the in-domain and out-of-domain experiments. We
conjecture that this is because the distribution of spelling errors
differs significantly between the open domain and a specific domain.
For instance, in Chinese medical texts, the vast majority of spelling
errors occur in complex and uncommon medical entities, which rarely
occur in open-domain Chinese texts, e.g., SIGHAN-15, which is
collected from TOCFL. In particular, we summarize the errors of
medical terms
arXiv:2210.11720v1 [cs.CL] 21 Oct 2022
Conference’17, July 2017, Washington, DC, USA Wangjie Jiang et al.
Table 1: Performance of a well-trained open-domain BERT-based CSC model on detection-level and correction-level tasks.
Specifically, the model is first pre-trained on large-scale automatically-generated data [15] and then fine-tuned on SIGHAN-15
[25]. We report the model's performance on the test sets of SIGHAN-15 and the proposed MCSCSet, respectively.
Test Set     Detection-level                  Correction-level
             Prec. (%)  Rec. (%)  F1 (%)      Prec. (%)  Rec. (%)  F1 (%)
SIGHAN-15    79.06      83.73     81.33       77.31      81.89     79.53
MCSCSet      43.83      38.94     41.24       28.58      25.38     26.89
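The F1 columns in Table 1 can be checked against the precision and recall columns. The sketch below assumes the usual definition of F1 as the harmonic mean of precision and recall; this section does not spell out the formula, so that is an assumption, and the names `f1_score` and `TABLE_1` are illustrative only. The numbers are copied from Table 1.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from Table 1.
TABLE_1 = {
    ("SIGHAN-15", "detection"): (79.06, 83.73, 81.33),
    ("SIGHAN-15", "correction"): (77.31, 81.89, 79.53),
    ("MCSCSet", "detection"): (43.83, 38.94, 41.24),
    ("MCSCSet", "correction"): (28.58, 25.38, 26.89),
}

for (test_set, level), (prec, rec, reported) in TABLE_1.items():
    recomputed = round(f1_score(prec, rec), 2)
    print(f"{test_set:10s} {level:10s} recomputed={recomputed:.2f} reported={reported:.2f}")
```

Under this assumption, the recomputed values agree with the reported F1 scores to two decimal places, and the drop from SIGHAN-15 to MCSCSet is visible at both the detection and the correction level.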
Table 2: Examples of typical Chinese medical entity errors,
which can be mainly divided into five categories: i) phonological
errors, ii) visual errors, iii) order-confused errors,
iv) repeated characters, and v) missing characters. Among
the five categories, phonological and visual errors belong to
spelling errors, which are the focus of our study. Erroneous
characters are marked in red, and the corresponding phonics
are given in brackets.
Type            Sentence                                   Correction
Phonological    how to close pregnancy (bi)                contraception (bi)
Visual          should insulin stored in water tank        refrigerator
Order-confused  how to deal with honey (蜂蜜) stings       bee (蜜蜂)
Redundant       symptoms of trichosanthin                  smallpox
Missing         can diabetics take grapes (葡萄)           glucose
into five categories, of which the phonological errors and the visual
errors belong to spelling errors, and show their corresponding
examples in Table 2. We can observe from the table that the errors
in the medical domain are not common in the open domain, which
highlights the need to develop high-quality datasets that allow for
medical-domain Chinese spelling correction.
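The five categories in Table 2 can be partly told apart from string shape alone. The following is a minimal sketch, not the authors' annotation procedure: the function name `classify_error` and its labels are hypothetical, and the example strings other than 蜂蜜/蜜蜂 are illustrative inputs rather than dataset samples. Distinguishing phonological from visual errors needs pronunciation and glyph-similarity resources, so the sketch groups both under one substitution label.

```python
def _is_subsequence(short: str, long: str) -> bool:
    """True if `short` can be obtained from `long` by deleting characters."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def classify_error(wrong: str, correct: str) -> str:
    """Coarse error category for an (erroneous, corrected) entity pair."""
    if wrong == correct:
        return "no error"
    if len(wrong) == len(correct):
        # Same characters in a different order: order-confused.
        if sorted(wrong) == sorted(correct):
            return "order-confused"
        # Same length, different characters: a spelling error
        # (phonological or visual; not distinguished here).
        return "substitution"
    if len(wrong) > len(correct) and _is_subsequence(correct, wrong):
        return "redundant"
    if len(wrong) < len(correct) and _is_subsequence(wrong, correct):
        return "missing"
    return "other"


print(classify_error("蜂蜜", "蜜蜂"))        # pair from Table 2
print(classify_error("葡萄", "葡萄糖"))      # illustrative input
print(classify_error("葡萄糖糖", "葡萄糖"))  # illustrative input
print(classify_error("胰岛索", "胰岛素"))    # illustrative input
```

Only the substitution label corresponds to the spelling errors that the paper focuses on; the other three labels cover the order-confused, redundant, and missing cases.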
Here, we highlight the challenges of building a large-scale Chi-
nese spelling correction benchmark dataset in the medical domain
as follows:
(C1) Difficulty of Collecting Real Data: To provide medical entity
error correction in real application scenarios, annotated datasets
must come from real medical scenarios and contain the common
error-prone medical entities found among the hundreds of millions
of queries generated by real-world applications.
(C2) High Demand for Medical Knowledge: To produce a high-quality
medical term (or entity) spelling correction corpus, annotators
are required to master specific medical knowledge and maintain
high correction quality, which is a challenging and time-consuming
task.
To address the above challenges, in this paper we present the
Medical Chinese Spelling Correction Dataset (MCSCSet), a large-scale,
specialist-annotated dataset for Chinese spelling correction in the
medical domain. Notably, we collect a large-scale query log dataset
from a real-world medical application named Tencent Yidian
(https://baike.qq.com/) and construct a manually annotated dataset
with about 200k samples, in which each sample consists of a correct
medical query and its corresponding wrong medical query with
spelling errors. MCSCSet also provides a medical confusion set,
consisting of a large number of error-prone characters from Chinese
medical terminologies, each with its corresponding erroneous
characters. This enables researchers or practitioners to generate
new medical-domain CSC datasets based on their specific needs by
simply replacing characters in medical entities with the misspelled
characters defined in the confusion set. To distinguish it from
open-domain CSC, we further provide a formal definition of the
medical-domain Chinese spelling correction task, mainly focusing on
spelling error correction for medical entities. Moreover, our work
benchmarks several Chinese spelling correction models for future
comparisons. Overall, the following components summarize our major
contributions:
• Practical Task Definition of Medical-domain Chinese
  Spelling Correction: We formally define the Chinese spelling
  correction task in the medical domain for the first time,
  which applies to all tasks involving user input such as search,
  question answering, and translation.
• First CSC Dataset for Medical Domain: We provide the
  first Chinese medical spelling correction dataset from the
  large-scale healthcare encyclopedia software Tencent Yidian,
  based on the annotation of medical specialists.
• Rich Medical Confusion Set: We present a corresponding
  medical confusion set, which consists of abundant error-prone
  medical entities. This allows great flexibility for future
  usage since one could exploit it to construct a new dataset.
• Rigorous Medical-domain CSC Benchmarking: We benchmark
  four representative Chinese spelling correction models,
  which verifies the quality of the proposed MCSCSet dataset
  and provides reproducible comparisons for future studies.
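To make the confusion-set idea concrete, the sketch below generates a (wrong query, correct query) pair by replacing one character of a medical entity with a confusable character. It is a sketch under stated assumptions, not the released tooling: the dictionary layout of `CONFUSION_SET`, its single entry, the example query, and the function name `make_sample` are all hypothetical, since this section does not describe the file format of the MCSCSet confusion set.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical confusion set: error-prone character -> erroneous characters.
# The entry below is illustrative and not taken from MCSCSet.
CONFUSION_SET: Dict[str, List[str]] = {"素": ["索"]}


def make_sample(
    query: str,
    entity: str,
    confusion_set: Dict[str, List[str]],
    rng: random.Random,
) -> Optional[Tuple[str, str]]:
    """Return (wrong_query, correct_query), or None if nothing can be replaced."""
    start = query.find(entity)
    if start == -1:
        return None  # the entity does not occur in the query
    # Positions inside the entity whose character has confusable alternatives.
    positions = [i for i, ch in enumerate(entity) if ch in confusion_set]
    if not positions:
        return None
    offset = rng.choice(positions)
    wrong_char = rng.choice(confusion_set[entity[offset]])
    pos = start + offset
    wrong_query = query[:pos] + wrong_char + query[pos + 1:]
    return wrong_query, query


rng = random.Random(0)  # seeded so the sketch is reproducible
print(make_sample("胰岛素怎么保存", "胰岛素", CONFUSION_SET, rng))
```

Replacing a single character keeps the wrong and correct queries the same length, which matches the substitution-style spelling errors (phonological and visual) that the task targets.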
Paper Organization.
Section 2 presents background and related
work on Chinese spelling correction, including previous CSC
algorithms, datasets, and benchmarks. In Section 3, we present the
definition of the problem of medical-domain Chinese spelling
correction. In Section 4, we provide details on the construction
process of the MCSCSet dataset and present some statistical analysis.
Section 5 provides specifics of benchmarking representative CSC
algorithms, implementation details, and experimental results. Lastly,
Section 6 discusses and concludes the paper.
2 RELATED WORK
Chinese Spelling Correction.
Chinese Spelling Correction (CSC)
is a challenging task in Natural Language Processing (NLP) and
plays an important part in various real-world applications, such