MCSCSet: A Specialist-annotated Dataset for Medical-domain Chinese Spelling Correction

Wangjie Jiang†, Zhihao Ye‡, Zijing Ou♠, Ruihui Zhao‡, Jianguang Zheng‡, Yi Liu‡, Siheng Li†, Bang Liu♣, Yujiu Yang† and Yefeng Zheng‡

†Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China

‡Tencent Jarvis Lab, Shenzhen, China

♠Sun Yat-sen University, Guangzhou, China

♣Université de Montréal Mila & CIFAR, Québec, Canada

{jwj20,lisiheng21}@mails.tsinghua.edu.cn,{evanzhye,zacharyzhao,jaxzheng,yefengzheng}@tencent.com

{zijingou.mail,97liuyi}@gmail.com,bang.liu@umontreal.ca,yang.yujiu@sz.tsinghua.edu.cn

ABSTRACT

Chinese Spelling Correction (CSC) is gaining increasing attention due to its promise of automatically detecting and correcting spelling errors in Chinese texts. Despite its extensive use in many applications, such as search engines and optical character recognition systems, little has been explored in medical scenarios, in which complex and uncommon medical entities are easily misspelled. Correcting the misspellings of medical entities is arguably more difficult than correcting those in the open domain because it requires specific domain knowledge. In this work, we define the task of Medical-domain Chinese Spelling Correction and propose MCSCSet, a large-scale specialist-annotated dataset that contains about 200k samples. In contrast to existing open-domain CSC datasets, MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, and ii) corresponding misspelled sentences manually annotated by medical specialists. To support automated dataset curation, MCSCSet further offers a medical confusion set consisting of the commonly misspelled characters of given Chinese medical terms. This enables one to create medical misspelling datasets automatically. Extensive empirical studies show significant performance gaps between open-domain and medical-domain spelling correction, highlighting the need to develop high-quality datasets that allow for Chinese spelling correction in specific domains. Moreover, our work benchmarks several representative Chinese spelling correction models, establishing baselines for future work.

1 INTRODUCTION

Misspelled characters frequently occur in manually written Chinese sentences and can easily lead to a wrong understanding of these sentences. We therefore need a corrector that automatically detects and corrects spelling mistakes in text. The task of Chinese Spelling Correction (CSC) is to design such a corrector, which plays a vital role in various Natural Language Processing (NLP) applications such as search engines [ ] and optical character recognition systems [ ]. To achieve efficient error correction, previous work has mainly focused on designing advanced error correction models [ ] and establishing canonical benchmark spelling correction corpora [ ]. For example, SIGHAN-15 [ ], a well-known open-domain spelling correction corpus, was collected from a computer-based Test of Chinese as a Foreign Language (TOCFL). Although these models and benchmark datasets provide high-quality spelling error correction services in the open domain, their effectiveness is reduced significantly in some specific domains, such as the medical domain. The reason is that open-domain corpora do not contain complex medical terms, and the spelling of medical terms requires specialized domain knowledge that ordinary people usually lack [21, 42].

Additionally, Chinese spelling correction for medical terms plays a crucial role in promoting the standardization and healthy development of the medical field [ ]. Indeed, Chinese spelling correctors may improve the quality of medical application services, especially medical entity search systems, by automatically correcting misspelled medical terms. Specifically, querying a medical entity search system with a misspelled medical term returns incorrect answers, leading to user misunderstandings and even severe medical malpractice. For example, when a doctor's manually written electronic medical record contains a misspelling of a malignant disease, a patient who queries the term may get results that lead them to believe they have another illness, delaying patient care and harming the doctor-patient relationship. This indicates that spelling errors in the medical field, especially in the medical entity query scenario, need to be corrected urgently [ ]. Therefore, we need to find an effective way to correct spelling mistakes in the medical domain.

To achieve this goal, a straightforward method is to directly apply advanced open-domain CSC methods [ ] to medical-domain CSC. However, such a method is likely to fail on the medical CSC task due to the mismatch in the required domain knowledge. To verify this, we choose an advanced BERT-based CSC model [ ], which is first pre-trained on large-scale automatically generated CSC data and then fine-tuned on SIGHAN-15. We then evaluate the model on the test sets of SIGHAN-15 and of the medical-domain dataset proposed in this paper. The experimental results are shown in Table 1: this naive method shows a significant performance gap between the in-domain and out-of-domain experiments, with correction-level F1 dropping from 79.53% on SIGHAN-15 to 26.89% on MCSCSet. We conjecture that this is because the distribution of spelling errors differs significantly between the open domain and a specific domain. For instance, in Chinese medical texts, the vast majority of spelling errors occur in complex and uncommon medical entities, which rarely occur in open-domain Chinese texts, e.g., SIGHAN-15, which is collected from TOCFL. In particular, we summarize the errors of medical terms into five categories, of which phonological errors and visual errors are spelling errors, and show corresponding examples in Table 2. We can observe from the table that the errors in the medical domain are not common in the open domain, which highlights the need to develop high-quality datasets that allow for medical-domain Chinese spelling correction.

arXiv:2210.11720v1 [cs.CL] 21 Oct 2022

Conference’17, July 2017, Washington, DC, USA Wangjie Jiang et al.

Table 1: Performance of a well-trained open-domain BERT-based CSC model on detection-level and correction-level tasks. Specifically, the model is first pre-trained on large-scale automatically generated data [15] and then fine-tuned on SIGHAN-15 [25]. We report the model’s performance on the test sets of SIGHAN-15 and the proposed MCSCSet, respectively.

Test Set     Detection-level                   Correction-level
             Prec. (%)   Rec. (%)   F1 (%)     Prec. (%)   Rec. (%)   F1 (%)
SIGHAN-15    79.06       83.73      81.33      77.31       81.89      79.53
MCSCSet      43.83       38.94      41.24      28.58       25.38      26.89

Table 2: Examples of typical Chinese medical entity errors, which can be mainly divided into five categories: i) phonological errors, ii) visual errors, iii) order-confused errors, iv) repeated characters, and v) missing characters. Among the five categories, phonological and visual errors are spelling errors, which are the focus of our study. Erroneous characters are marked in red, and the corresponding pinyin is given in brackets.

Type             Sentence                                      Correction
Phonological     如何闭(bi)孕                                   避(bi)孕
                 how to close pregnancy                        contraception
Visual           胰岛素应该用水箱储存吗                          冰箱
                 should insulin be stored in a water tank      refrigerator
Order-confused   如何处理蜂蜜蛰伤                                蜜蜂
                 how to deal with honey stings                 bee
Redundant        天花粉的症状                                   天花
                 symptoms of trichosanthin                     smallpox
Missing          糖尿病患者能服用葡萄吗                          葡萄糖
                 can diabetics take grapes                     glucose
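To make the five categories of Table 2 concrete, the following minimal sketch labels an (erroneous, corrected) entity pair using only string length and character counts. The function name `classify_error` and the length-based rules are illustrative assumptions, not the annotation procedure used for MCSCSet. The sketch also does not separate phonological from visual errors, since that would need pronunciation and glyph resources that are not described here.

```python
from collections import Counter


def classify_error(wrong: str, correct: str) -> str:
    """Coarsely label an (erroneous, corrected) pair.

    Hypothetical heuristic for illustration only:
    - longer than the correction        -> "redundant"
    - shorter than the correction       -> "missing"
    - same characters, different order  -> "order-confused"
    - same length, different characters -> "substitution"
      (covers both phonological and visual spelling errors)
    """
    if wrong == correct:
        return "no error"
    if len(wrong) > len(correct):
        return "redundant"
    if len(wrong) < len(correct):
        return "missing"
    if Counter(wrong) == Counter(correct):
        return "order-confused"
    return "substitution"


# Entity pairs taken from the rows of Table 2.
table2_pairs = [
    ("闭孕", "避孕"),      # phonological
    ("水箱", "冰箱"),      # visual
    ("蜂蜜", "蜜蜂"),      # order-confused
    ("天花粉", "天花"),    # redundant
    ("葡萄", "葡萄糖"),    # missing
]

for wrong, correct in table2_pairs:
    print(wrong, "->", correct, ":", classify_error(wrong, correct))
```

In these examples the two spelling-error rows (phonological and visual) are the only ones where a character is replaced while the length and the remaining characters stay unchanged.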

Here, we highlight the challenges of building a large-scale Chinese spelling correction benchmark dataset in the medical domain:

(C1) Difficulty of Collecting Real Data: To support medical entity error correction in real application scenarios, annotated datasets must come from real medical scenarios and contain the common error-prone medical entities found among the hundreds of millions of queries generated by real-world applications.

(C2) High Demand for Medical Knowledge: To produce a high-quality medical term (or entity) spelling correction corpus, annotators are required to master specific medical knowledge and maintain high correction quality, which is challenging and time-consuming.

To address the above challenges, we present the Medical Chinese Spelling Correction Dataset (MCSCSet), a large-scale, specialist-annotated dataset for Chinese spelling correction in the medical domain. Specifically, we collect a large-scale query log dataset from a real-world medical application named Tencent Yidian (https://baike.qq.com/) and construct a manually annotated dataset with about 200k samples, in which each sample consists of a correct medical query and its corresponding wrong medical query with spelling errors. MCSCSet also provides a medical confusion set consisting of a large number of error-prone characters from Chinese medical terminology, each with its corresponding erroneous characters. This enables researchers and practitioners to generate new medical-domain CSC datasets for their specific needs by simply replacing characters of medical entities with the misspelled characters defined in the confusion set. To distinguish it from open-domain CSC, we further provide a formal definition of the medical-domain Chinese spelling correction task, which mainly focuses on spelling error correction for medical entities. Moreover, our work benchmarks several Chinese spelling correction models for future comparisons. Overall, our major contributions are summarized as follows:

• Practical Task Definition of Medical-domain Chinese Spelling Correction: We formally define the Chinese spelling correction task in the medical domain for the first time, which applies to all tasks involving user input such as search, question answering, and translation.

• First CSC Dataset for Medical Domain: We provide the first Chinese medical spelling correction dataset, built from the large-scale healthcare encyclopedia software Tencent Yidian and based on the annotation of medical specialists.

• Rich Medical Confusion Set: We present a corresponding medical confusion set, which consists of abundant error-prone medical entities. This allows great flexibility for future usage since one could exploit it to construct a new dataset.

• Rigorous Medical-domain CSC Benchmarking: We benchmark four representative Chinese spelling correction models, which verifies the quality of the proposed MCSCSet dataset and provides reproducible comparisons for future studies.
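The confusion-set idea described above, replacing characters of a medical entity with the misspelled characters listed for them, can be sketched as follows. This is a toy illustration under stated assumptions: the dictionary layout, the two entries (borrowed from the Table 2 examples), and the helper name `corrupt` are hypothetical and do not reflect the released file format or the authors' generation script.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical toy confusion set: correct character -> confusable characters.
CONFUSION_SET: Dict[str, List[str]] = {
    "避": ["闭"],  # phonologically similar (both "bi")
    "冰": ["水"],  # visually similar
}


def corrupt(
    sentence: str,
    entity: str,
    confusion_set: Dict[str, List[str]],
    rng: random.Random,
) -> Optional[Tuple[str, str]]:
    """Build one (wrong, correct) pair by misspelling a single character
    inside `entity`; return None if the entity is absent from the sentence
    or none of its characters appears in the confusion set."""
    start = sentence.find(entity)
    if start == -1:
        return None
    # Positions (within the sentence) of entity characters we can misspell.
    positions = [
        start + i for i, ch in enumerate(entity) if ch in confusion_set
    ]
    if not positions:
        return None
    pos = rng.choice(positions)
    replacement = rng.choice(confusion_set[sentence[pos]])
    wrong = sentence[:pos] + replacement + sentence[pos + 1:]
    return wrong, sentence


rng = random.Random(0)  # fixed seed so that sampling is repeatable
print(corrupt("如何避孕", "避孕", CONFUSION_SET, rng))
print(corrupt("胰岛素应该用冰箱储存吗", "冰箱", CONFUSION_SET, rng))
```

Restricting the replacement to positions inside the entity keeps the generated errors on medical terms, which matches the focus of the task as stated above. Substituting exactly one character per sample is a simplification chosen for this sketch.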

Paper Organization. Section 2 presents background and related work on Chinese spelling correction, including previous CSC algorithms, datasets, and benchmarks. Section 3 presents the problem definition of medical-domain Chinese spelling correction. Section 4 details the construction process of the MCSCSet dataset and presents some statistical analysis. Section 5 describes the benchmarking of representative CSC algorithms, implementation details, and experimental results. Lastly, Section 6 discusses and concludes the paper.

2 RELATED WORK

Chinese Spelling Correction. Chinese Spelling Correction (CSC) is a challenging task in Natural Language Processing (NLP) and plays an important part in various real-world applications, such
