
(Gao, Yao and Chen, 2021). With respect to CSC, Li, Zhou, Li, Li, Liu, Sun, Wang, Li, Cao and Zheng (2022c) were the
first to employ contrastive learning, proposing an error-driven method that guides the model to distinguish right
answers from wrong ones.
Inspired by contrastive learning, in this study we present a simple yet effective CSC framework designed to
enhance the performance of existing CSC models. Whereas the objective of contrastive learning is to pull similar
examples together, we propose instead to pull apart phonetically or visually similar characters, which helps the
models distinguish between confusable characters.
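To make the intuition concrete, a minimal sketch of such a reverse contrastive objective, assuming a margin-based hinge form (an illustrative formulation rather than the exact loss adopted in this paper), could be written as
\[
\mathcal{L}_{\mathrm{RCL}} \;=\; \frac{1}{|\mathcal{P}|} \sum_{(x_i, x_j) \in \mathcal{P}} \max\bigl(0,\; m - \lVert h(x_i) - h(x_j) \rVert_2 \bigr),
\]
where $\mathcal{P}$ denotes a set of phonetically or visually confusable character pairs, $h(\cdot)$ is the encoder's character representation, and $m$ is a margin hyperparameter. Minimizing $\mathcal{L}_{\mathrm{RCL}}$ pushes the representations of confusable characters at least a distance $m$ apart, the reverse of a standard contrastive objective that pulls positive pairs together.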
Our contributions can be summarized as follows:
(1) We extend the idea of contrastive learning to the CSC task and propose a Reverse Contrastive Learning (RCL)
strategy, which results in models that better detect and correct spelling errors involving confusable characters.
(2) We put forward a model-agnostic CSC framework in which our RCL strategy, as a subsidiary component, can
be easily combined with existing CSC models to yield better performance.
(3) The CSC models equipped with our strategy establish a new state of the art on the SIGHAN benchmarks.
2. Related Work
2.1. Chinese Spelling Check
In recent years, research on CSC has mainly focused on two directions: data generation for CSC and CSC-oriented
language models.
Data generation for CSC. Wang et al. (2018) proposed a method to automatically construct a CSC corpus, which
generates visually or phonologically similar characters based on OCR and ASR technology, respectively, thereby
greatly expanding the scale of the corpus. Duan, Pan, Wang, Zhang and Wu (2019) built two corpora: a visual corpus
and a phonological one. While the construction of both corpora makes use of ASR technology, the phonological one
also exploits the conversion between Chinese characters and their pronunciations.
CSC-oriented language model. Recently, researchers have mainly focused on employing language models to
capture information in terms of character similarity and phonological similarity, facilitating the CSC task. These
approaches are dominated by neural network-based models (Cheng, Xu, Chen, Jiang, Wang, Wang, Chu and Qi, 2020;
Ma, Hu, Peng, Zheng and Xu, 2023), especially pre-trained language models. Related studies have explored the potential
semantic modeling capability of pre-trained language models, with BERT being widely utilized as the backbone of
CSC models. However, these methods may lead CSC models to overfit the errors in sentences and disturb the semantic
encoding of sentences, yielding poor generalization (Zhang, Zheng, Yan and Qiu, 2022b; Wu, Zhang,
Zhang and Zhao, 2023). To address these issues, Yang (2023) proposed an n-gram masking layer to alleviate the common
label leakage and error disturbance problems. Another line of work fuses textual, visual and phonetic information
into pre-trained models. Wang, Che, Wu, Wang, Hu and Liu (2021) presented Dynamic Connected Networks (DCN),
a non-autoregressive model that utilizes a Pinyin-enhanced candidate generator to produce candidate Chinese
characters and then models the dependencies between neighboring characters with an attention-based network.
SCOPE (Li, Wang, Mao, Guo, Yang and Zhang, 2022a) enhances CSC performance by imposing an auxiliary
pronunciation prediction task and devising an iterative inference strategy; other CSC models based on multimodal
information include SpellBERT (Ji, Yan and Qiu, 2021), PLOME (Liu, Yang, Yue, Zhang and Wang, 2021) and
ReaLiSe (Xu, Li, Zhou, Li, Wang, Cao, Huang and Mao, 2021). Despite the effectiveness of multimodal information,
some inherent problems remain.
Specifically, since the direct integration of phonetic information may affect the raw text representation and weaken the
effect of the phonetic information, Liang, Quan and Wang (2023) decoupled the textual and phonetic representations and
designed a pinyin-to-character pre-training task to enhance the effect of phonetic knowledge, while introducing a
self-distillation module to prevent the model from overfitting the phonetic information. Moreover, to address the issue
that characters' relative positions are not well aligned in multimodal spaces, Liang, Huang, Li and Shi (2022) presented
a distance-fusion and knowledge-enhanced framework for CSC that captures the relative distances between characters
and reconstructs the semantic representation, while reducing errors caused by unseen fixed expressions.
The aforementioned models introduce character-similarity and phonological-similarity information from the confusion
set but neglect contextual similarity, which is shown to be more valuable by Zhang, Li, Zhou, Ma, Li, Cao and
Zheng (2023), who design a simple yet effective curriculum learning framework that explicitly guides CSC models
to capture the contextual similarity between Chinese characters.