A Chinese Spelling Check Framework Based on Reverse Contrastive Learning

Nankai Lin (a), Hongyan Wu (b), Sihui Fu (b), Shengyi Jiang (b,c,*) and Aimin Yang (a,*)
(a) School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510000, China
(b) School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, 510000, China
(c) Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangzhou, 510000, China
* Corresponding author. Email: neakail@outlook.com (N. Lin); jiangshengyi@163.com (S. Jiang); amyang@gdut.edu.cn (A. Yang). ORCID: 0000-0003-2838-8273 (N. Lin)
ARTICLE INFO
Keywords:
Chinese Spelling Check
Reverse Contrastive Learning
Confusable Characters
Model-agnostic
ABSTRACT
Chinese spelling check is the task of detecting and correcting spelling mistakes in Chinese text. Existing research aims to enhance text representation and use multi-source information to improve the detection and correction capabilities of models, but pays little attention to improving their ability to distinguish between confusable characters. Contrastive learning, whose aim is to minimize the distance in representation space between similar sample pairs, has recently become a dominant technique in natural language processing. Inspired by contrastive learning, we present a novel framework for Chinese spelling check, which consists of three modules: language representation, spelling check and reverse contrastive learning. Specifically, we propose a reverse contrastive learning strategy, which explicitly forces the model to minimize the agreement between similar examples, namely, phonetically and visually confusable characters. Experimental results show that our framework is model-agnostic and can be combined with existing Chinese spelling check models to yield state-of-the-art performance.
1. Introduction
Chinese spelling check (CSC) is an important natural language processing (NLP) task which lays the foundation for many downstream NLP applications, such as optical character recognition (OCR) (Wang, Song, Li, Han and Zhang, 2018; Hong, Yu, He, Liu and Liu, 2019) and automated essay scoring (Uto, Xie and Ueno, 2020). Meanwhile, it is a challenging task that demands human-level competence in natural language understanding (Liu, Lai, Chuang and Lee, 2010; Liu, Cheng, Luo, Duh and Matsumoto, 2013; Xin, Zhao, Wang and Jia, 2014). Most recent successes on this task are achieved by non-autoregressive models like BERT, since the length of the output needs to be exactly the same as that of the input and each character in the source sequence shares the same position with its counterpart in the target.
As to the classification of Chinese characters, some of them are pictographs while most are semantic-phonetic compound characters (Norman, 1988). Consequently, though it may be impossible to enumerate all spelling mistakes, the error patterns can still be roughly summarized as visual or phonetic errors (Chang, 1995). In fact, according to prior statistics, over 80% of all spelling mistakes are related to phonetic resemblance between characters. If a CSC model could reliably distinguish between phonetically and visually confusable characters, it would be of great help in correcting spelling errors. However, few current methods make use of phonetic and visual information to tackle the confusable character issue.
On the other hand, self-supervised representation learning has advanced significantly due to the application of contrastive learning (Chen, Kornblith, Norouzi and Hinton, 2020; Henaff, 2020; Oord, Li and Vinyals, 2018; Wu, Xiong, Yu and Lin, 2018), whose main idea is to train a model to maximize the agreement between a target example (“anchor”) and a similar (“positive”) example in embedding space, while maximizing the disagreement between this target and other dissimilar (“negative”) examples. Khosla, Teterwak, Wang, Sarna, Tian, Isola, Maschinot, Liu and Krishnan (2020) were the first to apply contrastive learning in the fully-supervised setting, namely Supervised Contrastive Learning (SCL), which effectively leverages label information to distinguish the positive and negative examples of an anchor. Contrastive learning has brought improvements to many NLP tasks, including aspect sentiment classification (Ke, Liu, Xu and Shu, 2021), text classification (Suresh and Ong, 2021) and semantic textual similarity
(Gao, Yao and Chen, 2021). With respect to CSC, Li, Zhou, Li, Li, Liu, Sun, Wang, Li, Cao and Zheng (2022c) were the first to employ contrastive learning; they proposed an error-driven method to guide the model to distinguish right from wrong answers.
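To make the standard objective concrete, the following is a minimal PyTorch sketch of an InfoNCE-style contrastive loss with in-batch negatives, as described above; the function name, tensor shapes and temperature value are illustrative assumptions rather than details taken from any of the cited works.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.1):
    """Standard contrastive loss: pull each anchor towards its own positive
    and push it away from every other (negative) example in the batch.

    anchors, positives: tensors of shape (batch_size, hidden_dim), where the
    i-th positive corresponds to the i-th anchor.
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    # Similarity of every anchor with every candidate in the batch.
    logits = anchors @ positives.t() / temperature            # (B, B)
    # Diagonal entries are the true (positive) matches; off-diagonal
    # entries act as in-batch negatives.
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)
```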
Enlightened by contrastive learning, in this study we present a simple yet effective CSC framework designed to enhance the performance of existing CSC models. While the objective of contrastive learning is to pull together similar examples, we propose instead to pull apart phonetically or visually similar characters, which helps models distinguish between confusable characters.
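As an informal illustration of this “pulling apart” intuition (not the exact loss defined later in the paper), the sketch below penalizes high representation similarity between a character and its confusable counterparts; the function name and margin value are our own assumptions.

```python
import torch.nn.functional as F

def reverse_contrastive_penalty(char_emb, confusable_emb, margin=0.3):
    """Penalty that grows as a character's representation becomes more
    similar to that of a phonetically or visually confusable character.

    char_emb, confusable_emb: (num_pairs, hidden_dim) embeddings of
    confusable character pairs, e.g. drawn from a confusion set.
    """
    sim = F.cosine_similarity(char_emb, confusable_emb, dim=-1)  # (num_pairs,)
    # Only penalize pairs whose similarity exceeds the margin, so confusable
    # characters are pushed apart without being forced arbitrarily far apart.
    return F.relu(sim - margin).mean()
```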
Our contributions could be summarized as follows:
(1) We extend the idea of contrastive learning to the CSC task and propose a Reverse Contrastive Learning (RCL) strategy, which results in models that better detect and correct spelling errors related to confusable characters.
(2) We put forward a model-agnostic CSC framework, in which our RCL strategy, as a subsidiary component, could
be easily combined with existing CSC models to yield better performance.
(3) The CSC models equipped with our strategy establish new state-of-the-art on SIGHAN benchmarks.
2. Related Work
2.1. Chinese Spelling Check
In recent years, research on CSC has mainly focused on two directions: data generation for CSC and CSC-oriented language models.
Data generation for CSC. Wang et al. (2018) proposed a method to automatically construct a CSC corpus, which
generates visually or phonologically similar characters based on OCR and ASR recognition technology, respectively,
thereby greatly expanding the scale of the corpus. Duan, Pan, Wang, Zhang and Wu (2019) built two corpora: a visual corpus and a phonological one. While the construction of both corpora makes use of ASR technology, the phonological one also uses the conversion between Chinese characters and their pronunciations.
CSC-oriented language model. Recently, researchers have mainly focused on employing language models to
capture information in terms of character similarity and phonological similarity, facilitating the CSC task. The models
are dominated by neural network-based models (Cheng, Xu, Chen, Jiang, Wang, Wang, Chu and Qi, 2020; Ma, Hu, Peng, Zheng and Xu, 2023), especially pre-trained language models. Related studies have explored the potential semantic modeling capability of pre-trained language models, with BERT being widely utilized as the backbone of CSC models. However, these methods may lead to overfitting of CSC models to the errors in sentences and to disturbance of the semantic encoding of sentences, yielding poor generalization (Zhang, Zheng, Yan and Qiu, 2022b; Wu, Zhang, Zhang and Zhao, 2023). To address these issues, Yang (2023) proposed an n-gram masking layer to alleviate the common label leakage and error disturbance problems. Regarding another line of work that fuses textual, visual and phonetic information into pre-trained models, Wang, Che, Wu, Wang, Hu and Liu (2021) presented the Dynamic Connected Networks (DCN) based on the non-autoregressive model, which utilizes a Pinyin-enhanced candidate generator to generate candidate Chinese characters and then models the dependencies between two neighboring Chinese characters using an attention-based network. SCOPE (Li, Wang, Mao, Guo, Yang and Zhang, 2022a) enhances the performance of the CSC model by imposing an auxiliary pronunciation prediction task and devising an iterative inference strategy; similar CSC models based on multimodal information include SpellBERT (Ji, Yan and Qiu, 2021), PLOME (Liu, Yang, Yue, Zhang and Wang, 2021) and ReaLiSe (Xu, Li, Zhou, Li, Wang, Cao, Huang and Mao, 2021). Despite the effectiveness of multimodal information, there are still some inherent problems.
Specifically, given that direct integration of phonetic information may affect the raw text representation and weaken the effect of phonetic information, Liang, Quan and Wang (2023) decoupled the text and phonetic representations and designed a pinyin-to-character pre-training task to enhance the effect of phonetic knowledge, while introducing a self-distillation module to prevent the model from overfitting to phonetic information. Moreover, to address the issue of characters' relative positions not being well aligned in multimodal spaces, Liang, Huang, Li and Shi (2022) presented a distance fusion and knowledge enhanced framework for CSC that captures relative distances between characters and reconstructs a semantic representation, while reducing errors caused by unseen fixed expressions.
The aforesaid models introduce character similarity and phonological similarity information from the confusion set but neglect contextual similarity, which is shown to be more valuable in the research conducted by Zhang, Li, Zhou, Ma, Li, Cao and Zheng (2023), where a simple yet effective curriculum learning framework is designed to explicitly guide CSC models to capture the contextual similarity between Chinese characters.
Figure 1: The proposed Chinese spelling check framework.
2.2. Contrastive Learning for Chinese Spelling Check
Although contrastive learning has promoted various NLP applications, directly applying it to CSC tasks has limitations. One difficulty lies in constructing suitable examples using data augmentation or existing labels. Zhang, Yan, Yu and Qiu (2022a) employed a self-distillation method to construct positive samples for contrastive learning, pushing the hidden states of erroneous tokens closer to those of the corresponding correct tokens to learn better feature representations. The LEAD framework (Li, Ma, Zhou, Li, Li, Huang, Liu, Li, Cao and Zheng, 2022b) guides CSC models to learn better phonetic, visual and definition knowledge from a dictionary by exploiting unified contrastive learning, where positive and negative samples are constructed based on the character phonetics, glyphs and definitions in an external dictionary. Li et al. (2022c) proposed the Error-driven Contrastive Probability Optimization (ECOPO) framework, which optimizes the knowledge representation of the pre-trained model and guides the model, in an error-driven manner, to avoid predicting common but incorrect characters. Nevertheless, learning from a dictionary leverages heterogeneous knowledge from an external large-scale dictionary, which increases the cost of training, and ECOPO relies on the output of the model to generate negative samples, so the quality of these samples is subject to the performance of the model. Unlike these methods, we use the pinyin information from the batch and the confusion set to efficiently construct the negative samples required for contrastive learning, so as to present accurate contrastive information to the model.
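A rough sketch of how such in-batch, confusion-set-based negative pairs might be collected is given below; the identifiers, the pinyin representation and the confusion-set format are assumptions for illustration rather than the exact procedure of our framework.

```python
def build_negative_pairs(batch_chars, batch_pinyin, confusion_set):
    """Collect negative (confusable) character pairs from the current batch.

    batch_chars:   characters appearing in the batch, e.g. ["在", "再", "苹"].
    batch_pinyin:  pinyin strings aligned with batch_chars, e.g. ["zai", "zai", "ping"].
    confusion_set: dict mapping a character to its visually or phonetically
                   confusable characters, e.g. {"在": {"再", "载"}}.
    Returns (anchor, negative) pairs whose representations should be pulled apart.
    """
    pairs = []
    for i, (ch_i, py_i) in enumerate(zip(batch_chars, batch_pinyin)):
        for ch_j, py_j in zip(batch_chars[i + 1:], batch_pinyin[i + 1:]):
            same_pinyin = (py_i == py_j) and (ch_i != ch_j)      # phonetic negatives
            listed = ch_j in confusion_set.get(ch_i, set())      # confusion-set negatives
            if same_pinyin or listed:
                pairs.append((ch_i, ch_j))
    return pairs
```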
3. The Chinese spelling check framework
In this work, we treat spelling check as a non-autoregressive task. Given an input text sequence $X = \{x_1, x_2, \ldots, x_N\}$ and its corresponding pinyin (Mandarin phonetic transcription) sequence $Y = \{y_1, y_2, \ldots, y_N\}$, where $N$ denotes the length of the sequence, the ultimate goal is to automatically detect the incorrect characters in the sequence and then output the rectified target sequence $Z = \{z_1, z_2, \ldots, z_N\}$. The proposed framework for CSC consists of three modules: language representation, spelling check and reverse contrastive learning (see Figure 1).
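To make the formulation concrete, a small illustrative instance is shown below; the sentence is our own example rather than one drawn from the benchmark data.

```python
# "平" and "苹" share the pronunciation "ping" and are therefore easily confused;
# the corrected output has exactly the same length as the input.
X = list("我喜欢吃平果")                              # input characters  x_1, ..., x_N
Y = ["wo", "xi", "huan", "chi", "ping", "guo"]        # pinyin sequence   y_1, ..., y_N
Z = list("我喜欢吃苹果")                              # target characters z_1, ..., z_N
assert len(X) == len(Y) == len(Z)                     # non-autoregressive: lengths match
```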
3.1. Language representation module
In the language representation module, we adopt a pretrained model to encode the input. Trained on huge amounts of unannotated text data, the pretrained model memorizes language regularities in its parameters and hence offers rich contextual representations to the downstream modules. Given an input sequence $X$, the model would first project