
(Gao, Yao and Chen, 2021). With respect to CSC, Li, Zhou, Li, Li, Liu, Sun, Wang, Li, Cao and Zheng (2022c) were the
first to employ contrastive learning, proposing an error-driven method that guides the model to distinguish right
answers from wrong ones.
Inspired by contrastive learning, in this study we present a simple yet effective CSC framework designed to
enhance the performance of existing CSC models. Whereas the objective of contrastive learning is to pull similar
examples together, we propose instead to pull apart phonetically or visually similar characters, which helps the
models distinguish between confusable characters.
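To make the intuition concrete, a minimal sketch of such a reverse contrastive objective, assuming a margin-based hinge form (an illustrative formulation rather than the exact loss adopted in this paper), could be written as
\[
\mathcal{L}_{\mathrm{RCL}} \;=\; \frac{1}{|\mathcal{P}|} \sum_{(x_i, x_j) \in \mathcal{P}} \max\bigl(0,\; m - \lVert h(x_i) - h(x_j) \rVert_2 \bigr),
\]
where $\mathcal{P}$ denotes a set of phonetically or visually confusable character pairs, $h(\cdot)$ is the encoder's character representation, and $m$ is a margin hyperparameter. Minimizing $\mathcal{L}_{\mathrm{RCL}}$ pushes the representations of confusable characters at least a distance $m$ apart, the reverse of a standard contrastive objective that pulls positive pairs together.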
Our contributions can be summarized as follows:
(1) We extend the idea of contrastive learning to the CSC task and propose a Reverse Contrastive Learning (RCL)
strategy, which results in models that better detect and correct spelling errors involving confusable characters.
(2) We put forward a model-agnostic CSC framework in which our RCL strategy, as a subsidiary component, can
be easily combined with existing CSC models to yield better performance.
(3) The CSC models equipped with our strategy establish a new state of the art on the SIGHAN benchmarks.
2. Related Work
2.1. Chinese Spelling Check
In recent years, research on CSC has mainly focused on two directions: data generation for CSC and CSC-oriented
language models.
Data generation for CSC. Wang et al. (2018) proposed a method to automatically construct a CSC corpus, which
generates visually or phonologically similar characters based on OCR and ASR technology, respectively, thereby
greatly expanding the scale of the corpus. Duan, Pan, Wang, Zhang and Wu (2019) built two corpora: a visual corpus
and a phonological one. While the construction of both corpora makes use of ASR technology, the phonological one
also exploits the conversion between Chinese characters and their pronunciations.
CSC-oriented language model. Recently, researchers have mainly focused on employing language models to
capture information in terms of character similarity and phonological similarity, facilitating the CSC task. These
approaches are dominated by neural network-based models (Cheng, Xu, Chen, Jiang, Wang, Wang, Chu and Qi, 2020;
Ma, Hu, Peng, Zheng and Xu, 2023), especially pre-trained language models. Related studies have explored the potential
semantic modeling capability of pre-trained language models, with BERT being widely utilized as the backbone of
CSC models. However, these methods may lead CSC models to overfit the errors in sentences and disturb the semantic
encoding of sentences, yielding poor generalization (Zhang, Zheng, Yan and Qiu, 2022b; Wu, Zhang,
Zhang and Zhao, 2023). To address these issues, Yang (2023) proposed an n-gram masking layer to alleviate the common
label leakage and error disturbance problems. Another line of work fuses textual, visual and phonetic information
into pre-trained models. Wang, Che, Wu, Wang, Hu and Liu (2021) presented Dynamic Connected Networks (DCN),
a non-autoregressive model that utilizes a Pinyin-enhanced candidate generator to produce candidate Chinese
characters and then models the dependencies between neighboring characters with an attention-based network.
SCOPE (Li, Wang, Mao, Guo, Yang and Zhang, 2022a) enhances CSC performance by imposing an auxiliary
pronunciation prediction task and devising an iterative inference strategy; other CSC models based on multimodal
information include SpellBERT (Ji, Yan and Qiu, 2021), PLOME (Liu, Yang, Yue, Zhang and Wang, 2021) and
ReaLiSe (Xu, Li, Zhou, Li, Wang, Cao, Huang and Mao, 2021). Despite the effectiveness of multimodal information,
some inherent problems remain.
Specifically, since the direct integration of phonetic information may affect the raw text representation and weaken the
effect of the phonetic information, Liang, Quan and Wang (2023) decoupled the textual and phonetic representations and
designed a pinyin-to-character pre-training task to enhance the effect of phonetic knowledge, while introducing a
self-distillation module to prevent the model from overfitting the phonetic information. Moreover, to address the issue
that characters' relative positions are not well aligned in multimodal spaces, Liang, Huang, Li and Shi (2022) presented
a distance-fusion and knowledge-enhanced framework for CSC that captures the relative distances between characters
and reconstructs the semantic representation, while reducing errors caused by unseen fixed expressions.
The aforementioned models introduce character-similarity and phonological-similarity information from the confusion
set but neglect contextual similarity, which is shown to be more valuable by Zhang, Li, Zhou, Ma, Li, Cao and
Zheng (2023), who design a simple yet effective curriculum learning framework that explicitly guides CSC models
to capture the contextual similarity between Chinese characters.