Improving Chinese Spelling Check by Character Pronunciation Prediction:
The Effects of Adaptivity and Granularity
Jiahao Li1, Quan Wang2, Zhendong Mao1, Junbo Guo3, Yanyan Yang4, Yongdong Zhang1
1University of Science and Technology of China, Hefei, China
2MOE Key Laboratory of Trustworthy Distributed Computing and Service,
Beijing University of Posts and Telecommunications, Beijing, China
3People’s Daily Online Co., Beijing, China
4People’s Public Security University of China, Beijing, China
jiahao66@mail.ustc.edu.cn, wangquan@bupt.edu.cn, zdmao@ustc.edu.cn
guojunbo@people.cn, zhyd73@ustc.edu.cn
Abstract
Chinese spelling check (CSC) is a fundamental NLP task that detects and corrects spelling errors in Chinese texts. As most of these spelling errors are caused by phonetic similarity, effectively modeling the pronunciation of Chinese characters is a key factor for CSC. In this paper, we introduce an auxiliary task of Chinese pronunciation prediction (CPP) to improve CSC and, for the first time, systematically discuss the adaptivity and granularity of this auxiliary task. We propose SCOPE, which builds two parallel decoders on top of a shared encoder: one for the primary CSC task and the other for a fine-grained auxiliary CPP task, with a novel adaptive weighting scheme to balance the two tasks. In addition, we design an iterative correction strategy for further improvements during inference. Empirical evaluation shows that SCOPE achieves new state-of-the-art results on three CSC benchmarks, demonstrating the effectiveness and superiority of the auxiliary CPP task. Comprehensive ablation studies further verify the positive effects of the adaptivity and granularity of the task. Code and data used in this paper are publicly available at https://github.com/jiahaozhenbang/SCOPE.
1 Introduction
Chinese Spelling Check (CSC), which aims to detect and correct spelling errors in Chinese texts, is a fundamental task in Chinese natural language processing. Spelling errors mainly originate from human writing errors and machine recognition errors, e.g., errors caused by automatic speech recognition (ASR) and optical character recognition (OCR) systems (Huang et al., 2021). With the latest development of deep neural networks, neural CSC methods, in particular those based on encoder-decoder architectures, have become the mainstream of research in recent years (Xu et al., 2021; Liu et al., 2021). Encoder-decoder models regard CSC as a special sequence-to-sequence (Seq2Seq) problem, where a sentence with spelling errors is given as the input and a corrected sentence of the same length will be generated as the output.
Corresponding author: Quan Wang.
No. | Instance | Similarity (Coarse) | Similarity (Fine)
1 | W: 好好(wan2/w,an,2) "I think you will finish well." | 1 | 1
  | R: 好好(wan2/w,an,2) "I think you will play well." | |
2 | W: (gao1/g,ao,1) "I tried to high you before." | 0 | 2/3
  | R: (gao4/g,ao,4) "I tried to tell you before." | |
3 | W: (shou1/sh,ou,1) "When he received the mountain." | 0 | 1/3
  | R: (zou3/z,ou,3) "When he walked up the mountain." | |
4 | W: (lan2/l,an,2)录影 "Actions are recorded by blue control devices." | 0 | 0
  | R: (jian1/j,ian,1)录影 "Actions are recorded by surveillance devices." | |
Table 1: Instances from SIGHAN15 (Tseng et al., 2015). For each instance, coarse-/fine-grained pinyin of the misspelled (red) and correct (blue) characters are provided, along with their phonological similarity degree (the fraction of identical components) in terms of these two types of pinyin.
arXiv:2210.10996v1 [cs.CL] 20 Oct 2022
Previous research has shown that about 76% of Chinese spelling errors are induced by phonological similarity (Liu et al., 2011). Hence, effectively modeling the pronunciation of Chinese characters is crucial for the CSC task. In fact, almost all current advanced CSC approaches exploit character pronunciation, either explicitly or implicitly. The implicit use takes into account phonological similarities between pairs of characters, e.g., by increasing the decoding probability of characters with similar pronunciation (Cheng et al., 2020) or integrating such similarities into the encoding process via graph convolutional networks (GCNs) (Cheng et al., 2020). The explicit use directly considers the pronunciation, or more specifically the pinyin1, of individual characters, either encoding the pinyin of input characters to produce extra phonetic features (Xu et al., 2021; Huang et al., 2021) or decoding the pinyin of target correct characters to serve as an auxiliary prediction task (Liu et al., 2021; Ji et al., 2021).
This paper also considers improving CSC with auxiliary character pronunciation prediction (CPP), but focuses specifically on the adaptivity and granularity of the auxiliary task, which have never been systematically studied before. First, all prior attempts in a similar spirit simply assigned a universal trade-off between the primary and auxiliary tasks for all instances during training, ignoring the fact that the auxiliary task might provide different levels of benefit for different instances. Take for example the instances shown in Table 1. Compared to the misspelled character (lan2) and its correction (jian1) in the 4th instance, the two characters in the 1st instance (both wan2) are much more similar in pronunciation, suggesting that the spelling error there is more likely to be caused by phonological similarity. The pronunciation-related auxiliary task might therefore provide greater benefits for that instance and should be assigned a larger weight. Second, prior efforts mainly explored predicting the whole pinyin of a character, e.g., “gao1” for the misspelled character in the 2nd instance. Nevertheless, a syllable in Chinese is inherently composed of an initial, a final, and a tone, e.g., “g”, “ao”, and “1” for that same character. This fine-grained phonetic representation can better reflect not only the intrinsic regularities of Chinese pronunciation, but also the phonological similarities between Chinese characters. Consider for example the misspelled character (gao1) and its correction (gao4) in the 2nd instance in Table 1. These two characters show no similarity in terms of their whole pinyin, but they actually share the same initial and final, differing solely in their tones.
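The two similarity degrees reported in Table 1 can be reproduced with a few lines of Python. This is a minimal sketch, assuming each syllable is already given as an (initial, final, tone) triplet copied from Table 1; the function names `coarse_similarity` and `fine_similarity` are illustrative and are not taken from the SCOPE code base.

```python
from fractions import Fraction

def coarse_similarity(p, q):
    """Whole-pinyin comparison: 1 if both syllables (tone included) match, else 0."""
    return Fraction(int(p == q))

def fine_similarity(p, q):
    """Fraction of identical components among (initial, final, tone)."""
    matches = sum(a == b for a, b in zip(p, q))
    return Fraction(matches, len(p))

# (initial, final, tone) triplets of the misspelled / correct characters in Table 1.
table1_pairs = [
    (("w", "an", "2"), ("w", "an", "2")),    # 1st instance: wan2 vs. wan2
    (("g", "ao", "1"), ("g", "ao", "4")),    # 2nd instance: gao1 vs. gao4
    (("sh", "ou", "1"), ("z", "ou", "3")),   # 3rd instance: shou1 vs. zou3
    (("l", "an", "2"), ("j", "ian", "1")),   # 4th instance: lan2 vs. jian1
]

for wrong, right in table1_pairs:
    print("".join(wrong), "".join(right),
          coarse_similarity(wrong, right), fine_similarity(wrong, right))
```

The 2nd instance illustrates the point made above: the whole-pinyin comparison sees no overlap, while the component-wise comparison credits the shared initial and final.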
1 Pinyin is the official phonetic system of Mandarin Chinese, which literally means “spelled sounds”.
Based on the above intuitions, we devise SCOPE (i.e., Spelling Check by prOnunciation PrEdiction), which introduces a fine-grained CPP task with an adaptive task weighting scheme to improve CSC. Figure 1 provides an overview of SCOPE. Given a sentence with spelling errors as input, we encode it using ChineseBERT (Sun et al., 2021) to produce semantic and phonetic features. Then we build on top of the encoder two parallel decoders, one to generate target correct characters, i.e., the primary CSC task, and the other to predict the initial, final and tone of the pinyin of each target character, i.e., the auxiliary fine-grained CPP task. The trade-off between the two tasks can be further adjusted adaptively for each instance, according to the phonological similarity between input and target characters therein. In addition, we design an iterative correction strategy during inference to address the over-correction issue and tackle difficult instances with consecutive errors.
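To make the idea of an adaptively weighted joint objective concrete, the following sketch combines per-character CSC and CPP losses. The exact objective is not stated in this section, so the form used here is an assumption: the CPP loss of a character is taken as the sum of the cross-entropies for its initial, final and tone, and the joint loss is the average of `csc_i + w_i * cpp_i`. The weights `w_i` are supplied by the caller and stand in for the phonological similarity between input and target characters. All function names are hypothetical.

```python
import math

def cpp_char_loss(initial_probs, final_probs, tone_probs, gold):
    """Assumed CPP loss for one character: summed negative log-likelihood
    of the gold (initial, final, tone) under three predicted distributions."""
    initial, final, tone = gold
    return -(math.log(initial_probs[initial])
             + math.log(final_probs[final])
             + math.log(tone_probs[tone]))

def adaptive_joint_loss(csc_losses, cpp_losses, weights):
    """Assumed joint objective: mean over characters of csc_i + w_i * cpp_i.
    The weights are treated as constants (no gradient flows through them)."""
    if not (len(csc_losses) == len(cpp_losses) == len(weights)):
        raise ValueError("per-character lists must have equal length")
    total = sum(c + w * p for c, p, w in zip(csc_losses, cpp_losses, weights))
    return total / len(csc_losses)

# Toy numbers: the first character is phonologically close to its target
# (weight 1.0), the second is unrelated (weight 0.0), so only the first
# character's CPP loss contributes.
print(adaptive_joint_loss([0.2, 1.0], [0.3, 0.9], [1.0, 0.0]))
```

With a universal trade-off, every entry of `weights` would be the same constant; the adaptive variant lets instances with phonologically similar errors draw more signal from the auxiliary task.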
We empirically evaluate SCOPE on three shared benchmarks and achieve substantial and consistent improvements over the previous state of the art on all three, demonstrating the effectiveness and superiority of our auxiliary CPP task. Comprehensive ablation studies further verify the positive effects of the adaptivity and granularity of the task.
The main contributions of this paper are summarized as follows: (1) We investigate the possibility of introducing an auxiliary CPP task to improve CSC and, for the first time, systematically discuss the adaptivity and granularity of this auxiliary task. (2) We propose SCOPE, which builds two parallel decoders upon a shared encoder for CSC and CPP, with a novel adaptive weighting scheme to balance the two tasks. (3) We establish a new state of the art on three CSC benchmark datasets.
2 Related Work
Figure 1: Overview of SCOPE. Top: The one-encoder-two-decoder structure for CSC and CPP. The input sentence X is fed into the encoder and then, after character-/pronunciation-specific feature projection, into two parallel decoders, one to predict the characters, the other to predict the initial, final, and tone of each character in the target sentence. Bottom: Adaptive task weighting between CSC and CPP (detached in the backward pass). The target sentence Y is fed into the encoder and the pronunciation-specific feature projection layer. Then the character-level similarities between the input and target sentences are calculated and the adaptive weights are defined accordingly. Note: Only the CSC decoder branch (along with the encoder) is used at inference time.
CSC is a fundamental NLP task that has received wide attention over the past decades. Early work on this topic was mainly based on manually designed rules (Mangu and Brill, 1997; Jiang et al., 2012). After that, statistical language models became the mainstream for CSC (Chen et al., 2013; Yu and Li, 2014; Tseng et al., 2015). Methods of this kind generally followed a pipeline of error detection, candidate generation, and candidate selection. Given a sentence, the error positions are first detected by the perplexity of a language model. Candidate corrections are then generated according to the similarity between characters, typically by using a confusion set. The final corrections are determined by using the language model to score the sentence with the candidates substituted in (Liu et al., 2013; Xie et al., 2015).
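The detection, candidate generation, and candidate selection pipeline of the statistical methods can be sketched as follows. This is a toy illustration only: a hand-written bigram log-probability table stands in for a real language model and its perplexity, English tokens stand in for Chinese characters, and the confusion set, scores, and threshold are all invented for the example.

```python
# Toy bigram log-probabilities standing in for a language model (invented values).
BIGRAM_LOGP = {
    ("i", "play"): -1.0, ("play", "well"): -1.0,
    ("i", "pray"): -4.0, ("pray", "well"): -5.0,
}
UNSEEN_LOGP = -8.0

# Toy confusion set: tokens that are easily mistaken for one another (invented).
CONFUSION_SET = {"pray": ["play"], "play": ["pray"]}

def bigram_logp(bigram):
    return BIGRAM_LOGP.get(bigram, UNSEEN_LOGP)

def sentence_score(tokens):
    """Sum of bigram log-probabilities; higher means more fluent."""
    return sum(bigram_logp(bg) for bg in zip(tokens, tokens[1:]))

def detect(tokens, threshold=-3.0):
    """Step 1: flag positions whose surrounding bigrams all score below threshold."""
    flagged = []
    for i in range(len(tokens)):
        around = []
        if i > 0:
            around.append((tokens[i - 1], tokens[i]))
        if i < len(tokens) - 1:
            around.append((tokens[i], tokens[i + 1]))
        if around and all(bigram_logp(bg) < threshold for bg in around):
            flagged.append(i)
    return flagged

def correct(tokens):
    tokens = list(tokens)
    for i in detect(tokens):
        # Step 2: candidates are the original token plus its confusion set.
        candidates = [tokens[i]] + CONFUSION_SET.get(tokens[i], [])
        # Step 3: keep the candidate that gives the best-scoring sentence.
        tokens[i] = max(
            candidates,
            key=lambda c: sentence_score(tokens[:i] + [c] + tokens[i + 1:]),
        )
    return tokens

print(correct(["i", "pray", "well"]))
```

Because each stage is separate, a detection miss or an incomplete confusion set cannot be recovered later in the pipeline, which is one motivation for the end-to-end neural methods discussed next.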
In the era of deep learning, especially after Transformer (Vaswani et al., 2017) and pre-trained language models like BERT (Devlin et al., 2019) were proposed, a large number of neural CSC methods have emerged. Hong et al. (2019) used Transformer as an encoder to produce candidates and designed a confidence-similarity decoder to filter these candidates. Zhang et al. (2020) designed a detection network based on Bi-GRU to predict the error probability of each character and passed the probabilities to a BERT-based correction network via a soft masking mechanism. Cheng et al. (2020) employed GCNs combined with BERT to further model interdependences between characters. Recent work (Xu et al., 2021; Liu et al., 2021; Huang et al., 2021) proposed to encode phonetic and glyph information in addition to semantic information, and then combine phonetic, glyph and semantic features to make final predictions.
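As an illustration of the soft masking mechanism mentioned above, the sketch below interpolates between a character's embedding and the mask embedding according to the predicted error probability. It assumes the commonly described form e' = p * e_mask + (1 - p) * e and uses plain Python lists in place of tensors; the function name `soft_mask` is hypothetical.

```python
def soft_mask(embedding, mask_embedding, error_prob):
    """Blend a character embedding with the mask embedding.

    error_prob = 0 keeps the original embedding, error_prob = 1 replaces it
    with the mask embedding, and values in between interpolate linearly.
    """
    if not 0.0 <= error_prob <= 1.0:
        raise ValueError("error_prob must lie in [0, 1]")
    if len(embedding) != len(mask_embedding):
        raise ValueError("embeddings must have the same dimension")
    return [error_prob * m + (1.0 - error_prob) * e
            for e, m in zip(embedding, mask_embedding)]

# A character the detector considers half likely to be wrong.
print(soft_mask([2.0, 4.0], [0.0, 2.0], 0.5))
```

The correction network then receives these blended embeddings, so likely errors are treated almost as masked positions while confident characters pass through nearly unchanged.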
As can be seen, modeling pronunciation information is prevalent in CSC research (Zhang et al., 2021), typically via an encoding process that extracts phonetic features. Liu et al. (2021) were the first to consider predicting the pronunciation of target characters as an auxiliary task. Their work, however, employed pronunciation prediction in a coarse-grained, non-adaptive manner, which is quite different from ours.
3 Our Approach
This section presents our approach SCOPE for the CSC task. Below, we first define the problem formulation and then describe our approach in detail.
3.1 Problem Formulation
The Chinese spelling check (CSC) task is to detect and correct spelling errors in Chinese texts. Given a misspelled sentence $X = \{x_1, x_2, \cdots, x_n\}$ with $n$ characters, a CSC model takes $X$ as input, detects potential spelling errors on character level, and outputs a corresponding correct sentence $Y = \{y_1, y_2, \cdots, y_n\}$ of equal length. This task can be viewed as a conditional sequence generation problem that models the probability $p(Y|X)$. We are further given the fine-grained pinyin of each character $y_i$ in the correct sentence $Y$, represented as a triplet in the form of $(\alpha_i, \beta_i, \gamma_i)$, where $\alpha_i$, $\beta_i$, and $\gamma_i$ indicate the initial, final, and tone, respectively. Note that this kind of pinyin of the output sentence is required and provided solely during training.2
2 In fact, we also use the pinyin of each character $x_i$ in the input sentence $X$ during the ChineseBERT encoding process (detailed later), and this kind of pinyin of the input sentence is required and provided during both training and inference.
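The formulation above maps naturally onto a small data structure. The sketch below is an illustration, not code from the SCOPE repository: the class name `CSCInstance` and its fields are hypothetical, and the toy sentence uses characters chosen only because their pinyin (hao3, hao3, wan2) is unambiguous. Since $X$ and $Y$ have equal length, error positions fall out of a position-wise comparison.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pinyin = Tuple[str, str, str]  # (initial, final, tone), i.e. (alpha_i, beta_i, gamma_i)

@dataclass
class CSCInstance:
    x: List[str]             # misspelled input sentence X, one character per entry
    y: List[str]             # corrected target sentence Y, same length as X
    y_pinyin: List[Pinyin]   # fine-grained pinyin of each y_i (training only)

    def __post_init__(self):
        if not (len(self.x) == len(self.y) == len(self.y_pinyin)):
            raise ValueError("X, Y and the pinyin targets must have equal length")

    def error_positions(self) -> List[int]:
        """Indices i where x_i differs from y_i."""
        return [i for i, (a, b) in enumerate(zip(self.x, self.y)) if a != b]

# Toy instance (hand-written for illustration).
inst = CSCInstance(
    x=list("好好完"),
    y=list("好好玩"),
    y_pinyin=[("h", "ao", "3"), ("h", "ao", "3"), ("w", "an", "2")],
)
print(inst.error_positions())
```

Keeping the pinyin triplets as a separate field mirrors the note above: they are training targets for the auxiliary task and can be dropped at inference time.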