
all current advanced CSC approaches have actually
exploited, either explicitly or implicitly, character
pronunciation. The implicit use takes into account
phonological similarities between pairs of characters,
e.g., by increasing the decoding probability of
characters with similar pronunciation (Cheng et al., 2020)
or integrating such similarities into the encoding
process via graph convolutional networks (GCNs)
(Cheng et al., 2020). The explicit use considers
directly the pronunciation, or more specifically,
pinyin (the official phonetic system of Mandarin
Chinese, which literally means "spelled sounds"),
of individual characters, encoding the pinyin of
input characters to produce extra phonetic features
(Xu et al., 2021; Huang et al., 2021) or decoding the
pinyin of target correct characters to serve as an
auxiliary prediction task (Liu et al., 2021; Ji et al., 2021).
This paper also considers improving CSC with an
auxiliary character pronunciation prediction (CPP)
task, but focuses specifically on the adaptivity and
granularity of the auxiliary task, which have never
been systematically studied before. First, all prior
attempts in a similar spirit simply assigned a universal
trade-off between the primary and auxiliary tasks
for all instances during training, ignoring the fact
that the auxiliary task might provide different levels
of benefit for different instances. Take for example
the instances shown in Table 1. Compared to the
misspelled character "蓝" and its correction "监" in
the 4th instance, the two characters "完" and "玩"
in the 1st instance are much more similar in
pronunciation, suggesting that the spelling error there
is more likely caused by phonological similarity, a
case where the pronunciation-related auxiliary task
might provide greater benefits and hence should be
assigned a larger weight. Second, prior efforts mainly
explored predicting the whole pinyin of a character,
e.g., "gao1" for "高". Nevertheless, a syllable in
Chinese is inherently composed of an initial, a final,
and a tone, e.g., "g", "ao", and "1" for "高". This
fine-grained phonetic representation can better reflect
not only the intrinsic regularities of Chinese
pronunciation, but also the phonological similarities
between Chinese characters. Consider for example
the "高" and "告" case from the 2nd instance in
Table 1. These two characters show no similarity in
terms of their whole pinyin, but they actually share
the same initial and final, differing solely in their tones.
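
To make the initial/final/tone decomposition concrete, here is a minimal sketch of splitting a toned pinyin string such as "gao1" into its three components. The split_pinyin helper and its initial inventory are our own illustrative choices (conventions differ, e.g., on whether "y" and "w" count as initials); the paper defines the CPP targets themselves, not this parser.

# Two-letter initials must be matched before single-letter ones.
INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j", "q", "x",
     "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True,
)

def split_pinyin(syllable: str) -> tuple[str, str, str]:
    # Split e.g. "gao1" into ("g", "ao", "1"); zero-initial syllables
    # such as "an1" yield an empty initial.
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    base = syllable[:-1] if tone else syllable
    for ini in INITIALS:
        if base.startswith(ini):
            return ini, base[len(ini):], tone
    return "", base, tone

assert split_pinyin("gao1") == ("g", "ao", "1")   # 高
assert split_pinyin("gao4") == ("g", "ao", "4")   # 告: same initial and final
assert split_pinyin("zhong1") == ("zh", "ong", "1")

Under this decomposition, "gao1" (高) and "gao4" (告) agree on two of their three components, a similarity that whole-pinyin prediction cannot expose.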
Based on the above intuitions, we devise SCOPE
(i.e., Spelling Check by prOnunciation PrEdiction),
which introduces a fine-grained CPP task with an
adaptive task weighting scheme to improve CSC.
Figure 1 provides an overview of
SCOPE. Given a sentence with spelling errors as
input, we encode it using ChineseBERT (Sun et al.,
2021) to produce semantic and phonetic features.
Then we build on top of the encoder two parallel
decoders, one to generate target correct characters,
i.e., the primary CSC task, and the other to predict
the initial, final and tone of the pinyin of each target
character, i.e., the auxiliary fine-grained CPP task.
The trade-off between the two tasks can be further
adjusted adaptively for each instance, according
to the phonological similarity between input and
target characters therein. In addition, we design an
iterative correction strategy during inference to ad-
dress the over-correction issue and tackle difficult
instances with consecutive errors.
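
As a rough sketch of the adaptive weighting idea (our own simplification, using a single CPP head instead of the paper's separate initial, final, and tone predictions, and an assumed per-token similarity score sim), the auxiliary loss can be scaled per instance by phonological similarity:

import torch.nn.functional as F

def scope_style_loss(csc_logits, cpp_logits, char_targets, pinyin_targets, sim):
    # csc_logits:     (batch, seq, vocab) character-correction logits
    # cpp_logits:     (batch, seq, n_pinyin) pinyin-prediction logits
    # char_targets:   (batch, seq) gold character ids
    # pinyin_targets: (batch, seq) gold pinyin-unit ids
    # sim:            (batch, seq) phonological similarity in [0, 1] between
    #                 each input character and its target character
    csc_loss = F.cross_entropy(csc_logits.transpose(1, 2), char_targets,
                               reduction="none")
    cpp_loss = F.cross_entropy(cpp_logits.transpose(1, 2), pinyin_targets,
                               reduction="none")
    # Tokens whose errors look phonologically motivated get a larger
    # auxiliary weight; this weighting rule is an assumption for
    # illustration, not the paper's exact scheme.
    return (csc_loss + sim * cpp_loss).mean()

This matches the 完/玩 example above: the more alike an input character and its correction sound, the larger the CPP weight, since such errors are more likely caused by phonological similarity.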
We empirically evaluate SCOPE on three shared
benchmarks and achieve substantial, consistent
improvements over the previous state of the art on
all of them, demonstrating the effectiveness and
superiority of our auxiliary CPP task. Comprehensive
ablation studies further verify the positive effects
of the adaptivity and granularity of the task.
The main contributions of this paper are summa-
rized as follows: (1) We investigate the possibility
of introducing an auxiliary CPP task to improve
CSC and, for the first time, systematically discuss
the adaptivity and granularity of this auxiliary task.
(2) We propose SCOPE, which builds two parallel
decoders upon a shared encoder for CSC and CPP,
with a novel adaptive weighting scheme to balance
the two tasks. (3) We establish a new state of the
art on three CSC benchmark datasets.
2 Related Work
CSC is a fundamental NLP task that has received
wide attention over the past decades. Early work on
this topic was mainly based on manually designed
rules (Mangu and Brill, 1997; Jiang et al., 2012).
After that, statistical language models became the
mainstream for CSC (Chen et al., 2013; Yu and Li,
2014; Tseng et al., 2015). Methods of this kind in
general followed a pipeline of error detection,
candidate generation, and candidate selection. Given
a sentence, error positions are first detected based
on the perplexity of a language model. Candidates
for correction can then be generated according to
the similarity between characters, typically by using