Unsupervised Boundary-Aware Language Model Pretraining for Chinese
Sequence Labeling
Peijie Jiang1 Dingkun Long Yanzhao Zhang Pengjun Xie
Meishan Zhang2 Min Zhang2
1School of New Media and Communication, Tianjin University, China
2Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen)
jzx555@tju.edu.cn,{zhangmeishan,zhangmin2021}@hit.edu.cn
{longdingkun1993,zhangyanzhao00,xpjandy}@gmail.com
Abstract
Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to a high-quality external lexicon, whose items offer explicit boundary information. However, ensuring the quality of such a lexicon requires considerable human effort, a cost that has generally been ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode this information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction in Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT provides consistent improvements on all datasets. In addition, our method complements previous supervised lexicon exploration: further improvements can be achieved when it is integrated with external lexicon information.
1 Introduction
The representative sequence labeling tasks for the Chinese language, such as word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) (Emerson, 2005; Jin and Chen, 2008), are typically performed at the character level in an end-to-end manner (Shen et al., 2016). This paradigm is natural and standard for Chinese word segmentation (CWS), while for Chinese POS tagging and NER it helps reduce error propagation (Sun and Uszkoreit, 2012; Yang et al., 2016; Liu et al., 2019a) compared with word-based counterparts through straightforward modeling.
Recently, all the above tasks have reached state-of-the-art performance with the help of BERT-like pre-trained language models (Yan et al., 2019; Meng et al., 2019). BERT variants such as BERT-wwm (Cui et al., 2021), ERNIE (Sun et al., 2019), ZEN (Diao et al., 2020), and NEZHA (Wei et al., 2019) further improve the vanilla BERT by either using external knowledge or larger-scale training corpora. These improvements also benefit character-level Chinese sequence labeling tasks.
Notably, since the output tags of all these character-level Chinese sequence labeling tasks involve identifying Chinese words or entities (Zhang and Yang, 2018; Yang et al., 2019), prior boundary knowledge can be highly helpful for them. A number of studies propose integrating an external lexicon to enhance their baseline models through feature representation learning (Jia et al., 2020; Tian et al., 2020a; Liu et al., 2021). Moreover, some works suggest injecting similar resources into the pre-trained BERT weights. BERT-wwm (Cui et al., 2021) and ERNIE (Sun et al., 2019) are representatives, which leverage an external lexicon for masked word prediction in Chinese BERT.
Lexicon-based methods have indeed achieved great success for boundary integration. However, they have two major drawbacks. First, lexicon resources are usually constructed manually (Zhang and Yang, 2018; Diao et al., 2020; Jia et al., 2020; Liu et al., 2021), which is expensive and time-consuming, while the quality of the lexicon is critical to these tasks. Second, different tasks and different domains require different lexicons (Jia et al., 2020; Liu et al., 2021): a well-studied lexicon for word segmentation might be inappropriate for NER, and a lexicon for news NER might be problematic for finance NER. Both drawbacks stem from the supervised nature of these lexicon-based enhancements. Thus, it is more desirable to offer boundary information in an unsupervised manner.
In this paper, we propose an unsupervised Boundary-Aware BERT (BABERT), which is achieved by fully exploring the potential of
statistical features mined from a large-scale raw corpus. We extract a set of N-grams (with a predefined fixed N), regardless of whether they are valid words or entities, and then calculate their corresponding unsupervised statistical features, which are closely related to boundary information. We inject this boundary information into an internal layer of a pre-trained BERT, so that the final BABERT model approximates the boundary knowledge softly through its internal representations. BABERT is structurally identical to the original BERT, so we can use it in the same way as standard BERT.
We conduct experiments on three Chinese sequence labeling tasks to demonstrate the effectiveness of our proposed method. Experimental results show that our approach significantly outperforms other Chinese pre-trained language models. In addition, compared with supervised lexicon-based methods, BABERT obtains competitive results on all tasks and achieves further improvements when integrated with external lexicon knowledge. We also conduct extensive analyses to understand our method comprehensively. The pre-trained model and code will be publicly available at http://github.com/modelscope/adaseq/examples/babert.
Our contributions in this paper include the following: 1) we design a method to encode unsupervised statistical boundary information into boundary-aware representations; 2) we propose a new pre-trained language model, BABERT, as a boundary-aware extension of BERT; 3) we verify BABERT on ten benchmark datasets covering three Chinese sequence labeling tasks.
2 Related Work
In the past decades, machine learning has achieved good performance on sequence labeling tasks with statistical information (Bellegarda, 2004; Low et al., 2005; Bouma, 2009). Recently, neural models have led to state-of-the-art results for Chinese sequence labeling (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). In addition, language representation models such as BERT (Devlin et al., 2019) have led to impressive improvements. In particular, many variants of BERT are devoted to integrating boundary information into BERT to improve Chinese sequence labeling (Diao et al., 2020; Jia et al., 2020; Liu et al., 2021).
Statistical Machine Learning
Statistical information is critical for sequence labeling. Previous works attempt to count such information from large corpora and combine it with machine learning methods for sequence labeling (Bellegarda, 2004; Liang, 2005; Bouma, 2009). Peng et al. (2004) perform sequence labeling with a CRF and a statistics-based new word discovery method. Low et al. (2005) introduce a maximum entropy approach for sequence labeling. Liang (2005) utilizes unsupervised statistical information in Markov models and obtains improvements on Chinese NER and CWS.
Pre-trained Language Model
Pre-trained language models are an active research topic in the natural language processing (NLP) community (Devlin et al., 2019; Liu et al., 2019b; Wei et al., 2019; Clark et al., 2020; Diao et al., 2020; Zhang et al., 2021) and have been extensively studied for Chinese sequence labeling. For instance, TENER (Yan et al., 2019) adopts a Transformer encoder to model character-level features for Chinese NER. Glyce (Meng et al., 2019) uses BERT to capture contextual representations combined with glyph embeddings for Chinese sequence labeling.
Lexicon-based Methods
In recent studies, lexicon knowledge has been applied to improve model performance. There are two mainstream categories of lexicon enhancement. The first aims to enhance the original BERT with implicit boundary information through a multi-granularity word masking mechanism. BERT-wwm (Cui et al., 2021) and ERNIE (Sun et al., 2019) are representatives of this category, which mask tokens, entities, and phrases as the mask units in the masked language modeling (MLM) task to learn coarse-grained lexicon information during pre-training. ERNIE-Gram (Xiao et al., 2021), an extension of ERNIE, utilizes statistical boundary information for unsupervised word extraction to support masked word prediction. The second category, which includes ZEN (Diao et al., 2020), EEBERT (Jia et al., 2020), and LEBERT (Liu et al., 2021), exploits the potential of directly injecting lexicon information into BERT via extra modules, leading to better performance but remaining limited by predefined external knowledge. Our work follows the first line and is most similar to ERNIE-Gram. However, different from ERNIE-Gram, we do not discretize the real-valued statistical information extracted from the corpus, but instead adopt a regression approach to leverage the information fully.
[Figure 1 appears here. Panel details visible in the figure: (a) a boundary information extractor performs unsupervised information mining over a raw corpus to build an N-gram statistical dictionary with PMI and LRE values; (b) boundary-aware BERT representation composes LE, PMI, and RE representations from the contextual N-gram sets of an input sentence; (c) boundary-aware BERT learning combines an MLM loss with an MSE loss applied at an internal BERT layer.]

Figure 1: The overall architecture of the boundary-aware pre-trained language model, which consists of three parts: (a) boundary information extractor, (b) boundary-aware representation, and (c) boundary-aware BERT learning. The boundary-aware objective L_BA is defined in Equation 7.
3 Method
Figure 1 shows the overall architecture of our unsupervised boundary-aware pre-trained language model, which mainly consists of three components: 1) a boundary information extractor for unsupervised statistical boundary information mining; 2) boundary-aware representation, which integrates the statistical information at the character level; and 3) boundary-aware BERT learning, which injects the boundary knowledge into an internal layer of BERT. In this section, we first describe these components in detail, and then introduce the fine-tuning method for Chinese sequence labeling.
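As a preview of the third component, the sketch below shows one way such an internal-layer injection could be wired up during pre-training, following the figure: an MSE regression loss on an intermediate BERT layer's hidden states against boundary-feature targets, added to the usual MLM loss. The class name, the supervised layer index, the projection head, and the loss weight are illustrative assumptions, not the paper's exact design; the actual boundary-aware objective is defined in Equation 7.

```python
import torch.nn as nn

class BoundaryAwareAuxLoss(nn.Module):
    """Sketch (not the paper's exact objective): regress an intermediate
    BERT layer's hidden states onto per-character boundary features."""

    def __init__(self, hidden_size: int, boundary_dim: int, layer_index: int = 8):
        super().__init__()
        self.layer_index = layer_index                     # which internal layer to supervise (assumption)
        self.proj = nn.Linear(hidden_size, boundary_dim)   # map hidden states to boundary-feature space
        self.mse = nn.MSELoss()

    def forward(self, all_hidden_states, boundary_targets):
        # all_hidden_states: tuple of (batch, seq_len, hidden) tensors, one per layer
        # boundary_targets:  (batch, seq_len, boundary_dim) unsupervised statistical features
        hidden = all_hidden_states[self.layer_index]
        return self.mse(self.proj(hidden), boundary_targets)

# During pre-training, the total loss would then be, schematically:
#   loss = mlm_loss + lambda_ba * boundary_aux_loss   # lambda_ba is an assumed weight
```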
3.1 Boundary Information Extractor
Statistical boundary information has been shown to have a positive influence on a variety of Chinese NLP tasks (Song and Xia, 2012; Higashiyama et al., 2019; Ding et al., 2020; Xiao et al., 2021). We follow this line of work and design a boundary information extractor to mine statistical information from a large raw corpus in an unsupervised way.

The extractor proceeds in two steps: I) first, we collect all N-grams from the raw corpus to build a dictionary N, in which we count the frequency of each N-gram and filter out low-frequency items; II) second, since word frequency alone is insufficient to represent the flexible boundary relations in Chinese context, we further compute two unsupervised indicators that capture most of the boundary information in the corpus. In the following, we describe these two indicators in detail.
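To make step I concrete, here is a minimal sketch of how such an N-gram statistical dictionary could be built; the function name, the maximum N, the frequency threshold, and the simple probability normalization are illustrative assumptions rather than the paper's exact settings.

```python
from collections import Counter

def build_ngram_dict(sentences, max_n=4, min_freq=5):
    """Collect all character N-grams (1..max_n) from a raw corpus, drop
    low-frequency items, and return relative frequencies p(g).

    max_n and min_freq are assumed values, not the paper's settings.
    """
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    total = sum(counts.values())
    # Normalizing over all collected N-grams is a simplification; it is
    # enough to support the ratio-based PMI and entropy statistics below.
    return {g: c / total for g, c in counts.items() if c >= min_freq}
```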
Pointwise Mutual Information (PMI)
Given an N-gram, we split it into two sub-strings and compute the mutual information (MI) between them as a candidate. We enumerate all sub-string pairs and take the minimum MI as the overall PMI, which estimates the tightness of the N-gram. Let g = {c_1 ... c_m} be an N-gram consisting of m characters; we calculate PMI as:

PMI(g) = \min_{i \in [1:m-1]} \left\{ \frac{p(g)}{p(c_1 \cdots c_i) \cdot p(c_{i+1} \cdots c_m)} \right\},    (1)
where p(·) denotes the probability over the corpus. Note that when m = 1, the corresponding PMI is constantly equal to 1. A higher PMI indicates that the N-gram (e.g., "贝克汉姆 (Beckham)") has an occurrence probability close to that of its sub-string pair (e.g., "贝克 (Beck)" and "汉姆 (Ham)"), i.e., a strong association between the internal sub-strings, which makes the N-gram more likely to be a word/entity. In contrast, a lower PMI means the N-gram (e.g., "克汉 (Kehan)") is likely not a valid word/entity.
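As a concrete illustration of Equation 1, a small sketch follows. It assumes an N-gram probability dictionary like the one built above; the helper name pmi and the smoothing constant are ours, not the paper's.

```python
def pmi(gram, prob):
    """Minimum-over-splits PMI of an N-gram, as in Equation 1.

    prob maps a character string to its corpus probability p(.).
    Single characters return 1 by the paper's convention.
    """
    if len(gram) == 1:
        return 1.0
    eps = 1e-12  # assumed smoothing for unseen sub-strings
    scores = []
    for i in range(1, len(gram)):
        left, right = gram[:i], gram[i:]
        scores.append(prob.get(gram, eps) / (prob.get(left, eps) * prob.get(right, eps)))
    return min(scores)

# e.g., pmi("贝克汉姆", prob) should come out higher than pmi("克汉", prob)
# on a corpus where the full name mostly occurs as a unit.
```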
Left and Right Entropy (LRE)
Given an N-gram g, we first collect a left-adjacent character set S^l_m = {c^l_1, ..., c^l_{n_l}} containing n_l characters. Then, we utilize the conditional probability between g and its left-adjacent characters in S^l_m to compute the left entropy; the right entropy is computed symmetrically over the right-adjacent character set.
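For reference, the following is a minimal sketch of left entropy under its conventional definition (the entropy of the distribution of characters appearing immediately to the left of the N-gram); this is an assumption on our part and may differ in detail from the paper's exact formulation. Right entropy would be computed symmetrically.

```python
import math
from collections import Counter

def left_entropy(gram, sentences):
    """Conventional left entropy of an N-gram: entropy over the characters
    occurring immediately to its left in the corpus (sketch, assumed form)."""
    left_counts = Counter()
    for sent in sentences:
        start = sent.find(gram)
        while start != -1:
            if start > 0:
                left_counts[sent[start - 1]] += 1
            start = sent.find(gram, start + 1)
    total = sum(left_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in left_counts.values())
```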