
Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling
Peijie Jiang1, Dingkun Long, Yanzhao Zhang, Pengjun Xie,
Meishan Zhang2∗, Min Zhang2
1School of New Media and Communication, Tianjin University, China
2Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen)
jzx555@tju.edu.cn, {zhangmeishan,zhangmin2021}@hit.edu.cn
{longdingkun1993,zhangyanzhao00,xpjandy}@gmail.com
∗Corresponding author.
Abstract
Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to a high-quality external lexicon, whose items offer explicit boundary information. However, ensuring the quality of such a lexicon requires substantial human effort, a cost that has been generally ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode this information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT to feature induction for Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT provides consistent improvements on all datasets. In addition, our method complements previous supervised lexicon exploration: further improvements can be achieved when it is integrated with external lexicon information.
1 Introduction
The representative sequence labeling tasks for the Chinese language, such as word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) (Emerson, 2005; Jin and Chen, 2008), are typically performed at the character level in an end-to-end manner (Shen et al., 2016). This paradigm is the natural choice for Chinese word segmentation (CWS), while for Chinese POS tagging and NER, modeling characters directly helps reduce the error propagation incurred by word-based counterparts (Sun and Uszkoreit, 2012; Yang et al., 2016; Liu et al., 2019a).
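For concreteness, character-level CWS is conventionally cast as BMES tagging over characters. The snippet below is an illustrative example (not taken from the paper) that converts a gold word segmentation into character-level tags; the helper name and the sample sentence are our own.

import itertools

def words_to_bmes(words):
    """Convert a gold word segmentation into character-level BMES tags:
    B/M/E mark the beginning/middle/end of a multi-character word,
    S marks a single-character word."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

# "我 / 喜欢 / 自然语言处理" -> I / like / natural language processing
words = ["我", "喜欢", "自然语言处理"]
chars = list(itertools.chain.from_iterable(words))
for ch, tag in zip(chars, words_to_bmes(words)):
    print(ch, tag)  # e.g. 我 S, 喜 B, 欢 E, 自 B, ..., 理 E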
Recently, all of the above tasks have reached state-of-the-art performance with the help of BERT-like pre-trained language models (Yan et al., 2019; Meng et al., 2019). BERT variants such as BERT-wwm (Cui et al., 2021), ERNIE (Sun et al., 2019), ZEN (Diao et al., 2020), and NEZHA (Wei et al., 2019) further improve on vanilla BERT by using either external knowledge or larger-scale training corpora. These improvements also benefit character-level Chinese sequence labeling tasks.
Notably, since the output tags of all these character-level Chinese sequence labeling tasks involve identifying Chinese words or entities (Zhang and Yang, 2018; Yang et al., 2019), prior boundary knowledge can be highly helpful. A number of studies integrate an external lexicon to enhance their baseline models through feature representation learning (Jia et al., 2020; Tian et al., 2020a; Liu et al., 2021). Moreover, some works inject similar resources into the pre-trained BERT weights; BERT-wwm (Cui et al., 2021) and ERNIE (Sun et al., 2019) are representatives, leveraging an external lexicon for masked word prediction in Chinese BERT.
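To illustrate how such lexicon knowledge enters pretraining, the sketch below mimics whole-word masking in the style of BERT-wwm: once a lexicon or segmenter provides word boundaries, all characters of a sampled word are masked together rather than independently. The function name, the 15% rate, and the word-level sampling scheme are our illustrative assumptions, not the exact recipe of any released model.

import random

def whole_word_mask(words, mask_rate=0.15, mask_token="[MASK]"):
    """Whole-word masking sketch: mask every character of a chosen word.

    `words` is a sentence already split by a lexicon/segmenter; the
    masking rate and greedy word-level sampling are illustrative choices.
    """
    chars, labels = [], []
    for word in words:
        if random.random() < mask_rate:
            # Mask all characters of the word; the originals become targets.
            chars.extend([mask_token] * len(word))
            labels.extend(list(word))
        else:
            chars.extend(list(word))
            labels.extend([None] * len(word))  # not predicted
    return chars, labels

random.seed(0)
print(whole_word_mask(["我", "喜欢", "自然语言处理"]))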
Lexicon-based methods have indeed achieved great success for boundary integration. However, they suffer from two major drawbacks. First, lexicon resources are usually constructed manually (Zhang and Yang, 2018; Diao et al., 2020; Jia et al., 2020; Liu et al., 2021), which is expensive and time-consuming, yet the quality of the lexicon is critical to these tasks. Second, different tasks as well as different domains require different lexicons (Jia et al., 2020; Liu et al., 2021): a well-studied lexicon for word segmentation might be inappropriate for NER, and a lexicon for news NER might be problematic for finance NER. Both drawbacks stem from the supervised nature of these lexicon-based enhancements. Thus, it is more desirable to offer boundary information in an unsupervised manner.
In this paper, we propose an unsupervised Boundary-Aware BERT (BABERT), which is achieved by fully exploring the potential of statistical boundary information.
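To give a sense of what unsupervised statistical boundary information can look like, the sketch below mines two classic signals from raw text: pointwise mutual information (PMI) between adjacent characters (low PMI suggests a boundary between them) and right branching entropy of a character (high entropy over the characters that follow suggests it ends a word). This is a generic illustration of such statistics under our own simplifications, not the paper's exact formulation.

import math
from collections import Counter

def boundary_statistics(corpus):
    """Mine illustrative boundary statistics from raw sentences.

    PMI and branching entropy are classic unsupervised boundary signals;
    the features actually used in the paper may differ.
    """
    unigrams, bigrams = Counter(), Counter()
    right_ctx = {}  # char -> Counter of characters that follow it
    for sent in corpus:
        for i, ch in enumerate(sent):
            unigrams[ch] += 1
            if i + 1 < len(sent):
                bigrams[sent[i:i + 2]] += 1
                right_ctx.setdefault(ch, Counter())[sent[i + 1]] += 1
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    def pmi(a, b):
        # Low PMI between adjacent characters suggests a word boundary.
        if bigrams[a + b] == 0:
            return float("-inf")  # unseen pair: strongest boundary evidence
        p_ab = bigrams[a + b] / n_bi
        return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    def right_entropy(ch):
        # High entropy of the following character suggests `ch` ends a word.
        ctx = right_ctx.get(ch, Counter())
        total = sum(ctx.values())
        return -sum(c / total * math.log(c / total)
                    for c in ctx.values()) if total else 0.0

    return pmi, right_entropy

corpus = ["我喜欢自然语言处理", "自然语言处理很有趣"]
pmi, h_r = boundary_statistics(corpus)
print(pmi("自", "然"), h_r("理"))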