Unsupervised Boundary-Aware Language Model Pretraining for Chinese
Sequence Labeling
Peijie Jiang1 Dingkun Long Yanzhao Zhang Pengjun Xie
Meishan Zhang2 Min Zhang2
1School of New Media and Communication, Tianjin University, China
2Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen)
jzx555@tju.edu.cn,{zhangmeishan,zhangmin2021}@hit.edu.cn
{longdingkun1993,zhangyanzhao00,xpjandy}@gmail.com
Abstract
Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to a high-quality external lexicon, whose items offer explicit boundary information. However, ensuring the quality of such a lexicon requires considerable human effort, a cost that has generally been ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode this information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction in Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT provides consistent improvements on all datasets. In addition, our method complements previous supervised lexicon exploration: further improvements can be achieved when it is integrated with external lexicon information.
1 Introduction
The representative sequence labeling tasks for the Chinese language, such as word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) (Emerson, 2005; Jin and Chen, 2008), are typically performed at the character level in an end-to-end manner (Shen et al., 2016). This paradigm is natural and standard for Chinese word segmentation (CWS), while for Chinese POS tagging and NER it helps reduce error propagation (Sun and Uszkoreit, 2012; Yang et al., 2016; Liu et al., 2019a) compared with word-based counterparts through straightforward modeling.
Recently, all the above tasks have reached state-of-the-art performance with the help of BERT-like pre-trained language models (Yan et al., 2019; Meng et al., 2019). BERT variants such as BERT-wwm (Cui et al., 2021), ERNIE (Sun et al., 2019), ZEN (Diao et al., 2020), and NEZHA (Wei et al., 2019) further improve the vanilla BERT by either using external knowledge or larger-scale training corpora. These improvements also benefit character-level Chinese sequence labeling tasks.
Notably, since the output tags of all these character-level Chinese sequence labeling tasks involve identifying Chinese words or entities (Zhang and Yang, 2018; Yang et al., 2019), prior boundary knowledge can be highly helpful for them. A number of studies propose integrating an external lexicon to enhance their baseline models through feature representation learning (Jia et al., 2020; Tian et al., 2020a; Liu et al., 2021). Moreover, some works suggest injecting similar resources into the pre-trained BERT weights. BERT-wwm (Cui et al., 2021) and ERNIE (Sun et al., 2019) are representatives, which leverage an external lexicon for masked word prediction in Chinese BERT.
Lexicon-based methods have indeed achieved great success for boundary integration. However, they have two major drawbacks. First, lexicon resources are usually constructed manually (Zhang and Yang, 2018; Diao et al., 2020; Jia et al., 2020; Liu et al., 2021), which is expensive and time-consuming, while the quality of the lexicon is critical to these tasks. Second, different tasks and different domains require different lexicons (Jia et al., 2020; Liu et al., 2021): a well-studied lexicon for word segmentation might be inappropriate for NER, and a lexicon for news NER might be problematic for finance NER. Both drawbacks stem from the supervised nature of these lexicon-based enhancements. Thus, it is more desirable to offer boundary information in an unsupervised manner.
In this paper, we propose an unsupervised Boundary-Aware BERT (BABERT), which is achieved by fully exploring the potential of
statistical features mined from a large-scale raw corpus. We extract a set of N-grams (with a predefined fixed N), regardless of whether they are valid words or entities, and then calculate their corresponding unsupervised statistical features, which are closely related to boundary information. We inject this boundary information into an internal layer of a pre-trained BERT, so that the final BABERT model approximates the boundary knowledge softly through its internal representations. BABERT is structurally identical to the original BERT, so we can use it in the same way as standard BERT.
We conduct experiments on three Chinese sequence labeling tasks to demonstrate the effectiveness of our proposed method. Experimental results show that our approach significantly outperforms other Chinese pre-trained language models. In addition, compared with supervised lexicon-based methods, BABERT obtains competitive results on all tasks and achieves further improvements when integrated with external lexicon knowledge. We also conduct extensive analyses to understand our method comprehensively. The pre-trained model and code will be publicly available at http://github.com/modelscope/adaseq/examples/babert.
Our contributions in this paper include the following: 1) we design a method to encode unsupervised statistical boundary information into boundary-aware representations; 2) we propose a new pre-trained language model, BABERT, as a boundary-aware extension of BERT; 3) we verify BABERT on ten benchmark datasets covering three Chinese sequence labeling tasks.
2 Related Work
In the past decades, machine learning has achieved good performance on sequence labeling tasks with statistical information (Bellegarda, 2004; Low et al., 2005; Bouma, 2009). Recently, neural models have led to state-of-the-art results for Chinese sequence labeling (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). In addition, language representation models such as BERT (Devlin et al., 2019) have led to impressive improvements. In particular, many variants of BERT are devoted to integrating boundary information into BERT to improve Chinese sequence labeling (Diao et al., 2020; Jia et al., 2020; Liu et al., 2021).
Statistical Machine Learning
Statistical information is critical for sequence labeling. Previous works attempt to count such information from large corpora and combine it with machine learning methods for sequence labeling (Bellegarda, 2004; Liang, 2005; Bouma, 2009). Peng et al. (2004) perform sequence labeling with a CRF and a statistics-based new word discovery method. Low et al. (2005) introduce a maximum entropy approach for sequence labeling. Liang (2005) utilizes unsupervised statistical information in Markov models and obtains improvements on Chinese NER and CWS.
Pre-trained Language Model
Pre-trained language models are an active research topic in the natural language processing (NLP) community (Devlin et al., 2019; Liu et al., 2019b; Wei et al., 2019; Clark et al., 2020; Diao et al., 2020; Zhang et al., 2021) and have been extensively studied for Chinese sequence labeling. For instance, TENER (Yan et al., 2019) adopts a Transformer encoder to model character-level features for Chinese NER. Glyce (Meng et al., 2019) uses BERT to capture contextual representations combined with glyph embeddings for Chinese sequence labeling.
Lexicon-based Methods
In recent studies, lexicon knowledge has been applied to improve model performance. There are two mainstream categories of lexicon enhancement. The first aims to enhance the original BERT with implicit boundary information through a multi-granularity word masking mechanism. BERT-wwm (Cui et al., 2021) and ERNIE (Sun et al., 2019) are representatives of this category, which mask tokens, entities, and phrases as the mask units in the masked language modeling (MLM) task to learn coarse-grained lexicon information during pre-training. ERNIE-Gram (Xiao et al., 2021), an extension of ERNIE, utilizes statistical boundary information for unsupervised word extraction to support masked word prediction. The second category, which includes ZEN (Diao et al., 2020), EEBERT (Jia et al., 2020), and LEBERT (Liu et al., 2021), exploits the potential of directly injecting lexicon information into BERT via extra modules, leading to better performance but remaining limited by predefined external knowledge. Our work follows the first line and is most similar to ERNIE-Gram. However, different from ERNIE-Gram, we do not discretize the real-valued statistical information extracted from the corpus, but instead adopt a regression approach to leverage the information fully.
[Figure 1 appears here. Panel details visible in the figure: (a) a boundary information extractor performs unsupervised information mining over a raw corpus to build an N-gram statistical dictionary with PMI and LRE values; (b) boundary-aware BERT representation composes LE, PMI, and RE representations from the contextual N-gram sets of an input sentence; (c) boundary-aware BERT learning combines an MLM loss with an MSE loss applied at an internal BERT layer.]

Figure 1: The overall architecture of the boundary-aware pre-trained language model, which consists of three parts: (a) boundary information extractor, (b) boundary-aware representation, and (c) boundary-aware BERT learning. The boundary-aware objective L_BA is defined in Equation 7.
3 Method
Figure 1 shows the overall architecture of our unsupervised boundary-aware pre-trained language model, which mainly consists of three components: 1) a boundary information extractor for unsupervised statistical boundary information mining; 2) boundary-aware representation, which integrates the statistical information at the character level; and 3) boundary-aware BERT learning, which injects the boundary knowledge into an internal layer of BERT. In this section, we first describe these components in detail, and then introduce the fine-tuning method for Chinese sequence labeling.
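As a preview of the third component, the sketch below shows one way such an internal-layer injection could be wired up during pre-training, following the figure: an MSE regression loss on an intermediate BERT layer's hidden states against boundary-feature targets, added to the usual MLM loss. The class name, the supervised layer index, the projection head, and the loss weight are illustrative assumptions, not the paper's exact design; the actual boundary-aware objective is defined in Equation 7.

```python
import torch.nn as nn

class BoundaryAwareAuxLoss(nn.Module):
    """Sketch (not the paper's exact objective): regress an intermediate
    BERT layer's hidden states onto per-character boundary features."""

    def __init__(self, hidden_size: int, boundary_dim: int, layer_index: int = 8):
        super().__init__()
        self.layer_index = layer_index                     # which internal layer to supervise (assumption)
        self.proj = nn.Linear(hidden_size, boundary_dim)   # map hidden states to boundary-feature space
        self.mse = nn.MSELoss()

    def forward(self, all_hidden_states, boundary_targets):
        # all_hidden_states: tuple of (batch, seq_len, hidden) tensors, one per layer
        # boundary_targets:  (batch, seq_len, boundary_dim) unsupervised statistical features
        hidden = all_hidden_states[self.layer_index]
        return self.mse(self.proj(hidden), boundary_targets)

# During pre-training, the total loss would then be, schematically:
#   loss = mlm_loss + lambda_ba * boundary_aux_loss   # lambda_ba is an assumed weight
```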
3.1 Boundary Information Extractor
Statistical boundary information has been shown to have a positive influence on a variety of Chinese NLP tasks (Song and Xia, 2012; Higashiyama et al., 2019; Ding et al., 2020; Xiao et al., 2021). We follow this line of work and design a boundary information extractor to mine statistical information from a large raw corpus in an unsupervised way.

The extractor proceeds in two steps: I) first, we collect all N-grams from the raw corpus to build a dictionary N, in which we count the frequency of each N-gram and filter out low-frequency items; II) second, since word frequency alone is insufficient to represent the flexible boundary relations in Chinese context, we further compute two unsupervised indicators that capture most of the boundary information in the corpus. In the following, we describe these two indicators in detail.
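To make step I concrete, here is a minimal sketch of how such an N-gram statistical dictionary could be built; the function name, the maximum N, the frequency threshold, and the simple probability normalization are illustrative assumptions rather than the paper's exact settings.

```python
from collections import Counter

def build_ngram_dict(sentences, max_n=4, min_freq=5):
    """Collect all character N-grams (1..max_n) from a raw corpus, drop
    low-frequency items, and return relative frequencies p(g).

    max_n and min_freq are assumed values, not the paper's settings.
    """
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    total = sum(counts.values())
    # Normalizing over all collected N-grams is a simplification; it is
    # enough to support the ratio-based PMI and entropy statistics below.
    return {g: c / total for g, c in counts.items() if c >= min_freq}
```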
Pointwise Mutual Information (PMI)
Given an N-gram, we split it into two sub-strings and compute the mutual information (MI) between them as a candidate. We enumerate all sub-string pairs and take the minimum MI as the overall PMI, which estimates the tightness of the N-gram. Let g = {c_1 ... c_m} be an N-gram consisting of m characters; we calculate PMI as:

PMI(g) = \min_{i \in [1:m-1]} \left\{ \frac{p(g)}{p(c_1 \cdots c_i) \cdot p(c_{i+1} \cdots c_m)} \right\},    (1)
where p(·) denotes the probability over the corpus. Note that when m = 1, the corresponding PMI is constantly equal to 1. A higher PMI indicates that the N-gram (e.g., "贝克汉姆 (Beckham)") has an occurrence probability close to that of its sub-string pair (e.g., "贝克 (Beck)" and "汉姆 (Ham)"), i.e., a strong association between the internal sub-strings, which makes the N-gram more likely to be a word/entity. In contrast, a lower PMI means the N-gram (e.g., "克汉 (Kehan)") is likely not a valid word/entity.
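As a concrete illustration of Equation 1, a small sketch follows. It assumes an N-gram probability dictionary like the one built above; the helper name pmi and the smoothing constant are ours, not the paper's.

```python
def pmi(gram, prob):
    """Minimum-over-splits PMI of an N-gram, as in Equation 1.

    prob maps a character string to its corpus probability p(.).
    Single characters return 1 by the paper's convention.
    """
    if len(gram) == 1:
        return 1.0
    eps = 1e-12  # assumed smoothing for unseen sub-strings
    scores = []
    for i in range(1, len(gram)):
        left, right = gram[:i], gram[i:]
        scores.append(prob.get(gram, eps) / (prob.get(left, eps) * prob.get(right, eps)))
    return min(scores)

# e.g., pmi("贝克汉姆", prob) should come out higher than pmi("克汉", prob)
# on a corpus where the full name mostly occurs as a unit.
```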
Left and Right Entropy (LRE)
Given an N-gram g, we first collect a left-adjacent character set S^l_m = {c^l_1, ..., c^l_{n_l}} containing n_l characters. Then, we utilize the conditional probability between g and its left-adjacent characters in S^l_m to compute the left entropy; the right entropy is computed symmetrically over the right-adjacent character set.
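For reference, the following is a minimal sketch of left entropy under its conventional definition (the entropy of the distribution of characters appearing immediately to the left of the N-gram); this is an assumption on our part and may differ in detail from the paper's exact formulation. Right entropy would be computed symmetrically.

```python
import math
from collections import Counter

def left_entropy(gram, sentences):
    """Conventional left entropy of an N-gram: entropy over the characters
    occurring immediately to its left in the corpus (sketch, assumed form)."""
    left_counts = Counter()
    for sent in sentences:
        start = sent.find(gram)
        while start != -1:
            if start > 0:
                left_counts[sent[start - 1]] += 1
            start = sent.find(gram, start + 1)
    total = sum(left_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in left_counts.values())
```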