Revisiting and Advancing Chinese Natural Language Understanding with
Accelerated Heterogeneous Knowledge Pre-training
Taolin Zhang1,2, Junwei Dong2,3, Jianing Wang1,2, Chengyu Wang2, Ang Wang2,
Yinghui Liu2, Jun Huang2, Yong Li2, Xiaofeng He1
1East China Normal University, Shanghai, China
2Alibaba Group, Hangzhou, China
3Chongqing University, Chongqing, China
zhangtl0519@gmail.com,chengyu.wcy@alibaba-inc.com
Abstract
Recently, knowledge-enhanced pre-trained language models (KEPLMs) have improved context-aware representations by learning from structured relations in knowledge graphs and/or linguistic knowledge from syntactic or dependency analysis. Unlike English, there is a lack of high-performing open-source Chinese KEPLMs in the natural language processing (NLP) community to support various language understanding applications. In this paper, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes, namely CKBERT (Chinese knowledge-enhanced BERT). Specifically, both relational and linguistic knowledge is effectively injected into CKBERT based on two novel pre-training tasks, i.e., linguistic-aware masked language modeling and contrastive multi-hop relation modeling. Based on the above two pre-training paradigms and our in-house implemented TorchAccelerator, we have pre-trained base (110M), large (345M) and huge (1.3B) versions of CKBERT efficiently on GPU clusters. Experiments demonstrate that CKBERT outperforms strong baselines for Chinese over various benchmark NLP tasks and across different model sizes.¹
1 Introduction
Pre-trained Language Models (PLMs) such as BERT (Devlin et al., 2019) are pre-trained by self-supervised learning on large-scale text corpora to capture the rich semantic knowledge of words (Li et al., 2021; Gong et al., 2022), improving various downstream NLP tasks significantly (He et al., 2020; Xu et al., 2021; Chang et al., 2021). Although these PLMs store much internal knowledge (Petroni et al., 2019, 2020), they can hardly understand external background knowledge about the world, such as factual and linguistic knowledge (Colon-Hernandez et al., 2021; Cui et al., 2021; Lai et al., 2021).

* Corresponding author.
¹ All the code and model checkpoints have been released to the public in the EasyNLP framework (Wang et al., 2022). URL: https://github.com/alibaba/EasyNLP.
In the literature, most approaches to knowledge injection can be divided into two categories: relational knowledge and linguistic knowledge. (1) Relational knowledge-based approaches inject entity and relation representations from Knowledge Graphs (KGs) trained by knowledge embedding algorithms (Zhang et al., 2019; Peters et al., 2019), or convert triples into sentences for joint pre-training (Liu et al., 2020; Sun et al., 2020). (2) Linguistic knowledge-based approaches extract semantic units from pre-training sentences, such as part-of-speech tags and constituent and dependency syntactic parses, and feed all linguistic information into various transformer-based architectures (Zhou et al., 2020; Lai et al., 2021). We observe three potential drawbacks. (1) These approaches generally utilize a single source of knowledge (e.g., inherent linguistic knowledge), ignoring important knowledge from other sources such as relational knowledge from KGs (Su et al., 2021). (2) Training large-scale KEPLMs from scratch requires high-memory computing devices and is time-consuming, which brings significant computational burdens for users (Zhang et al., 2021, 2022). (3) Most of these models are pre-trained in English only; there is a lack of powerful KEPLMs for understanding other languages (Lee et al., 2020; Pérez et al., 2021).
To overcome the above problems, we release a series of Chinese KEPLMs named CKBERT (Chinese knowledge-enhanced BERT), with heterogeneous knowledge sources injected. We particularly focus on Chinese as it is one of the most widely spoken languages other than English. The CKBERT models are pre-trained by two well-designed pre-training tasks, as follows:
Linguistic-aware Masked Language Modeling (LMLM): LMLM substantially extends Masked Language Modeling (MLM) (Devlin et al., 2019) by introducing two kinds of linguistic tokens derived from dependency syntactic parsing and semantic role labeling. We also insert unique markers for each linguistic component among contiguous tokens. The goal of LMLM is to predict both randomly selected tokens and linguistic tokens masked in the pre-training sentences.
Contrastive Multi-hop Relation Modeling (CMRM): We sample fine-grained subgraphs from a large-scale Chinese KG via multi-hop relations to compensate for the missing background knowledge of target entities. Specifically, we construct positive triples for matched target entities by retrieving one-hop entities in the corresponding subgraphs, while negative triples are sampled from unrelated multi-hop entities along relation paths in the KG. The CMRM task pulls the semantics of similar entities close and pushes away those with irrelevant semantics (a minimal contrastive-loss sketch is given after this list).
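The exact form of the CMRM objective is not spelled out in this overview; the PyTorch snippet below is only an InfoNCE-style sketch of the pull/push idea described above, where the temperature, the encoder, and the triple representations are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cmrm_contrastive_loss(anchor, positives, negatives, temperature=0.05):
    """InfoNCE-style sketch: pull one-hop (positive) entity representations
    toward the target entity, push unrelated multi-hop (negative) ones away.

    anchor:    (D,)   representation of the target entity
    positives: (P, D) representations of one-hop neighbour entities
    negatives: (N, D) representations of unrelated multi-hop entities
    """
    anchor = F.normalize(anchor, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)

    pos_sim = pos @ anchor / temperature                      # (P,)
    neg_sim = neg @ anchor / temperature                      # (N,)
    # Each positive is contrasted against the full set of negatives.
    logits = torch.cat([pos_sim.unsqueeze(1),
                        neg_sim.unsqueeze(0).expand(pos_sim.size(0), -1)], dim=1)
    labels = torch.zeros(pos_sim.size(0), dtype=torch.long)   # index 0 = the positive
    return F.cross_entropy(logits, labels)

# Toy usage with random 768-dimensional entity representations.
loss = cmrm_contrastive_loss(torch.randn(768), torch.randn(3, 768), torch.randn(8, 768))
```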
Based on the above heterogeneous knowledge pre-training tasks, we produce various sizes of CKBERT models to meet the inference time and accuracy requirements of different real-world scenarios (Brown et al., 2020; Chowdhery et al., 2022), including base (110M), large (345M) and huge (1.3B). The models are pre-trained using our in-house implemented TorchAccelerator, which effectively transforms PyTorch eager execution into graph execution on distributed GPU clusters, boosting the training speed by 40% per sample with our advanced compiler technique based on Accelerated Linear Algebra (XLA). In the experiments, we compare CKBERT against strong baseline PLMs and KEPLMs on various Chinese general and knowledge-related NLP tasks. The results demonstrate the improvements of CKBERT over SoTA models.
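TorchAccelerator itself is in-house and its API is not shown in this paper. As a rough illustration of eager-to-graph lowering via XLA, the open-source torch_xla package can serve as a stand-in; the snippet below is a minimal training-loop sketch under that assumption (toy model, synthetic data), not the actual TorchAccelerator interface.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # open-source XLA bridge, used here as a stand-in

device = xm.xla_device()                       # operations on this device are traced lazily
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 768, device=device)    # synthetic batch standing in for token features
    y = torch.randint(0, 2, (32,), device=device)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    xm.optimizer_step(optimizer)               # applies the update and compiles/executes the XLA graph
```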
2 Related Work
We briefly summarize the related work on the fol-
lowing two aspects: PLMs and KEPLMs.
2.1 PLMs
Following BERT (Devlin et al., 2019), many PLMs have been proposed to improve performance in various NLP tasks. Several approaches extend BERT by employing novel token-level and sentence-level pre-training tasks; notable PLMs for Chinese NLU downstream tasks include ERNIE-Baidu (Sun et al., 2019), MacBERT (Cui et al., 2020) and PERT (Cui et al., 2022). Other models boost performance by changing the internal encoder architectures. For example, XLNet (Yang et al., 2019) utilizes Transformer-XL (Dai et al., 2019) to encode long sequences and pre-trains over permutations of language tokens. Sparse self-attention (Cui et al., 2019) replaces the self-attention mechanism with more interpretable attention units. Yet other PLMs, such as MT-DNN (Liu et al., 2019), combine self-supervised pre-training with multi-task supervised learning to improve performance on various GLUE tasks (Wang et al., 2019).
2.2 KEPLMs
These models use structured knowledge or linguistic semantics to enhance the language understanding abilities of PLMs. We group recent KEPLMs into the following four types. (1) Knowledge enhancement by linguistic semantics. These works use the linguistic information already available in the pre-training sentences to enhance the understanding ability of PLMs. Lattice-BERT (Lai et al., 2021) pre-trains a Chinese PLM over a word lattice (Buckman and Neubig, 2018) structure to exploit multi-granularity inputs. (2) Knowledge enhancement by entity embeddings. For example, ERNIE-THU (Zhang et al., 2019) injects entity embeddings into contextual representations via knowledge encoders stacked with information fusion modules. (3) Knowledge enhancement by entity descriptions. These approaches learn entity embeddings from knowledge descriptions. For example, pre-training corpora and entity descriptions in KEPLER (Wang et al., 2021) are encoded into a unified semantic space within the same PLM. (4) Knowledge enhancement by texts converted from triples. K-BERT (Liu et al., 2020) and CoLAKE (Sun et al., 2020) convert relation triples into texts and insert them into training samples without using pre-trained embeddings. In this paper, we argue that aggregating heterogeneous knowledge information can further benefit the context-aware representations of PLMs.
3 Model
In this section, we elaborate on the techniques of the proposed CKBERT model.
[Figure 1 graphic: the pre-training data processing pipeline, showing a pre-training sentence reconstructed with [SDP]/[DEP] markers (AGT: agent, ADV: adverbial), the linguistic masked tokens and contrastive relation triples (positive and negative samples) derived from it, and the two pre-training tasks LMLM and CMRM. The example sentence translates to "We all know that more practice is the only way to truly improve oral pronunciation."]
Figure 1: Model overview. The LMLM task is not only able to perform random masked token prediction (similar
to BERT) but also to predict masked linguistic-aware tokens. The CMRM task injects external relation triples into
PLMs through neighboring multi-hop relations. (Best viewed in color.)
The main architecture of CKBERT is presented in Figure 1.
3.1 Model Architecture
CKBERT accepts a sequence of $M$ WordPiece tokens (Wu et al., 2016), $(x_1, x_2, \ldots, x_M)$, as input, and computes the $D$-dimensional contextual representations $\mathbf{H} \in \mathbb{R}^{M \times D}$ by successively stacking $N$ transformer encoder layers. We do not modify the architecture here, to guarantee that CKBERT can be seamlessly integrated into any industrial application that BERT supports, with better performance.²
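Since the encoder is unchanged, the released checkpoints can be loaded like any BERT model. The snippet below is a minimal sketch using Hugging Face Transformers, with bert-base-chinese as a stand-in checkpoint name; the actual CKBERT checkpoints are distributed via the EasyNLP framework (see footnote 1).

```python
import torch
from transformers import BertModel, BertTokenizerFast

# "bert-base-chinese" is only a stand-in here; CKBERT keeps the same interface
# because the transformer encoder architecture is left unmodified.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("大家都知道多多实战，才能真正改善口语发音。", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # shape: (1, M, D), with D = 768 for the base size
print(hidden.shape)
```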
3.2 Linguistic-aware Masked Language Modeling (LMLM)
In BERT pre-training, 15% of all token positions are randomly masked for prediction. However, randomly masked tokens may be unimportant units such as conjunctions and prepositions (Clark et al., 2019; Hao et al., 2021). We reconstruct the input sentences and mask more tokens based on linguistic knowledge, so that CKBERT can better understand the semantics of important tokens in pre-training sentences. Specifically, we use the following three steps to mask the linguistic input units:
Recognizing Linguistic Tokens: We first use an off-the-shelf tool³ to recognize important units in pre-training sentences, covering dependency grammar and semantic dependency parsing. The extracted relations serve as important sources of linguistic knowledge, including "subject-verb", "verb-object" and "adverbial" for dependency grammar, and "non-agent" for semantic dependency parsing.
Reconstructing Input Sentences: In addition to the original input form, based on the subjects and objects of the extracted linguistic relations, we insert special identifiers around each lexical unit between word spans to give explicit boundary information for model pre-training. For example, we add [DEP] and [/DEP] for dependency grammar tokens, and [SDP] and [/SDP] for semantic dependency parsing tokens.
Choosing Masked Tokens: We choose 15% of the token positions from the reconstructed input sentence for masking, using the special token [MASK]. Among these positions, we assign 40% to randomly selected tokens and the rest to linguistic tokens. Note that the special identifiers ([DEP], [/DEP], [SDP] and [/SDP]) are also treated as normal tokens for masking, so the model needs to predict these boundary identifiers as well. A minimal sketch of this selection procedure is given below.

² Without loss of generality, we focus on the transformer encoder architecture only; our work can also be extended to other model architectures with slight modifications.
³ http://ltp.ai/
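The following Python snippet is only an illustrative sketch of this selection step under the stated 15%/40% ratios; the helper names, the way linguistic positions are supplied, and the handling of edge cases are assumptions rather than the released implementation.

```python
import random

def choose_masked_positions(tokens, linguistic_positions,
                            mask_ratio=0.15, random_share=0.4, seed=None):
    """Pick token positions to replace with [MASK] for LMLM.

    `linguistic_positions` is assumed to index tokens inside [DEP]...[/DEP]
    or [SDP]...[/SDP] spans, with the boundary markers themselves included,
    since they are eligible for masking like any normal token.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    n_random = int(n_mask * random_share)        # ~40%: ordinary tokens
    n_linguistic = n_mask - n_random             # the rest: linguistic tokens

    linguistic = [i for i in linguistic_positions if i < len(tokens)]
    ordinary = [i for i in range(len(tokens)) if i not in set(linguistic)]

    chosen = rng.sample(ordinary, min(n_random, len(ordinary)))
    chosen += rng.sample(linguistic, min(n_linguistic, len(linguistic)))
    return sorted(chosen)

# Toy usage on a reconstructed sentence fragment with boundary markers.
tokens = ["大", "家", "都", "[SDP]", "知", "道", "[/SDP]", "多", "多", "实", "战"]
print(choose_masked_positions(tokens, linguistic_positions=[3, 4, 5, 6], seed=0))
```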