Revisiting and Advancing Chinese Natural Language Understanding with
Accelerated Heterogeneous Knowledge Pre-training
Taolin Zhang1,2, Junwei Dong2,3, Jianing Wang1,2, Chengyu Wang2, Ang Wang2,
Yinghui Liu2, Jun Huang2, Yong Li2, Xiaofeng He1
1East China Normal University, Shanghai, China
2Alibaba Group, Hangzhou, China
3Chongqing University, Chongqing, China
zhangtl0519@gmail.com,chengyu.wcy@alibaba-inc.com
Abstract
Recently, knowledge-enhanced pre-trained language models (KEPLMs) have improved context-aware representations by learning from structured relations in knowledge graphs and/or linguistic knowledge from syntactic or dependency analysis. Unlike English, there is a lack of high-performing open-source Chinese KEPLMs in the natural language processing (NLP) community to support various language understanding applications. In this paper, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes, namely CKBERT (Chinese knowledge-enhanced BERT). Specifically, both relational and linguistic knowledge is effectively injected into CKBERT based on two novel pre-training tasks, i.e., linguistic-aware masked language modeling and contrastive multi-hop relation modeling. Based on the above two pre-training paradigms and our in-house implemented TorchAccelerator, we have pre-trained base (110M), large (345M) and huge (1.3B) versions of CKBERT efficiently on GPU clusters. Experiments demonstrate that CKBERT outperforms strong baselines for Chinese over various benchmark NLP tasks and across different model sizes.¹
1 Introduction
Pre-trained Language Models (PLMs) such as BERT (Devlin et al., 2019) are pre-trained by self-supervised learning on large-scale text corpora to capture the rich semantic knowledge of words (Li et al., 2021; Gong et al., 2022), improving various downstream NLP tasks significantly (He et al., 2020; Xu et al., 2021; Chang et al., 2021). Although these PLMs store much internal knowledge (Petroni et al., 2019, 2020), they can hardly understand external background knowledge about the world, such as factual and linguistic knowledge (Colon-Hernandez et al., 2021; Cui et al., 2021; Lai et al., 2021).

* Corresponding author.
¹ All the code and model checkpoints have been released to the public in the EasyNLP framework (Wang et al., 2022). URL: https://github.com/alibaba/EasyNLP.
In the literature, most approaches to knowledge injection can be divided into two categories: relational knowledge and linguistic knowledge. (1) Relational knowledge-based approaches inject entity and relation representations from Knowledge Graphs (KGs) trained by knowledge embedding algorithms (Zhang et al., 2019; Peters et al., 2019), or convert triples into sentences for joint pre-training (Liu et al., 2020; Sun et al., 2020). (2) Linguistic knowledge-based approaches extract semantic units from pre-training sentences, such as part-of-speech tags and constituent and dependency syntactic parses, and feed all linguistic information into various transformer-based architectures (Zhou et al., 2020; Lai et al., 2021). We observe three potential drawbacks. (1) These approaches generally utilize a single source of knowledge (e.g., inherent linguistic knowledge), ignoring important knowledge from other sources such as relational knowledge from KGs (Su et al., 2021). (2) Training large-scale KEPLMs from scratch requires high-memory computing devices and is time-consuming, which brings significant computational burdens for users (Zhang et al., 2021, 2022). (3) Most of these models are pre-trained in English only; there is a lack of powerful KEPLMs for understanding other languages (Lee et al., 2020; Pérez et al., 2021).
To overcome the above problems, we release a series of Chinese KEPLMs named CKBERT (Chinese knowledge-enhanced BERT), with heterogeneous knowledge sources injected. We particularly focus on Chinese as it is one of the most widely spoken languages other than English. The CKBERT models are pre-trained by two well-designed pre-training tasks, as follows:
Linguistic-aware Masked Language Modeling (LMLM): LMLM substantially extends Masked Language Modeling (MLM) (Devlin et al., 2019) by introducing two kinds of linguistic tokens derived from dependency syntactic parsing and semantic role labeling. We also insert unique markers for each linguistic component among contiguous tokens. The goal of LMLM is to predict both randomly selected tokens and linguistic tokens masked in the pre-training sentences.
Contrastive Multi-hop Relation Modeling (CMRM): We sample fine-grained subgraphs from a large-scale Chinese KG via multi-hop relations to compensate for the missing background knowledge of target entities. Specifically, we construct positive triples for matched target entities by retrieving one-hop entities in the corresponding subgraphs, while negative triples are sampled from unrelated multi-hop entities along relation paths in the KG. The CMRM task pulls the semantics of similar entities close and pushes away those with irrelevant semantics (a minimal contrastive-loss sketch is given after this list).
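The exact form of the CMRM objective is not spelled out in this overview; the PyTorch snippet below is only an InfoNCE-style sketch of the pull/push idea described above, where the temperature, the encoder, and the triple representations are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cmrm_contrastive_loss(anchor, positives, negatives, temperature=0.05):
    """InfoNCE-style sketch: pull one-hop (positive) entity representations
    toward the target entity, push unrelated multi-hop (negative) ones away.

    anchor:    (D,)   representation of the target entity
    positives: (P, D) representations of one-hop neighbour entities
    negatives: (N, D) representations of unrelated multi-hop entities
    """
    anchor = F.normalize(anchor, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(negatives, dim=-1)

    pos_sim = pos @ anchor / temperature                      # (P,)
    neg_sim = neg @ anchor / temperature                      # (N,)
    # Each positive is contrasted against the full set of negatives.
    logits = torch.cat([pos_sim.unsqueeze(1),
                        neg_sim.unsqueeze(0).expand(pos_sim.size(0), -1)], dim=1)
    labels = torch.zeros(pos_sim.size(0), dtype=torch.long)   # index 0 = the positive
    return F.cross_entropy(logits, labels)

# Toy usage with random 768-dimensional entity representations.
loss = cmrm_contrastive_loss(torch.randn(768), torch.randn(3, 768), torch.randn(8, 768))
```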
Based on the above heterogeneous knowledge pre-training tasks, we produce various sizes of CKBERT models to meet the inference time and accuracy requirements of different real-world scenarios (Brown et al., 2020; Chowdhery et al., 2022), including base (110M), large (345M) and huge (1.3B). The models are pre-trained using our in-house implemented TorchAccelerator, which effectively transforms PyTorch eager execution into graph execution on distributed GPU clusters, boosting the training speed by 40% per sample with our advanced compiler technique based on Accelerated Linear Algebra (XLA). In the experiments, we compare CKBERT against strong baseline PLMs and KEPLMs on various Chinese general and knowledge-related NLP tasks. The results demonstrate the improvements of CKBERT over SoTA models.
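TorchAccelerator itself is in-house and its API is not shown in this paper. As a rough illustration of eager-to-graph lowering via XLA, the open-source torch_xla package can serve as a stand-in; the snippet below is a minimal training-loop sketch under that assumption (toy model, synthetic data), not the actual TorchAccelerator interface.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # open-source XLA bridge, used here as a stand-in

device = xm.xla_device()                       # operations on this device are traced lazily
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 768, device=device)    # synthetic batch standing in for token features
    y = torch.randint(0, 2, (32,), device=device)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    xm.optimizer_step(optimizer)               # applies the update and compiles/executes the XLA graph
```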
2 Related Work
We briefly summarize the related work on the fol-
lowing two aspects: PLMs and KEPLMs.
2.1 PLMs
Following BERT (Devlin et al., 2019), many PLMs have been proposed to improve performance in various NLP tasks. Several approaches extend BERT by employing novel token-level and sentence-level pre-training tasks; notable PLMs for Chinese NLU downstream tasks include ERNIE-Baidu (Sun et al., 2019), MacBERT (Cui et al., 2020) and PERT (Cui et al., 2022). Other models boost performance by changing the internal encoder architectures. For example, XLNet (Yang et al., 2019) utilizes Transformer-XL (Dai et al., 2019) to encode long sequences and pre-trains over permutations of language tokens. Sparse self-attention (Cui et al., 2019) replaces the self-attention mechanism with more interpretable attention units. Yet other PLMs, such as MT-DNN (Liu et al., 2019), combine self-supervised pre-training with multi-task supervised learning to improve performance on various GLUE tasks (Wang et al., 2019).
2.2 KEPLMs
These models use structured knowledge or linguistic semantics to enhance the language understanding abilities of PLMs. We group recent KEPLMs into the following four types. (1) Knowledge enhancement by linguistic semantics. These works use the linguistic information already available in the pre-training sentences to enhance the understanding ability of PLMs. Lattice-BERT (Lai et al., 2021) pre-trains a Chinese PLM over a word lattice (Buckman and Neubig, 2018) structure to exploit multi-granularity inputs. (2) Knowledge enhancement by entity embeddings. For example, ERNIE-THU (Zhang et al., 2019) injects entity embeddings into contextual representations via knowledge encoders stacked with information fusion modules. (3) Knowledge enhancement by entity descriptions. These approaches learn entity embeddings from knowledge descriptions. For example, pre-training corpora and entity descriptions in KEPLER (Wang et al., 2021) are encoded into a unified semantic space within the same PLM. (4) Knowledge enhancement by texts converted from triples. K-BERT (Liu et al., 2020) and CoLAKE (Sun et al., 2020) convert relation triples into texts and insert them into training samples without using pre-trained embeddings. In this paper, we argue that aggregating heterogeneous knowledge information can further benefit the context-aware representations of PLMs.
3 Model
In this section, we elaborate on the techniques of the proposed CKBERT model.
[Figure 1 graphic: the pre-training data processing pipeline, showing a pre-training sentence reconstructed with [SDP]/[DEP] markers (AGT: agent, ADV: adverbial), the linguistic masked tokens and contrastive relation triples (positive and negative samples) derived from it, and the two pre-training tasks LMLM and CMRM. The example sentence translates to "We all know that more practice is the only way to truly improve oral pronunciation."]
Figure 1: Model overview. The LMLM task is not only able to perform random masked token prediction (similar
to BERT) but also to predict masked linguistic-aware tokens. The CMRM task injects external relation triples into
PLMs through neighboring multi-hop relations. (Best viewed in color.)
The main architecture of CKBERT is presented in Figure 1.
3.1 Model Architecture
CKBERT accepts a sequence of $M$ WordPiece tokens (Wu et al., 2016), $(x_1, x_2, \ldots, x_M)$, as input, and computes the $D$-dimensional contextual representations $\mathbf{H} \in \mathbb{R}^{M \times D}$ by successively stacking $N$ transformer encoder layers. We do not modify the architecture here, to guarantee that CKBERT can be seamlessly integrated into any industrial application that BERT supports, with better performance.²
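Since the encoder is unchanged, the released checkpoints can be loaded like any BERT model. The snippet below is a minimal sketch using Hugging Face Transformers, with bert-base-chinese as a stand-in checkpoint name; the actual CKBERT checkpoints are distributed via the EasyNLP framework (see footnote 1).

```python
import torch
from transformers import BertModel, BertTokenizerFast

# "bert-base-chinese" is only a stand-in here; CKBERT keeps the same interface
# because the transformer encoder architecture is left unmodified.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("大家都知道多多实战，才能真正改善口语发音。", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # shape: (1, M, D), with D = 768 for the base size
print(hidden.shape)
```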
3.2 Linguistic-aware Masked Language Modeling (LMLM)
In BERT pre-training, 15% of all token positions are randomly masked for prediction. However, randomly masked tokens may be unimportant units such as conjunctions and prepositions (Clark et al., 2019; Hao et al., 2021). We reconstruct the input sentences and mask more tokens based on linguistic knowledge, so that CKBERT can better understand the semantics of important tokens in pre-training sentences. Specifically, we use the following three steps to mask the linguistic input units:
Recognizing Linguistic Tokens: We first use an off-the-shelf tool³ to recognize important units in pre-training sentences, covering dependency grammar and semantic dependency parsing. The extracted relations serve as important sources of linguistic knowledge, including "subject-verb", "verb-object" and "adverbial" for dependency grammar, and "non-agent" for semantic dependency parsing.
Reconstructing Input Sentences: In addition to the original input form, based on the subjects and objects of the extracted linguistic relations, we insert special identifiers around each lexical unit between word spans to give explicit boundary information for model pre-training. For example, we add [DEP] and [/DEP] for dependency grammar tokens, and [SDP] and [/SDP] for semantic dependency parsing tokens.
Choosing Masked Tokens: We choose 15% of the token positions from the reconstructed input sentence for masking, using the special token [MASK]. Among these positions, we assign 40% to randomly selected tokens and the rest to linguistic tokens. Note that the special identifiers ([DEP], [/DEP], [SDP] and [/SDP]) are also treated as normal tokens for masking, so the model needs to predict these boundary identifiers as well. A minimal sketch of this selection procedure is given below.

² Without loss of generality, we focus on the transformer encoder architecture only; our work can also be extended to other model architectures with slight modifications.
³ http://ltp.ai/
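The following Python snippet is only an illustrative sketch of this selection step under the stated 15%/40% ratios; the helper names, the way linguistic positions are supplied, and the handling of edge cases are assumptions rather than the released implementation.

```python
import random

def choose_masked_positions(tokens, linguistic_positions,
                            mask_ratio=0.15, random_share=0.4, seed=None):
    """Pick token positions to replace with [MASK] for LMLM.

    `linguistic_positions` is assumed to index tokens inside [DEP]...[/DEP]
    or [SDP]...[/SDP] spans, with the boundary markers themselves included,
    since they are eligible for masking like any normal token.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    n_random = int(n_mask * random_share)        # ~40%: ordinary tokens
    n_linguistic = n_mask - n_random             # the rest: linguistic tokens

    linguistic = [i for i in linguistic_positions if i < len(tokens)]
    ordinary = [i for i in range(len(tokens)) if i not in set(linguistic)]

    chosen = rng.sample(ordinary, min(n_random, len(ordinary)))
    chosen += rng.sample(linguistic, min(n_linguistic, len(linguistic)))
    return sorted(chosen)

# Toy usage on a reconstructed sentence fragment with boundary markers.
tokens = ["大", "家", "都", "[SDP]", "知", "道", "[/SDP]", "多", "多", "实", "战"]
print(choose_masked_positions(tokens, linguistic_positions=[3, 4, 5, 6], seed=0))
```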