Language Model Pre-Training with Sparse Latent Typing
Liliang Ren1, Zixuan Zhang1, Han Wang2, Clare R. Voss3, Chengxiang Zhai1, Heng Ji1
1University of Illinois at Urbana-Champaign, 2Amazon Alexa, 3US Army Research Laboratory
{liliang3, zixuan11, czhai, hengji}@illinois.edu
wnghn@amazon.com, clare.r.voss.civ@army.mil
Abstract
Modern large-scale Pre-trained Language Models (PLMs) have achieved tremendous success on a wide range of downstream tasks. However, most LM pre-training objectives focus only on text reconstruction and have not sought to learn latent-level interpretable representations of sentences. In this paper, we push language models to obtain a deeper understanding of sentences by proposing a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. Moreover, the language model pre-trained with such an objective also significantly improves Information Extraction related downstream tasks in both supervised and few-shot settings. Our code is publicly available at https://github.com/renll/SparseLT.
1 Introduction
Transformer-based Pre-trained Language Models (PLMs) have achieved significant success on a wide range of NLP tasks. However, typical pre-training objectives for PLMs only focus on teaching the model to directly reconstruct text-level words or sentences, and have not sought to obtain deeper sentence understanding by learning latent-level interpretable representations. For example, transformer-decoder models like the OpenAI GPT series (Radford et al., 2018, 2019; Brown et al., 2020) adopt the task of language modeling for pre-training, and transformer-encoder models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are trained by predicting the masked tokens within a sentence. Both of these training objectives merely train the models to recover masked tokens or predict the next words or sentences, while not learning latent-level representations of sentences that could be potentially useful for both better language understanding and downstream tasks.

* Equal contribution. Listing order is random. Liliang proposed and implemented the architecture designs and the training objectives of Sparse Latent Typing (SLT), and he also conducted extensive experiments for pre-training, few-shot evaluation, and the analyses. Zixuan designed the language model pre-training pipeline for SLT, built the initial training codebase, and conducted the experiments for pre-training and supervised evaluation. Both authors initially came up with the same project goal of encouraging the model to sparsely select sentence-level keywords during pre-training.

Figure 1: A general illustration of our approach to teach a pre-trained language model to extract sentence-level keywords with latent type representations in a completely self-supervised manner.
Pre-training a language model to learn latent representations is extremely hard. First, there are no ground-truth labels for the latent representations that could be used for reliable supervised learning. During pre-training, the model is only given an unlabeled text corpus over which to identify latent representations such as sentence-level keywords and structures. This means the training process must be strictly self-supervised (Rush et al., 2018). Furthermore, to be interpretable, the latent representations for natural language texts are supposed to be discrete, which further complicates the design of a completely differentiable training framework.
To push language models to learn deeper understandings of sentences, in this paper we propose a novel pre-training framework, Sparse Latent Typing, that enables the language model to sparsely extract sentence-level keywords with meaningful
latent types. We tackle all of the above-mentioned challenges, and our framework is fully differentiable and completely self-supervised. As shown in Figure 1, given an input sentence from the pre-training corpus, we introduce a latent typing mechanism that jointly selects and classifies keywords from the sentence into a set of randomly initialized latent types. We implement this latent classification model based on Gumbel Sampling (Jang et al., 2017) to make sure the overall pre-training framework is differentiable. Since there are no ground-truth labels available for the selected keywords and latent types, we incorporate a one-layer transformer decoder into the training pipeline to map the fused token and latent type representations back to the original sentence, and use the sentence reconstruction loss to control for adequate usefulness of the latent representations. Our approach provides the decoder model with a shortcut to directly access the encoded token representations, so that the latent representation for each input token can be learned as an auxiliary type representation. For the pre-training objectives, in addition to minimizing the sentence reconstruction error, we also introduce a novel typing sparsity loss to minimize the number of token representations selected for latent typing. A KL-divergence based diversity loss is also proposed to encourage a diverse selection of the latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. Moreover, the language model pre-trained with such an objective also significantly improves Information Extraction related downstream tasks in both supervised and few-shot settings.
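As a rough illustration of how these three training signals (reconstruction, typing sparsity, and type diversity) could be combined into a single objective, consider the minimal sketch below. The weighting coefficients alpha and beta are hypothetical placeholders, not values from the paper.

```python
import torch

def combined_pretraining_loss(rec_loss: torch.Tensor,
                              typing_sparsity_loss: torch.Tensor,
                              type_diversity_loss: torch.Tensor,
                              alpha: float = 1.0,
                              beta: float = 1.0) -> torch.Tensor:
    """Sentence reconstruction + typing sparsity + KL-based type diversity.
    The relative weights alpha/beta are illustrative assumptions only."""
    return rec_loss + alpha * typing_sparsity_loss + beta * type_diversity_loss
```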
In summary, our contributions are three-fold:
• We propose a fully differentiable language model pre-training framework that enables the model to sparsely extract sentence-level keywords with latent types in a completely self-supervised manner.
• We provide comprehensive analysis and interpretation of our experimental results, showing that the pre-trained model is able to extract meaningful latent type representations.
• Extensive experiments on IE-related downstream tasks demonstrate that our proposed pre-training framework can significantly advance the state of the art.
2 Related Work
Knowledge-Enhanced Language Models
As pretrained language models (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Brown et al., 2020; Lewis et al., 2020a; Raffel et al., 2020) are achieving great success on downstream NLP tasks, many research studies focus on how to make these PLMs more knowledgeable. Previous studies (Peters et al., 2019; Zhang et al., 2019; Xiong et al., 2020; He et al., 2020; Yamada et al., 2020; Qin et al., 2021; Wang et al., 2021) either design entity- and relation-aware pre-training objectives or modify the model architecture to fuse both text and entity information. However, all of these previous approaches utilize large-scale, human-annotated, semi-structured external resources (e.g., Wikipedia). In comparison, our method is completely self-supervised and only needs a text corpus for pre-training; it focuses instead on encouraging the model to learn knowledge clusters at a latent level.
Latent Structure Learning
There are also several studies (Liu et al., 2021; Subramani et al., 2022) that incorporate latent structure learning into language model pre-training. In particular, Montero et al. (2021) also use a transformer decoder layer to reconstruct the original sentence and provide training signals. However, instead of learning coarse-grained sentence representations, we focus on learning fine-grained latent type representations that are interpretable and useful at the token level. To this end, we propose a series of novel training objectives and architecture designs that facilitate a sparse selection and typing of the token representations in the latent space.
Information Extraction
Our approach of detecting sentence-level keywords with latent types is inspired by Information Extraction (IE) (Cowie and Lehnert, 1996), an essential NLP task that aims to extract knowledge from texts. Although IE includes a wide range of tasks varying in what to extract (entities, relations, events) and where to extract it from (sentences, documents, corpora), typical IE frameworks usually include two essential steps: 1) Selection: selecting the most task-relevant units from the inputs, and 2) Classification: assigning each of these units a correct type label. Such a select-and-classify framework is common to several IE tasks, including entity extraction, event detection, and event argument extraction. Accordingly, our approach follows a similar selection-classification scheme to incorporate word selection and latent typing into pre-training.
3 Problem Formulation
Given a text corpus $\mathcal{D}$ composed of text sentences $\mathcal{S}$, we use $s = \{w_1, \cdots, w_N\}$, $s \sim \mathcal{S}$, to represent a sentence consisting of $N$ tokens. Assuming a text encoder $f: \mathcal{S} \mapsto \mathcal{X}$ that takes the sentence $s$ as input and outputs the token representations $x_1, \cdots, x_N$, our latent type classifier $h: \mathcal{X} \mapsto (\mathcal{Z}, \hat{\mathcal{X}})$ then selects a subset of token representations $\hat{x}_1, \cdots, \hat{x}_T$ ($T \leq N$) and classifies them into latent type representations $z_1, \cdots, z_T$. Each of the token types $z_i$ is selected from a latent embedding space $\mathcal{C}$ consisting of $V_c = |\mathcal{C}|$ different latent vectors. The text decoder $g: (\mathcal{Z}, \hat{\mathcal{X}}) \mapsto \mathcal{S}$ then reconstructs the original sentence $s$ through the pair of latent types and selected token representations $(Z, \hat{X})$.
The objective of sparse latent typing is to find pairs of latent types and token representations that are as compact as possible but still contain the necessary information for reconstructing the original input sentences. Formally, we want to minimize the following joint objective,
$$\min_{\theta_{f,h,g}} \quad T,\; \mathcal{L}_{\mathrm{rec}}(f,h,g),\; \mathcal{L}_{\mathrm{KL}}(f,h) \tag{1}$$
with:
$$\mathcal{L}_{\mathrm{rec}}(f,h,g) = \mathbb{E}_{s\sim\mathcal{S}}\left[-\log p_{g}\left(s \mid h(f(s))\right)\right],$$
$$\mathcal{L}_{\mathrm{KL}}(f,h) = D_{\mathrm{KL}}\left(p_{h}(z \mid f(s)) \,\|\, p(z)\right),$$
where $T$ is the number of selected token representations, $p(z)$ is a prior distribution over the latent types, and $D_{\mathrm{KL}}(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence. The reconstruction loss and the KL term in our formulation follow the classical VAE (Kingma and Welling, 2013), but there are two key differences: (1) the latent variables $z$ are discrete categorical variables, and (2) instead of taking only the latent representation $z$, the decoder takes both the token vectors and the corresponding latent vectors for sentence reconstruction. Since the discrete version of the VAE is well studied in previous efforts such as VQ-VAE (Van Den Oord et al., 2017) and Gumbel-Softmax (Jang et al., 2017), the remaining optimization problem is how to minimize the non-differentiable term $T$ to encourage a sparse selection of the token representations.
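To make the formulation concrete, here is a minimal PyTorch-style sketch of how the two differentiable terms of Eq. (1) could be computed (our own illustration, not the released implementation; a uniform prior $p(z)$ and the tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def latent_typing_losses(recon_logits: torch.Tensor,  # (N, vocab), output of the decoder g
                         tokens: torch.Tensor,        # (N,), original token ids of sentence s
                         type_probs: torch.Tensor):   # (N, V_c), p_h(z | f(s)) for each token
    """Illustrative computation of L_rec and L_KL from Eq. (1)."""
    # L_rec: negative log-likelihood of reconstructing the original sentence.
    rec_loss = F.cross_entropy(recon_logits, tokens)

    # L_KL: KL(p_h(z | f(s)) || p(z)), assuming a uniform prior over the V_c latent types.
    num_types = type_probs.size(-1)
    log_prior = torch.full_like(type_probs, 1.0 / num_types).log()
    kl_loss = F.kl_div(log_prior, type_probs, reduction="batchmean")
    return rec_loss, kl_loss
```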
4 Learning Sparse Latent Types
To tackle the non-differentiability of the size of the selected typing pairs, $T = |(Z, \hat{X})|$, we first take a closer look at the latent type classifier $h$, which decides the latent type $z_i$ of each token representation $x_i$. Our insight is that we can regard the action of not selecting a token representation as assigning it a frozen zero type vector $c_1 = \mathbf{0} \in \mathcal{C}$. We then perform an element-wise multiplication between $z_i$ and $x_i$ to obtain the representations $\bar{x}_i = x_i \otimes z_i$ that are fed into the text decoder $g$. The advantages of this approach are that (1) the element-wise multiplication naturally prevents the gradient from being propagated to the token representations that are classified as the zero type vector $c_1$, and (2) the element-wise multiplication directly modulates the gradients of the token representations with the latent type vectors. This can in principle provide better guidance to the text encoder with the information of the latent vectors than other vector fusion operators such as element-wise addition or vector concatenation. Based on this framework, we develop a novel typing sparsity loss in Section 4.2 to approximately minimize the typing pair size $T$. While our approach is generally applicable to any text encoder and decoder, the specific neural architectures used in this work are discussed in Section 5.1.
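The gradient-blocking effect of this element-wise fusion can be seen in the small self-contained sketch below (our own toy example; the sizes and names are arbitrary). Row 0 of the latent type table plays the role of the zero "not selected" type $c_1$:

```python
import torch
import torch.nn as nn

d_model, num_types = 8, 4
# Latent type embedding table C; row 0 acts as the zero "not selected" type c_1.
# (Keeping it frozen during real training would additionally require masking its gradient.)
type_emb = nn.Parameter(torch.randn(num_types, d_model))
with torch.no_grad():
    type_emb[0].zero_()

x = torch.randn(5, d_model, requires_grad=True)   # token representations from the encoder f
type_ids = torch.tensor([0, 2, 0, 3, 1])          # tokens 0 and 2 are "not selected"

z = type_emb[type_ids]                            # latent type vectors z_i
x_bar = x * z                                     # element-wise fusion, x_i ⊗ z_i

x_bar.sum().backward()
print(x.grad[0])   # all zeros: no gradient reaches an unselected token representation
print(x.grad[1])   # equals the type-2 embedding: gradients are modulated by the type vector
```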
In our framework, the latent type classifier $h$ is simplified as a mapping $h': \mathcal{X} \mapsto \mathcal{Z}$ that only outputs the latent type $z_i$ for each token representation. The simplified text decoder $g': \bar{\mathcal{X}} \mapsto \mathcal{S}$ then only needs to model the fused representation space $\bar{\mathcal{X}} = \mathcal{Z} \otimes \mathcal{X}$ for sentence reconstruction, where $\otimes$ is the vector fusion operator and should be interpreted as element-wise multiplication in this work. The proposed architecture for sparse latent typing is illustrated in Figure 2 and further explained in the following subsections.
4.1 Gumbel Latent Typing
Given the token representations generated by the text encoder, $X = \{x_1, \cdots, x_N\} \in \mathbb{R}^{N \times d_m}$, where $N$ is the number of input tokens and $d_m$ is the length of the token representation vectors, our Gumbel latent type classifier first maps $X$ into logits $L \in \mathbb{R}^{N \times V_c}$ with a weight matrix $W \in \mathbb{R}^{d_m \times V_c}$, and then outputs the probabilities $P_{i,v}$ of choosing the $v$-th latent type for each token representation.
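A sketch of this Gumbel latent typing step, using PyTorch's built-in `gumbel_softmax` (the temperature value, hard sampling, and module layout are our assumptions for illustration and are not claimed to match the paper's hyperparameters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelLatentTyper(nn.Module):
    """Maps token representations X in R^{N x d_m} to logits L in R^{N x V_c}
    and draws a differentiable discrete latent type for each token."""
    def __init__(self, d_m: int, v_c: int, tau: float = 1.0):
        super().__init__()
        self.W = nn.Linear(d_m, v_c, bias=False)  # the weight matrix W in R^{d_m x V_c}
        self.tau = tau                            # Gumbel-Softmax temperature (assumed value)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        L = self.W(X)                             # logits, shape (N, V_c)
        # Straight-through Gumbel-Softmax: one-hot samples in the forward pass,
        # soft differentiable probabilities in the backward pass.
        return F.gumbel_softmax(L, tau=self.tau, hard=True)

X = torch.randn(7, 768)                     # 7 tokens with d_m = 768 (illustrative sizes)
P = GumbelLatentTyper(d_m=768, v_c=16)(X)   # (7, 16): one sampled latent type per token
```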