Language Model Pre-Training with Sparse Latent Typing
Liliang Ren1, Zixuan Zhang1, Han Wang2, Clare R. Voss3, Chengxiang Zhai1, Heng Ji1
1University of Illinois at Urbana-Champaign, 2Amazon Alexa, 3US Army Research Laboratory
{liliang3, zixuan11, czhai, hengji}@illinois.edu
wnghn@amazon.com, clare.r.voss.civ@army.mil
Abstract
Modern large-scale Pre-trained Language Models (PLMs) have achieved tremendous success on a wide range of downstream tasks. However, most LM pre-training objectives focus only on text reconstruction and have not sought to learn latent-level interpretable representations of sentences. In this paper, we push language models to obtain a deeper understanding of sentences by proposing a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. Moreover, the language model pre-trained with such an objective also significantly improves Information Extraction related downstream tasks in both supervised and few-shot settings. Our code is publicly available at https://github.com/renll/SparseLT.
1 Introduction
Transformer-based Pre-trained Language Models (PLMs) have achieved significant success on a wide range of NLP tasks. However, typical pre-training objectives for PLMs only focus on teaching the model to directly reconstruct text-level words or sentences, and have not sought to obtain deeper sentence understanding by learning latent-level interpretable representations. For example, transformer-decoder models like the OpenAI GPT series (Radford et al., 2018, 2019; Brown et al., 2020) adopt the task of language modeling for pre-training, and transformer-encoder models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are trained by predicting the masked tokens within a sentence. Both of these training objectives merely train the models to recover masked tokens or predict the next words or sentences, while not learning latent-level representations of sentences that could be potentially useful for both better language understanding and downstream tasks.

* Equal contribution. Listing order is random. Liliang proposed and implemented the architecture designs and the training objectives of Sparse Latent Typing (SLT), and he also conducted extensive experiments for pre-training, few-shot evaluation, and the analyses. Zixuan designed the language model pre-training pipeline for SLT, built the initial training codebase, and conducted the experiments for pre-training and supervised evaluation. Both authors initially came up with the same project goal of encouraging the model to sparsely select sentence-level keywords during pre-training.

Figure 1: A general illustration of our approach to teach a pre-trained language model to extract sentence-level keywords with latent type representations in a completely self-supervised manner.
Pre-training a language model to learn latent representations is extremely hard. First, there are no ground-truth labels for the latent representations that could be used for reliable supervised learning. During pre-training, the model is only given an unlabeled text corpus over which to identify latent representations such as sentence-level keywords and structures. This means the training process must be strictly self-supervised (Rush et al., 2018). Furthermore, to be interpretable, the latent representations for natural language texts are supposed to be discrete, which further complicates the design of a completely differentiable training framework.
To push language models to learn deeper understandings of sentences, in this paper we propose a novel pre-training framework, Sparse Latent Typing, that enables the language model to sparsely extract sentence-level keywords with meaningful
latent types. We tackle all of the above-mentioned challenges, and our framework is fully differentiable and completely self-supervised. As shown in Figure 1, given an input sentence from the pre-training corpus, we introduce a latent typing mechanism that jointly selects and classifies keywords from the sentence into a set of randomly initialized latent types. We implement this latent classification model based on Gumbel Sampling (Jang et al., 2017) to make sure the overall pre-training framework is differentiable. Since there are no ground-truth labels available for the selected keywords and latent types, we incorporate a one-layer transformer decoder into the training pipeline to map the fused token and latent type representations back to the original sentence, and use the sentence reconstruction loss to control for adequate usefulness of the latent representations. Our approach provides the decoder model with a shortcut to directly access the encoded token representations, so that the latent representation for each input token can be learned as an auxiliary type representation. For the pre-training objectives, in addition to minimizing the sentence reconstruction error, we also introduce a novel typing sparsity loss to minimize the number of token representations selected for latent typing. A KL-divergence based diversity loss is also proposed to encourage a diverse selection of the latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. Moreover, the language model pre-trained with such an objective also significantly improves Information Extraction related downstream tasks in both supervised and few-shot settings.
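As a rough illustration of how these three training signals (reconstruction, typing sparsity, and type diversity) could be combined into a single objective, consider the minimal sketch below. The weighting coefficients alpha and beta are hypothetical placeholders, not values from the paper.

```python
import torch

def combined_pretraining_loss(rec_loss: torch.Tensor,
                              typing_sparsity_loss: torch.Tensor,
                              type_diversity_loss: torch.Tensor,
                              alpha: float = 1.0,
                              beta: float = 1.0) -> torch.Tensor:
    """Sentence reconstruction + typing sparsity + KL-based type diversity.
    The relative weights alpha/beta are illustrative assumptions only."""
    return rec_loss + alpha * typing_sparsity_loss + beta * type_diversity_loss
```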
In summary, our contributions are three-fold:
• We propose a fully differentiable language model pre-training framework that enables the model to sparsely extract sentence-level keywords with latent types in a completely self-supervised manner.
• We provide comprehensive analysis and interpretation of our experimental results, showing that the pre-trained model is able to extract meaningful latent type representations.
• Extensive experiments on IE-related downstream tasks demonstrate that our proposed pre-training framework can significantly advance the state of the art.
2 Related Work
Knowledge-Enhanced Language Models
As pretrained language models (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Brown et al., 2020; Lewis et al., 2020a; Raffel et al., 2020) are achieving great success on downstream NLP tasks, many research studies focus on how to make these PLMs more knowledgeable. Previous studies (Peters et al., 2019; Zhang et al., 2019; Xiong et al., 2020; He et al., 2020; Yamada et al., 2020; Qin et al., 2021; Wang et al., 2021) either design entity- and relation-aware pre-training objectives or modify the model architecture to fuse both text and entity information. However, all of these previous approaches utilize large-scale, human-annotated, semi-structured external resources (e.g., Wikipedia). In comparison, our method is completely self-supervised and only needs a text corpus for pre-training; it focuses instead on encouraging the model to learn knowledge clusters at a latent level.
Latent Structure Learning
There are also several studies (Liu et al., 2021; Subramani et al., 2022) that incorporate latent structure learning into language model pre-training. In particular, Montero et al. (2021) also use a transformer decoder layer to reconstruct the original sentence and provide training signals. However, instead of learning coarse-grained sentence representations, we focus on learning fine-grained latent type representations that are interpretable and useful at the token level. To this end, we propose a series of novel training objectives and architecture designs that facilitate a sparse selection and typing of the token representations in the latent space.
Information Extraction
Our approach of detecting sentence-level keywords with latent types is inspired by Information Extraction (IE) (Cowie and Lehnert, 1996), an essential NLP task that aims to extract knowledge from texts. Although IE includes a wide range of tasks varying in what to extract (entities, relations, events) and where to extract it from (sentences, documents, corpora), typical IE frameworks usually include two essential steps: 1) Selection: selecting the most task-relevant units from the inputs, and 2) Classification: assigning each of these units a correct type label. Such a select-and-classify framework is common to several IE tasks, including entity extraction, event detection, and event argument extraction. Accordingly, our approach follows a similar selection-classification scheme to incorporate word selection and latent typing into pre-training.
3 Problem Formulation
Given a text corpus $\mathcal{D}$ composed of text sentences $\mathcal{S}$, we use $s = \{w_1, \cdots, w_N\}$, $s \sim \mathcal{S}$, to represent a sentence consisting of $N$ tokens. Assuming a text encoder $f: \mathcal{S} \mapsto \mathcal{X}$ that takes the sentence $s$ as input and outputs the token representations $x_1, \cdots, x_N$, our latent type classifier $h: \mathcal{X} \mapsto (\mathcal{Z}, \hat{\mathcal{X}})$ then selects a subset of token representations $\hat{x}_1, \cdots, \hat{x}_T$ ($T \leq N$) and classifies them into latent type representations $z_1, \cdots, z_T$. Each of the token types $z_i$ is selected from a latent embedding space $\mathcal{C}$ consisting of $V_c = |\mathcal{C}|$ different latent vectors. The text decoder $g: (\mathcal{Z}, \hat{\mathcal{X}}) \mapsto \mathcal{S}$ then reconstructs the original sentence $s$ through the pair of latent types and selected token representations $(Z, \hat{X})$.
The objective of sparse latent typing is to find pairs of latent types and token representations that are as compact as possible but still contain the necessary information for reconstructing the original input sentences. Formally, we want to minimize the following joint objective,
$$\min_{\theta_{f,h,g}} \quad T,\; \mathcal{L}_{\mathrm{rec}}(f,h,g),\; \mathcal{L}_{\mathrm{KL}}(f,h) \tag{1}$$
with:
$$\mathcal{L}_{\mathrm{rec}}(f,h,g) = \mathbb{E}_{s\sim\mathcal{S}}\left[-\log p_{g}\left(s \mid h(f(s))\right)\right],$$
$$\mathcal{L}_{\mathrm{KL}}(f,h) = D_{\mathrm{KL}}\left(p_{h}(z \mid f(s)) \,\|\, p(z)\right),$$
where $T$ is the number of selected token representations, $p(z)$ is a prior distribution over the latent types, and $D_{\mathrm{KL}}(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence. The reconstruction loss and the KL term in our formulation follow the classical VAE (Kingma and Welling, 2013), but there are two key differences: (1) the latent variables $z$ are discrete categorical variables, and (2) instead of taking only the latent representation $z$, the decoder takes both the token vectors and the corresponding latent vectors for sentence reconstruction. Since the discrete version of the VAE is well studied in previous efforts such as VQ-VAE (Van Den Oord et al., 2017) and Gumbel-Softmax (Jang et al., 2017), the remaining optimization problem is how to minimize the non-differentiable term $T$ to encourage a sparse selection of the token representations.
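To make the formulation concrete, here is a minimal PyTorch-style sketch of how the two differentiable terms of Eq. (1) could be computed (our own illustration, not the released implementation; a uniform prior $p(z)$ and the tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def latent_typing_losses(recon_logits: torch.Tensor,  # (N, vocab), output of the decoder g
                         tokens: torch.Tensor,        # (N,), original token ids of sentence s
                         type_probs: torch.Tensor):   # (N, V_c), p_h(z | f(s)) for each token
    """Illustrative computation of L_rec and L_KL from Eq. (1)."""
    # L_rec: negative log-likelihood of reconstructing the original sentence.
    rec_loss = F.cross_entropy(recon_logits, tokens)

    # L_KL: KL(p_h(z | f(s)) || p(z)), assuming a uniform prior over the V_c latent types.
    num_types = type_probs.size(-1)
    log_prior = torch.full_like(type_probs, 1.0 / num_types).log()
    kl_loss = F.kl_div(log_prior, type_probs, reduction="batchmean")
    return rec_loss, kl_loss
```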
4 Learning Sparse Latent Types
To tackle the non-differentiability of the size of the selected typing pairs, $T = |(Z, \hat{X})|$, we first take a closer look at the latent type classifier $h$, which decides the latent type $z_i$ of each token representation $x_i$. Our insight is that we can regard the action of not selecting a token representation as assigning it a frozen zero type vector $c_1 = \mathbf{0} \in \mathcal{C}$. We then perform an element-wise multiplication between $z_i$ and $x_i$ to obtain the representations $\bar{x}_i = x_i \otimes z_i$ that are fed into the text decoder $g$. The advantages of this approach are that (1) the element-wise multiplication naturally prevents the gradient from being propagated to the token representations that are classified as the zero type vector $c_1$, and (2) the element-wise multiplication directly modulates the gradients of the token representations with the latent type vectors. This can in principle provide better guidance to the text encoder with the information of the latent vectors than other vector fusion operators such as element-wise addition or vector concatenation. Based on this framework, we develop a novel typing sparsity loss in Section 4.2 to approximately minimize the typing pair size $T$. While our approach is generally applicable to any text encoder and decoder, the specific neural architectures used in this work are discussed in Section 5.1.
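The gradient-blocking effect of this element-wise fusion can be seen in the small self-contained sketch below (our own toy example; the sizes and names are arbitrary). Row 0 of the latent type table plays the role of the zero "not selected" type $c_1$:

```python
import torch
import torch.nn as nn

d_model, num_types = 8, 4
# Latent type embedding table C; row 0 acts as the zero "not selected" type c_1.
# (Keeping it frozen during real training would additionally require masking its gradient.)
type_emb = nn.Parameter(torch.randn(num_types, d_model))
with torch.no_grad():
    type_emb[0].zero_()

x = torch.randn(5, d_model, requires_grad=True)   # token representations from the encoder f
type_ids = torch.tensor([0, 2, 0, 3, 1])          # tokens 0 and 2 are "not selected"

z = type_emb[type_ids]                            # latent type vectors z_i
x_bar = x * z                                     # element-wise fusion, x_i ⊗ z_i

x_bar.sum().backward()
print(x.grad[0])   # all zeros: no gradient reaches an unselected token representation
print(x.grad[1])   # equals the type-2 embedding: gradients are modulated by the type vector
```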
In our framework, the latent type classifier $h$ is simplified as a mapping $h': \mathcal{X} \mapsto \mathcal{Z}$ that only outputs the latent type $z_i$ for each token representation. The simplified text decoder $g': \bar{\mathcal{X}} \mapsto \mathcal{S}$ then only needs to model the fused representation space $\bar{\mathcal{X}} = \mathcal{Z} \otimes \mathcal{X}$ for sentence reconstruction, where $\otimes$ is the vector fusion operator and should be interpreted as element-wise multiplication in this work. The proposed architecture for sparse latent typing is illustrated in Figure 2 and further explained in the following subsections.
4.1 Gumbel Latent Typing
Given the token representations generated by the text encoder, $X = \{x_1, \cdots, x_N\} \in \mathbb{R}^{N \times d_m}$, where $N$ is the number of input tokens and $d_m$ is the length of the token representation vectors, our Gumbel latent type classifier first maps $X$ into logits $L \in \mathbb{R}^{N \times V_c}$ with a weight matrix $W \in \mathbb{R}^{d_m \times V_c}$, and then outputs the probabilities $P_{i,v}$ of choosing the $v$-th latent type for each token representation.
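A sketch of this Gumbel latent typing step, using PyTorch's built-in `gumbel_softmax` (the temperature value, hard sampling, and module layout are our assumptions for illustration and are not claimed to match the paper's hyperparameters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelLatentTyper(nn.Module):
    """Maps token representations X in R^{N x d_m} to logits L in R^{N x V_c}
    and draws a differentiable discrete latent type for each token."""
    def __init__(self, d_m: int, v_c: int, tau: float = 1.0):
        super().__init__()
        self.W = nn.Linear(d_m, v_c, bias=False)  # the weight matrix W in R^{d_m x V_c}
        self.tau = tau                            # Gumbel-Softmax temperature (assumed value)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        L = self.W(X)                             # logits, shape (N, V_c)
        # Straight-through Gumbel-Softmax: one-hot samples in the forward pass,
        # soft differentiable probabilities in the backward pass.
        return F.gumbel_softmax(L, tau=self.tau, hard=True)

X = torch.randn(7, 768)                     # 7 tokens with d_m = 768 (illustrative sizes)
P = GumbelLatentTyper(d_m=768, v_c=16)(X)   # (7, 16): one sampled latent type per token
```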