latent types. We have tackled all of the above-mentioned challenges, and our framework is fully differentiable and completely self-supervised. As shown in Figure 1, given an input sentence from the pre-training corpus, we introduce a latent typing mechanism that jointly selects keywords from the sentence and classifies them into a set of randomly initialized latent types. We implement this latent classification based on Gumbel sampling (Jang et al., 2017) to ensure that the overall pre-training framework is differentiable. Since there are no ground-truth labels available for the selected keywords and latent types, we incorporate a one-layer Transformer decoder into the training pipeline that maps the fused token and latent type representations back to the original sentence, and we use the sentence reconstruction loss to ensure that the latent representations remain adequately informative.
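To make the select-and-classify step concrete, the following is a minimal PyTorch-style sketch of how such a differentiable keyword selection and latent typing module could look. The module name, the two-head layout, and the straight-through Gumbel-Softmax formulation are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentTyper(nn.Module):
    """Illustrative module: select keywords and assign latent types with
    straight-through Gumbel-Softmax so the pipeline stays differentiable."""

    def __init__(self, hidden_size: int, num_types: int, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        # Binary select / skip decision for every token.
        self.select_head = nn.Linear(hidden_size, 2)
        # Classification head over randomly initialized latent types.
        self.type_head = nn.Linear(hidden_size, num_types)
        self.type_embeddings = nn.Embedding(num_types, hidden_size)

    def forward(self, token_states: torch.Tensor):
        # token_states: (batch, seq_len, hidden) from the encoder.
        # Hard but differentiable decisions via the straight-through trick.
        select = F.gumbel_softmax(self.select_head(token_states),
                                  tau=self.tau, hard=True)[..., 1:]   # (B, L, 1)
        type_probs = F.gumbel_softmax(self.type_head(token_states),
                                      tau=self.tau, hard=True)        # (B, L, T)
        type_vecs = type_probs @ self.type_embeddings.weight           # (B, L, H)
        # Fuse token and latent-type representations; tokens that are not
        # selected keep their original encoding.
        fused = token_states + select * type_vecs
        return fused, select, type_probs
```

In this sketch, the one-layer Transformer decoder would then take `fused` as input when reconstructing the original sentence.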
Our approach provides the decoder model with a
shortcut to directly access the encoded token repre-
sentations, so that the latent representation for each
of the input tokens can be learned as an auxiliary
type representation. For pre-training objectives, in
addition to minimizing the sentence reconstruction error, we also introduce a novel typing sparsity loss that minimizes the number of token representations selected for latent typing. A KL-divergence-based diversity loss is further proposed to encourage a diverse selection of the latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge. In addition, the language model pre-trained with such objectives also yields significant improvements on Information Extraction related downstream tasks in both supervised and few-shot settings.
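The three objectives can be combined as illustrated below, continuing the PyTorch-style sketch above. The exact loss formulations, the direction of the KL term, and the weighting coefficients lambda_sparse and lambda_div are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F


def pretraining_loss(recon_logits, target_ids, select, type_probs,
                     pad_id=0, lambda_sparse=1.0, lambda_div=1.0):
    """Illustrative combination of the three pre-training objectives."""
    # 1) Sentence reconstruction: the decoder output should recover the
    #    original token sequence (cross-entropy over the vocabulary).
    recon = F.cross_entropy(recon_logits.transpose(1, 2), target_ids,
                            ignore_index=pad_id)
    # 2) Typing sparsity: penalize the fraction of tokens selected for typing.
    sparsity = select.mean()
    # 3) Type diversity: a KL term that pushes the batch-level usage of the
    #    latent types toward a uniform distribution, so that many different
    #    types get selected rather than a few dominant ones.
    usage = type_probs.reshape(-1, type_probs.size(-1)).mean(dim=0)
    uniform = torch.full_like(usage, 1.0 / usage.numel())
    diversity = F.kl_div(usage.clamp_min(1e-8).log(), uniform, reduction="sum")
    return recon + lambda_sparse * sparsity + lambda_div * diversity
```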
In summary, our contributions are three-fold:
• We propose a fully differentiable language model pre-training framework that enables the model to sparsely extract sentence-level keywords with latent types in a completely self-supervised manner.
• We provide comprehensive analysis and interpretation of our experimental results, showing that the pre-trained model is able to extract meaningful latent type representations.
• Extensive experiments on IE-related downstream tasks demonstrate that our proposed pre-training framework can significantly advance the state of the art.
2 Related Work
Knowledge-Enhanced Language Models
As pretrained language models (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Brown et al., 2020; Lewis et al., 2020a; Raffel et al., 2020) are achieving great success on downstream NLP tasks, many research studies focus on how to make these PLMs more knowledgeable. Previous studies (Peters et al., 2019; Zhang et al., 2019; Xiong et al., 2020; He et al., 2020; Yamada et al., 2020; Qin et al., 2021; Wang et al., 2021) either focus on designing entity-relation-aware pre-training objectives or on modifying the model architecture to fuse both text and entity information. However, all of these previous approaches rely on large-scale, human-annotated, semi-structured external resources (e.g., Wikipedia). In comparison, our method is completely self-supervised and requires only a plain text corpus for pre-training; it instead encourages the model to learn knowledge clusters at a latent level.
Latent Structure Learning
There are also several studies (Liu et al., 2021; Subramani et al., 2022) that incorporate latent structure learning into language model pre-training. In particular, Montero et al. (2021) also propose to use a Transformer decoder layer to reconstruct the original sentence and provide training signals. However, instead of learning coarse-grained sentence representations, we focus on learning fine-grained latent type representations that are interpretable and useful at the token level. To this end, we propose a series of novel training objectives and architecture designs that facilitate a sparse selection and typing of the token representations in the latent space.
Information Extraction
Our approach to detecting sentence-level keywords with latent types is inspired by Information Extraction (IE) (Cowie and Lehnert, 1996), an essential NLP task that aims to extract knowledge from text. Although IE includes a wide range of tasks that vary in what to extract (entities, relations, events) and where to extract it from (sentences, documents, corpora), typical IE frameworks usually include two essential steps: 1) Selection: selecting the most task-relevant units from the input, and 2) Classification: assigning each selected unit a correct type label. Such a
select-and-classify framework is common to sev-
eral IE tasks, including entity extraction, event de-