proven to be capable of categorizing unlabeled texts
(Chang et al., 2008). With this approach, we signifi-
cantly decrease the cost of annotating data, as we only
need a small number of keywords instead of a large
number of labeled documents.
Lbl2Vec works by creating jointly embedded word, document, and label vectors. The label vectors are derived from predefined keywords of each topic. Since
label and document vectors are embedded in the same
feature space, we can subsequently measure their se-
mantic relationship by calculating their cosine similar-
ity. Based on this semantic similarity, we can decide
whether to assign a document to a certain topic or not.
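To make this decision rule concrete, the following minimal sketch uses made-up three-dimensional vectors and an assumed similarity threshold; neither the dimensionality nor the threshold value is prescribed by our method.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical jointly embedded vectors: one per topic label, one document.
label_vectors = {
    "sports": np.array([0.9, 0.1, 0.2]),
    "politics": np.array([0.1, 0.8, 0.3]),
}
doc_vector = np.array([0.85, 0.15, 0.25])

# Assign the document to the most similar topic, but only if the
# similarity exceeds an assumed cut-off (0.5 is an illustrative value).
scores = {label: cosine_similarity(vec, doc_vector)
          for label, vec in label_vectors.items()}
best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
assignment = best_label if best_score >= 0.5 else None
print(assignment, round(best_score, 3))
```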
We show that our approach produces reliable results while saving annotation costs and requiring almost no text preprocessing. To this end, we apply our approach to two publicly available and commonly used document classification datasets. Moreover, we make our Lbl2Vec code publicly available as a ready-to-use tool.
2 RELATED WORK
Most related research can be summarized under
the notion of dataless classification, introduced
by Chang et al. (2008). Broadly, this includes any
approach that aims to classify unlabeled texts based
on label descriptions only. Our approach differs
slightly from these, as we primarily attempt to retrieve documents on predefined topics from an unlabeled document dataset, without needing to consider documents that belong to other topics of no interest. Nevertheless, some similarities emerge, such as the ability to perform multiclass document classification, which allows a rough comparison of our approach with those from dataless classification. Dataless classification approaches can further be divided along two dimensions: 1) semi-supervised vs. unsupervised approaches, and 2) approaches that use a large amount of additional world knowledge vs. ones that mainly rely on the plain document corpus.
Semi-supervised approaches seek to annotate a small subset of the document corpus in an unsupervised manner and subsequently leverage the labeled subset to train a supervised classifier for the rest of the corpus. In one of the earliest approaches that fit into this category, Ko and Seo (2000) derived training sentences from manually defined category keywords in an unsupervised manner and then used the derived sentences to train a supervised Naïve Bayes classifier with
minor modifications. Similarly, Liu et al. (2004) extracted a subset of documents with keywords and then applied a supervised Naïve Bayes-based
expectation–maximization algorithm (Dempster et al.,
1977) for classification.
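As a rough sketch of this two-step pattern (not the cited authors' implementations), the snippet below pseudo-labels documents by keyword matching and then trains a Naïve Bayes classifier on the pseudo-labeled subset; the corpus and keywords are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus and manually defined category keywords.
docs = ["the team won the match", "parliament passed the bill",
        "a thrilling game last night", "new election results announced"]
keywords = {"sports": ["team", "match", "game"],
            "politics": ["parliament", "election", "bill"]}

# Step 1 (unsupervised): pseudo-label documents containing a category keyword.
pseudo_texts, pseudo_labels = [], []
for doc in docs:
    for label, words in keywords.items():
        if any(w in doc.split() for w in words):
            pseudo_texts.append(doc)
            pseudo_labels.append(label)
            break

# Step 2 (supervised): train a Naive Bayes classifier on the pseudo-labeled
# subset and apply it to new or remaining documents.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pseudo_texts)
clf = MultinomialNB().fit(X, pseudo_labels)
print(clf.predict(vectorizer.transform(["the game ended in a draw"])))
```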
Unsupervised approaches, by contrast, use similarity
scores between documents and target categories
to classify the entire unlabeled dataset. Haj-Yahia
et al. (2019) proposed keyword enrichment (KE) and subsequent unsupervised classification based on latent semantic analysis (LSA) (Deerwester et al., 1990)
vector cosine similarities. Another approach worth
mentioning in this context is the pure dataless hierar-
chical classification used by Song and Roth (2014)
to evaluate different semantic representations. Our
approach also fits into this unsupervised dimension,
as we do not employ document labels and retrieve
documents from the entire corpus based on cosine
similarities only.
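The snippet below sketches this unsupervised pattern in the spirit of the LSA-based classification of Haj-Yahia et al. (2019); the toy corpus, category descriptions, and LSA rank are invented, and the code is a simplified stand-in rather than their actual method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the team won the match", "parliament passed the bill",
        "a thrilling game last night", "new election results announced"]
# Category descriptions built from (possibly enriched) keywords.
categories = {"sports": "team match game score",
              "politics": "parliament election bill vote"}

# Embed documents and category descriptions in the same LSA space.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs + list(categories.values()))
lsa = TruncatedSVD(n_components=2, random_state=0)  # tiny rank for a toy corpus
Z = lsa.fit_transform(X)
doc_vecs, cat_vecs = Z[:len(docs)], Z[len(docs):]

# Assign each document the category with the highest cosine similarity.
labels = list(categories)
for doc, row in zip(docs, cosine_similarity(doc_vecs, cat_vecs)):
    print(doc, "->", labels[row.argmax()])
```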
A large amount of additional world knowledge from different data sources has been widely
exploited in many previous approaches to incorporate
more context into the semantic relationship between
documents and target categories. Chang et al. (2008)
used Wikipedia as a source of world knowledge to
compute explicit semantic analysis embeddings
(Gabrilovich and Markovitch, 2007) of labels and
documents. Afterward, they applied nearest
neighbor classification to assign the most likely
label to each document. In this regard, their early
work had a major impact on further research, which
subsequently focused heavily on incorporating large amounts of world knowledge for dataless classification. Yin et al.
(2019) used various public entailment datasets to
train a bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2019) and used the pretrained BERT entailment model to directly classify texts from different datasets.
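This entailment-based idea underlies today's zero-shot classification pipelines. The sketch below uses the Hugging Face transformers library with a BART-based MNLI checkpoint rather than the original BERT model of Yin et al. (2019); the example sentence and candidate labels are invented.

```python
from transformers import pipeline

# An NLI model scores how well each candidate label, phrased as a
# hypothesis, is entailed by the document text.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The central bank raised interest rates again this quarter.",
    candidate_labels=["economics", "sports", "medicine"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```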
Using mainly the plain document corpus for this task, however, has received comparatively little attention
so far. In one of the earlier approaches, Rao et al.
(2006) derived and assigned document labels based
on k-means word clustering. In addition, Chen et al. (2015) introduced descriptive latent Dirichlet allocation,
which could perform classification with only category
description words and unlabeled documents, thereby
eliminating the need for a large amount of world
knowledge from external sources. Since our approach
only needs some predefined topic keywords besides
the unlabeled document corpus, it also belongs to this
category. However, unlike previous approaches that
mainly used the plain document corpus, we do not
rely on term-document frequency scores but learn
new semantic embeddings from scratch, which was