Lbl2Vec: An Embedding-Based Approach for Unsupervised Document
Retrieval on Predefined Topics
Tim Schopf, Daniel Braun and Florian Matthes
Department of Informatics, Technical University of Munich, Boltzmannstrasse 3, Garching, Germany
{tim.schopf, daniel.braun, matthes}@tum.de
Keywords: Natural Language Processing, Document Retrieval, Unsupervised Document Classification.
Abstract:
In this paper, we consider the task of retrieving documents with predefined topics from an unlabeled document
dataset using an unsupervised approach. The proposed unsupervised approach requires only a small number of
keywords describing the respective topics and no labeled documents. Existing approaches rely heavily on either
a large amount of additionally encoded world knowledge or term-document frequencies. In contrast,
we introduce a method that learns jointly embedded document and word vectors solely from the unlabeled
document dataset in order to find documents that are semantically similar to the topics described by the
keywords. The proposed method requires almost no text preprocessing but is simultaneously effective at
retrieving relevant documents with high probability. When successively retrieving documents on different
predefined topics from publicly available and commonly used datasets, we achieved an average area under the
receiver operating characteristic curve value of 0.95 on one dataset and 0.92 on another. Further, our method
can be used for multiclass document classification, without the need to assign labels to the dataset in advance.
Compared with an unsupervised classification baseline, we increased F1 scores from 76.6 to 82.7 and from
61.0 to 75.1 on the respective datasets. For easy replication of our approach, we make the developed Lbl2Vec
code publicly available as a ready-to-use tool under the 3-Clause BSD license (https://github.com/sebischair/Lbl2Vec).
1 INTRODUCTION
In this paper, we combine the advantage of an unsuper-
vised approach with the possibility to predefine topics.
Precisely, given a large number of unlabeled docu-
ments, we would like to retrieve documents related to
certain topics that we already know are present in the
corpus. This is an increasingly common task, considering
not only how easy it is to collect documents by, e.g.,
scraping web pages, emails, or other sources, but also
how costly it is to label them. For illustration purposes, we imagine
the following scenario: we possess a large number of
news articles extracted from sports sections of differ-
ent newspapers and would like to retrieve articles that
are related to certain sports, such as hockey, soccer
or basketball. Unfortunately, we can only rely on the
article texts for this task, as the metadata of the articles
contain no information about their content. Initially,
this appears to be a common text classification task.
However, two issues arise that make the use of
conventional classification methods unsuitable. First,
we would have to annotate our articles at a high cost,
as conventional supervised text classification methods
need a large amount of labeled training data (Zhang
et al., 2020). Second, we might not be interested in
any sports apart from the previously specified ones.
However, our dataset of sports articles most likely also
includes articles on other sports, such as swimming or
running. To apply a supervised classification method,
we would either have to annotate even those articles
that are of no interest to us or devise suitable cleaning
steps beforehand to remove unwanted articles from
our dataset. Both options would incur significant
additional cost.
In this paper, we present the Lbl2Vec approach, which
enables the retrieval of documents on predefined topics
from a large corpus based on unsupervised learning.
This allows us to retrieve the wanted sports articles
related to hockey, soccer, and basketball only, without
having to annotate any data. The proposed Lbl2Vec
approach relies solely on semantic similarities between
documents and keywords describing a certain topic.
Using semantic meanings intuitively matches the approach
of a human being and has previously been proven to be
capable of categorizing unlabeled texts (Chang et al.,
2008). With this approach, we significantly decrease
the cost of annotating data, as we only need a small
number of keywords instead of a large number of
labeled documents.

arXiv:2210.06023v1 [cs.CL] 12 Oct 2022
Lbl2Vec works by creating jointly embedded word,
document, and label vectors. The label vectors are
derived from predefined keywords of each topic. Since
label and document vectors are embedded in the same
feature space, we can subsequently measure their
semantic relationship by calculating their cosine
similarity. Based on this semantic similarity, we can
decide whether to assign a document to a certain topic.
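The similarity-based retrieval step can be sketched as follows. Note that the vectors below are toy values standing in for the jointly learned embeddings (in the actual method they are learned from the corpus), the centroid construction of the label vector is one plausible choice, and the 0.5 threshold is purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-ins for jointly embedded keyword vectors describing "hockey".
keyword_vectors = {"hockey": [0.9, 0.1, 0.0], "puck": [0.8, 0.2, 0.1]}

# The label vector is taken here as the centroid of its keyword vectors.
label_vector = [sum(dim) / len(keyword_vectors)
                for dim in zip(*keyword_vectors.values())]

# Toy document vectors assumed to come from the same embedding space.
doc_vectors = {"ice_hockey_article": [0.85, 0.15, 0.05],
               "swimming_article":   [0.05, 0.20, 0.90]}

# Retrieve documents whose similarity to the topic exceeds a threshold.
scores = {doc: cosine(vec, label_vector) for doc, vec in doc_vectors.items()}
retrieved = sorted((d for d, s in scores.items() if s > 0.5),
                   key=lambda d: -scores[d])
print(retrieved)  # → ['ice_hockey_article']
```

The hockey article is retrieved because its vector nearly coincides with the topic centroid, while the swimming article scores far below the threshold.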
We show that our approach produces reliable results
while saving annotation costs and requires almost no
text preprocessing steps. To this end, we apply our ap-
proach to two publicly available and commonly used
document classification datasets. Moreover, we make
our Lbl2Vec code publicly available as a ready-to-use
tool.
2 RELATED WORK
Most related research can be summarized under
the notion of dataless classification, introduced
by Chang et al. (2008). Broadly, this includes any
approach that aims to classify unlabeled texts based
on label descriptions only. Our approach differs
slightly from these, as we primarily attempt to retrieve
documents on predefined topics from an unlabeled
document dataset without needing to consider
documents belonging to other topics of no interest.
Nevertheless, some similarities, such as the ability to
perform multiclass document classification, emerge,
allowing a rough comparison of our approach with
dataless classification approaches. These can further
be divided along two dimensions: 1) semi-supervised
vs. unsupervised approaches and 2) approaches that
use a large amount of additional world knowledge
vs. ones that mainly rely on the plain document corpus.
Semi-supervised approaches seek to annotate a
small subset of the document corpus in an unsupervised
manner and subsequently leverage the labeled subset
to train a supervised classifier for the rest of the
corpus. In one of the earliest approaches that fit
into this category, Ko and Seo (2000) derived training
sentences from manually defined category keywords
without supervision. Then, they used the derived
sentences to train a supervised Naïve Bayes classifier
with minor modifications. Similarly, Liu et al. (2004)
extracted a subset of documents with keywords
and then applied a supervised Naïve Bayes-based
expectation-maximization algorithm (Dempster et al.,
1977) for classification.
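The semi-supervised pattern can be illustrated with a toy sketch (not Ko and Seo's or Liu et al.'s actual systems): documents that unambiguously match one category's keywords are pseudo-labeled, and a simple Naive Bayes classifier trained on that subset then labels the remaining documents. The corpus and keyword sets are made up for illustration:

```python
from collections import Counter
import math

docs = ["the goalie blocked the puck on the ice",
        "he scored a goal with a header in the match",
        "the puck slid across the ice rink",
        "a late header won the match"]
keywords = {"hockey": {"puck", "ice"}, "soccer": {"header", "match"}}

# Step 1 (unsupervised): pseudo-label documents whose tokens overlap
# with the keywords of exactly one category.
labeled = []
for doc in docs:
    hits = [cat for cat, kws in keywords.items() if kws & set(doc.split())]
    if len(hits) == 1:
        labeled.append((doc, hits[0]))

# Step 2 (supervised): train a multinomial Naive Bayes classifier with
# Laplace smoothing on the pseudo-labeled subset.
word_counts = {cat: Counter() for cat in keywords}
class_counts = Counter()
for doc, cat in labeled:
    word_counts[cat].update(doc.split())
    class_counts[cat] += 1
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for cat, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[cat] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get)

# A document containing none of the predefined keywords is now
# classified by the trained model rather than by keyword matching.
print(predict("a stunning goal in extra time"))  # → soccer
```

The held-out sentence shares "a", "goal", and "in" with the soccer training documents, so the classifier assigns it to soccer even though it contains no category keyword.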
Unsupervised approaches, by contrast, use similarity
scores between documents and target categories
to classify the entire unlabeled dataset. Haj-Yahia
et al. (2019) proposed keyword enrichment (KE) and
subsequent unsupervised classification based on latent
semantic analysis (LSA) (Deerwester et al., 1990)
vector cosine similarities. Another approach worth
mentioning in this context is the pure dataless hierar-
chical classification used by Song and Roth (2014)
to evaluate different semantic representations. Our
approach also fits into this unsupervised dimension,
as we do not employ document labels and retrieve
documents from the entire corpus based on cosine
similarities only.
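The LSA-based classification idea can be sketched compactly (this is not Haj-Yahia et al.'s implementation; their keyword-enrichment step is omitted, and the corpus, keyword sets, and rank k=2 are toy choices): each category is represented by a pseudo-document of its keywords, all texts are projected into a latent space via truncated SVD, and each document receives the label of the most cosine-similar category vector.

```python
import numpy as np

docs = ["puck ice rink goalie", "goal match header striker",
        "ice puck stick", "striker match goal"]
category_keywords = {"hockey": "puck ice goalie", "soccer": "goal match striker"}

# Build a count-based term-document matrix over the documents and the
# category pseudo-documents together, so both share one vector space.
texts = docs + list(category_keywords.values())
vocab = sorted({w for t in texts for w in t.split()})
tdm = np.array([[t.split().count(w) for w in vocab] for t in texts], dtype=float)

# Truncated SVD (the core of LSA): keep the k leading latent dimensions.
u, s, _ = np.linalg.svd(tdm, full_matrices=False)
k = 2
latent = u[:, :k] * s[:k]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Label each document with its most cosine-similar category vector.
doc_vecs, cat_vecs = latent[:len(docs)], latent[len(docs):]
cats = list(category_keywords)
labels = [cats[int(np.argmax([cos(dv, cv) for cv in cat_vecs]))]
          for dv in doc_vecs]
print(labels)  # → ['hockey', 'soccer', 'hockey', 'soccer']
```

Because the two topics use disjoint vocabularies here, the two leading latent dimensions align with the two topics and the cosine comparison separates them cleanly; real corpora rely on LSA's latent dimensions to capture softer co-occurrence structure.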
A large amount of additional world knowledge
from different data sources has been widely
exploited in many previous approaches to incorporate
more context into the semantic relationship between
documents and target categories. Chang et al. (2008)
used Wikipedia as a source of world knowledge to
compute explicit semantic analysis embeddings
(Gabrilovich and Markovitch, 2007) of labels and
documents. Afterward, they applied nearest
neighbor classification to assign the most likely
label to each document. In this regard, their early
work had a major impact on further research, which
subsequently focused heavily on adding large amounts
of world knowledge for dataless classification. Yin et al.
(2019) used various public entailment datasets to
train a bidirectional encoder representations from
transformers (BERT) model (Devlin et al., 2019)
and used the pretrained BERT entailment model to
directly classify texts from different datasets.
Using mainly the plain document corpus for
this task, however, has received rather less research
attention so far. In one of the earlier approaches, Rao et al.
(2006) derived and assigned document labels based
on k-means word clustering. In addition, Chen et al.
(2015) introduced descriptive latent Dirichlet allocation,
which can perform classification with only category
description words and unlabeled documents, thereby
eliminating the need for a large amount of world
knowledge from external sources. Since our approach
needs only some predefined topic keywords besides
the unlabeled document corpus, it also belongs to this
category. However, unlike previous approaches that
mainly used the plain document corpus, we do not
rely on term-document frequency scores but learn
new semantic embeddings from scratch, which was