Lbl2Vec: An Embedding-Based Approach for Unsupervised Document
Retrieval on Predefined Topics
Tim Schopf, Daniel Braun and Florian Matthes
Department of Informatics, Technical University of Munich, Boltzmannstrasse 3, Garching, Germany
{tim.schopf, daniel.braun, matthes}@tum.de
Keywords: Natural Language Processing, Document Retrieval, Unsupervised Document Classification.
Abstract:
In this paper, we consider the task of retrieving documents with predefined topics from an unlabeled document
dataset using an unsupervised approach. The proposed unsupervised approach requires only a small number of
keywords describing the respective topics and no labeled documents. Existing approaches rely heavily on either
a large amount of additionally encoded world knowledge or term-document frequencies. In contrast,
we introduce a method that learns jointly embedded document and word vectors solely from the unlabeled
document dataset in order to find documents that are semantically similar to the topics described by the
keywords. The proposed method requires almost no text preprocessing but is simultaneously effective at
retrieving relevant documents with high probability. When successively retrieving documents on different
predefined topics from publicly available and commonly used datasets, we achieved an average area under the
receiver operating characteristic curve value of 0.95 on one dataset and 0.92 on another. Further, our method
can be used for multiclass document classification, without the need to assign labels to the dataset in advance.
Compared with an unsupervised classification baseline, we increased F1 scores from 76.6 to 82.7 and from
61.0 to 75.1 on the respective datasets. For easy replication of our approach, we make the developed Lbl2Vec
code publicly available as a ready-to-use tool under the 3-Clause BSD license (https://github.com/sebischair/Lbl2Vec).
1 INTRODUCTION
In this paper, we combine the advantage of an unsuper-
vised approach with the possibility to predefine topics.
Precisely, given a large number of unlabeled docu-
ments, we would like to retrieve documents related to
certain topics that we already know are present in the
corpus. This is an increasingly common task, considering
not only how easy it is to collect documents by, e.g.,
scraping web pages, emails, or other sources, but also
how costly it is to label them. For illustration purposes, we imagine
the following scenario: we possess a large number of
news articles extracted from sports sections of differ-
ent newspapers and would like to retrieve articles that
are related to certain sports, such as hockey, soccer
or basketball. Unfortunately, we can only rely on the
article texts for this task, as the metadata of the articles
contain no information about their content. Initially,
this appears to be a common text classification task.
However, two issues arise that make the use of
conventional classification methods unsuitable. First,
we would have to annotate our articles at a high cost,
as conventional supervised text classification methods
need a large amount of labeled training data (Zhang
et al., 2020). Second, we might not be interested in
any sports apart from the previously specified ones.
However, our dataset of sports articles most likely also
includes articles on other sports, such as swimming or
running. To apply a supervised classification method,
we would either have to annotate even those articles
that are of no interest to us or devise suitable cleaning
steps beforehand to remove unwanted articles from
our dataset. Both options would incur significant
additional cost.
In this paper, we present the Lbl2Vec approach, which
enables the retrieval of documents on predefined topics
from a large corpus based on unsupervised learning.
This allows us to retrieve the wanted sports articles
related to hockey, soccer, and basketball only, without
having to annotate any data. The proposed Lbl2Vec
approach relies solely on semantic similarities between
documents and keywords describing a certain topic.
Using semantic meanings intuitively matches the approach
of a human being and has previously been proven to be
capable of categorizing unlabeled texts (Chang et al.,
2008). With this approach, we significantly decrease
the cost of annotating data, as we only need a small
number of keywords instead of a large number of
labeled documents.

arXiv:2210.06023v1 [cs.CL] 12 Oct 2022
Lbl2Vec works by creating jointly embedded word,
document, and label vectors. The label vectors are
derived from predefined keywords of each topic. Since
label and document vectors are embedded in the same
feature space, we can subsequently measure their
semantic relationship by calculating their cosine
similarity. Based on this semantic similarity, we can
decide whether to assign a document to a certain topic.
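The similarity-based retrieval step can be sketched as follows. Note that the vectors below are toy values standing in for the jointly learned embeddings (in the actual method they are learned from the corpus), the centroid construction of the label vector is one plausible choice, and the 0.5 threshold is purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy stand-ins for jointly embedded keyword vectors describing "hockey".
keyword_vectors = {"hockey": [0.9, 0.1, 0.0], "puck": [0.8, 0.2, 0.1]}

# The label vector is taken here as the centroid of its keyword vectors.
label_vector = [sum(dim) / len(keyword_vectors)
                for dim in zip(*keyword_vectors.values())]

# Toy document vectors assumed to come from the same embedding space.
doc_vectors = {"ice_hockey_article": [0.85, 0.15, 0.05],
               "swimming_article":   [0.05, 0.20, 0.90]}

# Retrieve documents whose similarity to the topic exceeds a threshold.
scores = {doc: cosine(vec, label_vector) for doc, vec in doc_vectors.items()}
retrieved = sorted((d for d, s in scores.items() if s > 0.5),
                   key=lambda d: -scores[d])
print(retrieved)  # → ['ice_hockey_article']
```

The hockey article is retrieved because its vector nearly coincides with the topic centroid, while the swimming article scores far below the threshold.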
We show that our approach produces reliable results
while saving annotation costs and requires almost no
text preprocessing steps. To this end, we apply our ap-
proach to two publicly available and commonly used
document classification datasets. Moreover, we make
our Lbl2Vec code publicly available as a ready-to-use
tool.
2 RELATED WORK
Most related research can be summarized under
the notion of dataless classification, introduced
by Chang et al. (2008). Broadly, this includes any
approach that aims to classify unlabeled texts based
on label descriptions only. Our approach differs
slightly from these, as we primarily attempt to retrieve
documents on predefined topics from an unlabeled
document dataset without needing to consider
documents belonging to other topics of no interest.
Nevertheless, some similarities, such as the ability to
perform multiclass document classification, emerge,
allowing a rough comparison of our approach with
dataless classification approaches. These can further
be divided along two dimensions: 1) semi-supervised
vs. unsupervised approaches and 2) approaches that
use a large amount of additional world knowledge
vs. ones that mainly rely on the plain document corpus.
Semi-supervised approaches seek to annotate a
small subset of the document corpus in an unsupervised
manner and subsequently leverage the labeled subset
to train a supervised classifier for the rest of the
corpus. In one of the earliest approaches that fit
into this category, Ko and Seo (2000) derived training
sentences from manually defined category keywords
without supervision. Then, they used the derived
sentences to train a supervised Naïve Bayes classifier
with minor modifications. Similarly, Liu et al. (2004)
extracted a subset of documents with keywords
and then applied a supervised Naïve Bayes-based
expectation-maximization algorithm (Dempster et al.,
1977) for classification.
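The semi-supervised pattern can be illustrated with a toy sketch (not Ko and Seo's or Liu et al.'s actual systems): documents that unambiguously match one category's keywords are pseudo-labeled, and a simple Naive Bayes classifier trained on that subset then labels the remaining documents. The corpus and keyword sets are made up for illustration:

```python
from collections import Counter
import math

docs = ["the goalie blocked the puck on the ice",
        "he scored a goal with a header in the match",
        "the puck slid across the ice rink",
        "a late header won the match"]
keywords = {"hockey": {"puck", "ice"}, "soccer": {"header", "match"}}

# Step 1 (unsupervised): pseudo-label documents whose tokens overlap
# with the keywords of exactly one category.
labeled = []
for doc in docs:
    hits = [cat for cat, kws in keywords.items() if kws & set(doc.split())]
    if len(hits) == 1:
        labeled.append((doc, hits[0]))

# Step 2 (supervised): train a multinomial Naive Bayes classifier with
# Laplace smoothing on the pseudo-labeled subset.
word_counts = {cat: Counter() for cat in keywords}
class_counts = Counter()
for doc, cat in labeled:
    word_counts[cat].update(doc.split())
    class_counts[cat] += 1
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for cat, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[cat] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get)

# A document containing none of the predefined keywords is now
# classified by the trained model rather than by keyword matching.
print(predict("a stunning goal in extra time"))  # → soccer
```

The held-out sentence shares "a", "goal", and "in" with the soccer training documents, so the classifier assigns it to soccer even though it contains no category keyword.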
Unsupervised approaches, by contrast, use similarity
scores between documents and target categories
to classify the entire unlabeled dataset. Haj-Yahia
et al. (2019) proposed keyword enrichment (KE) and
subsequent unsupervised classification based on latent
semantic analysis (LSA) (Deerwester et al., 1990)
vector cosine similarities. Another approach worth
mentioning in this context is the pure dataless hierar-
chical classification used by Song and Roth (2014)
to evaluate different semantic representations. Our
approach also fits into this unsupervised dimension,
as we do not employ document labels and retrieve
documents from the entire corpus based on cosine
similarities only.
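The LSA-based classification idea can be sketched compactly (this is not Haj-Yahia et al.'s implementation; their keyword-enrichment step is omitted, and the corpus, keyword sets, and rank k=2 are toy choices): each category is represented by a pseudo-document of its keywords, all texts are projected into a latent space via truncated SVD, and each document receives the label of the most cosine-similar category vector.

```python
import numpy as np

docs = ["puck ice rink goalie", "goal match header striker",
        "ice puck stick", "striker match goal"]
category_keywords = {"hockey": "puck ice goalie", "soccer": "goal match striker"}

# Build a count-based term-document matrix over the documents and the
# category pseudo-documents together, so both share one vector space.
texts = docs + list(category_keywords.values())
vocab = sorted({w for t in texts for w in t.split()})
tdm = np.array([[t.split().count(w) for w in vocab] for t in texts], dtype=float)

# Truncated SVD (the core of LSA): keep the k leading latent dimensions.
u, s, _ = np.linalg.svd(tdm, full_matrices=False)
k = 2
latent = u[:, :k] * s[:k]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Label each document with its most cosine-similar category vector.
doc_vecs, cat_vecs = latent[:len(docs)], latent[len(docs):]
cats = list(category_keywords)
labels = [cats[int(np.argmax([cos(dv, cv) for cv in cat_vecs]))]
          for dv in doc_vecs]
print(labels)  # → ['hockey', 'soccer', 'hockey', 'soccer']
```

Because the two topics use disjoint vocabularies here, the two leading latent dimensions align with the two topics and the cosine comparison separates them cleanly; real corpora rely on LSA's latent dimensions to capture softer co-occurrence structure.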
A large amount of additional world knowledge
from different data sources has been widely
exploited in many previous approaches to incorporate
more context into the semantic relationship between
documents and target categories. Chang et al. (2008)
used Wikipedia as a source of world knowledge to
compute explicit semantic analysis embeddings
(Gabrilovich and Markovitch, 2007) of labels and
documents. Afterward, they applied nearest
neighbor classification to assign the most likely
label to each document. In this regard, their early
work had a major impact on further research, which
subsequently focused heavily on adding large amounts
of world knowledge for dataless classification. Yin et al.
(2019) used various public entailment datasets to
train a bidirectional encoder representations from
transformers (BERT) model (Devlin et al., 2019)
and used the pretrained BERT entailment model to
directly classify texts from different datasets.
Using mainly the plain document corpus for
this task, however, has received rather less research
attention so far. In one of the earlier approaches, Rao et al.
(2006) derived and assigned document labels based
on k-means word clustering. In addition, Chen et al.
(2015) introduced descriptive latent Dirichlet allocation,
which can perform classification with only category
description words and unlabeled documents, thereby
eliminating the need for a large amount of world
knowledge from external sources. Since our approach
needs only some predefined topic keywords besides
the unlabeled document corpus, it also belongs to this
category. However, unlike previous approaches that
mainly used the plain document corpus, we do not
rely on term-document frequency scores but learn
new semantic embeddings from scratch, which was