ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds

Tommaso Fornaciari, Italian National Police (tommaso.fornaciari@poliziadistato.it)
Dirk Hovy, Bocconi University (dirk.hovy@unibocconi.it)
Federico Bianchi, Stanford University (fede@stanford.edu)

arXiv:2210.14763v1 [cs.CL] 26 Oct 2022
Abstract

The most common ways to explore latent document dimensions are topic models and clustering methods. However, topic models have several drawbacks: e.g., they require us to choose the number of latent dimensions a priori, and the results are stochastic. Most clustering methods have the same issues and lack flexibility in various ways, such as not accounting for the influence of different topics on single documents, forcing word descriptors to belong to a single topic (hard clustering), or necessarily relying on word representations. We propose PROgressive SImilarity Thresholds (ProSiT), a deterministic and interpretable method, agnostic to the input format, that finds the optimal number of latent dimensions and only has two hyper-parameters, which can be set efficiently via grid search. We compare this method with a wide range of topic models and clustering methods on four benchmark data sets. In most settings, ProSiT matches or outperforms the other methods in terms of six metrics of topic coherence and distinctiveness, producing replicable, deterministic results.
1 Introduction

Latent variable models are essential for data exploration and are commonly used in social science and business applications. However, the most common methods used to explore these models, topic models and clustering, come with a set of drawbacks, chief among them their stochastic nature. This makes results hard to replicate precisely, as they require constant interpretation. In this paper, we instead propose a latent variable discovery method that produces deterministic results.
Topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), rely on a number of topics that is provided a priori, iteratively refining their initial random guesses to establish probability distributions. This approach guarantees that the final state improves over the initial one in terms of topic quality, but the method is vulnerable to local optima. This problem can be mitigated by initializing several models and picking the most useful/interpretable solution. Selection usually relies on coherence measures that estimate the topics' reliability. However, the method also expects the user to provide the number of topics as a parameter. Therefore, the probability of finding good results depends on previous domain knowledge about the data set, which is often an unreasonable expectation. In these cases, it is difficult to interpret or justify the resulting topics theoretically, as they depend on an arbitrary starting point. Moreover, since each model's random initialization will produce slightly different results, these results are debatable.
Clustering methods, another popular choice for exploring latent document dimensions, similarly rely on the number of latent dimensions as input and on their random initialization. For example, K-Means (Sculley, 2010) can extract topics from groups of documents identified by minimizing the distance between the documents' representations. The procedure partitions the vector space into Voronoi cells, where each document exclusively contributes to the identification of a single cluster (Burrough et al., 2015). However, in many scenarios, the distinction between clusters/topics is nuanced, and we expect to find borderline texts that can be connected to different topics and should not be artificially forced to belong to a unique cluster.
To overcome these limitations, we propose ProSiT, a deterministic and interpretable latent variable discovery algorithm that is entirely reproducible. ProSiT is agnostic to the input space (embeddings, discrete textual, and even non-textual features), allowing for flexible use and inheriting the properties of the input representations, whatever their nature. For example, with documents encoded by a multilingual model, the identified topics will also be multilingual.
ProSiT is entirely data-driven and does not depend on any guess or random initialization. Instead, as the name suggests, it relies on the similarity between texts, i.e., their distances in the vector space. The only theoretical assumption we make is that documents treating similar topic(s) will contain similar words, and will therefore lie in proximate regions of the vector space. In geometric terms, we assume the topics' convexity in the vector space, which implies that the distance between points represents their degree of similarity.
We evaluate ProSiT on four commonly used data sets against eight common and state-of-the-art latent variable discovery methods. We evaluate the performance with six standard metrics of latent topic coherence and distinctiveness. We find that ProSiT is always comparable and often superior to state-of-the-art methods, while requiring fewer parameters and no prior selection of the number of latent components.
Contributions. We propose ProSiT, a novel algorithm for latent variable discovery. It is interpretable, deterministic, input-agnostic, fuzzy, computationally efficient, and effective. We release ProSiT as a PyPI package; the documentation can be found on GitHub at https://github.com/fornaciari/prosit.
2 ProSiT

ProSiT takes as input a corpus of documents. The algorithm has two main steps: 1) identifying the latent variables (from here on: topics) and 2) extracting words that describe these topics.

ProSiT is efficient for inductive learning: once trained, the topics are represented as points in the same vector space as the training documents. Evaluating unseen documents is as simple as computing their distance from the topic coordinates. The only requirement for new documents is that they are represented in the same vector space as the training data.
2.1 Step 1: Finding topics

The first step is an iterative process that identifies several potential topic sets. The number of topics is data-driven, rather than decided a priori. ProSiT uses different degrees of similarity between groups of documents to determine whether they belong to a topic. These similarity levels are determined by a threshold, which changes at each iteration step and follows a hyperbolic curve. A parameter we call α tunes the slope of this curve.
Formally, we assume a set of documents D. Each document is represented in two ways: as a d-dimensional vector m (continuous or sparse), and as a list of strings s, representing the (pre-processed) document with space-separated words. We use the matrix M over every document vector m to determine the topics, and S over every string representation s to extract the topic descriptors.

We set a similarity threshold α and iterate over M with the following procedure until it stops at the smallest number of topics found:
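For concreteness, the following minimal sketch builds the two representations for a toy corpus. The vectorizer below is only an illustrative choice (a sparse count matrix); since ProSiT is agnostic to the input space, M could equally be a matrix of sentence embeddings. The example documents and variable names are ours, not from the released package.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (already pre-processed: lower-cased, stop words removed).
docs = [
    "oil prices rise opec meeting",
    "stock markets fall rate fears",
    "new smartphone model released week",
]

# M: one d-dimensional vector per document (here, a sparse count matrix
# converted to dense; sentence embeddings would work just as well).
vectorizer = CountVectorizer()
M = vectorizer.fit_transform(docs).toarray().astype(float)

# S: the string representation, i.e., the space-separated tokens of each document.
S = [doc.split() for doc in docs]
```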
1. removing any (possible) duplicated vectors, to prevent topics from being affected by repeated documents. This step could seem unnecessary, but many corpora contain repeated texts. Since ProSiT models topics as the average representation of their associated documents, removing duplicates reduces the impact of repeated documents on topic identification;

2. computing the cosine similarity between all documents in M:

   C = M Mᵀ / (‖M‖ ‖Mᵀ‖)    (1)

   The norms are computed row-wise and multiplied as column and row vectors, respectively. The resulting matrix C is a square, symmetric similarity matrix over the unique input documents;

3. mapping each data point to the set of all its neighbors (including itself), that is, the set of points more similar than the threshold α. The output of this procedure is a list of redundant sets. However, it also identifies isolated instances, too far from any other, which result in singleton sets. Both situations are illustrated in Figure 1;

4. removing the groups that are repeated subsets of other groups;

5. computing the groups' centroids, i.e., the documents' average vectors;

6. assigning isolated instances (outliers) to the group having the closest centroid;

7. recomputing the cluster centroids, including the outliers in the groups to which they were assigned;

8. re-starting the iteration, using the centroids as new data points (keeping track of the original data points associated with these centroids).

Figure 1: Examples of data points (lowercase letters) and the relative neighbourhood areas (uppercase letters), determined by α. The dotted lines represent sets of neighbors that are subsets of other sets.
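The following NumPy sketch illustrates one pass of this procedure over dense row vectors. It is a minimal re-implementation for illustration only, with our own function and variable names, not the code of the released package; the bookkeeping that links centroids back to their original documents is omitted.

```python
import numpy as np

def prosit_iteration(M, threshold):
    """One pass of Step 1 (sketch): neighbor sets above a cosine-similarity
    threshold, subset removal, centroids, and outlier re-assignment."""
    # 1. Drop duplicated vectors.
    M = np.unique(M, axis=0)

    # 2. Cosine similarity between all documents (Eq. 1).
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    C = (M @ M.T) / (norms @ norms.T)

    # 3. Map each point to the set of its neighbors above the threshold.
    groups = [frozenset(np.flatnonzero(row >= threshold)) for row in C]

    # 4. Remove duplicated groups and groups that are subsets of other groups.
    groups = set(groups)
    groups = [g for g in groups if not any(g < other for other in groups)]

    # Remaining singletons are isolated instances (outliers).
    clusters = [g for g in groups if len(g) > 1]
    outliers = [next(iter(g)) for g in groups if len(g) == 1]
    if not clusters:          # nothing can be merged at this threshold
        return M, False

    # 5. Compute the groups' centroids (average document vectors).
    members = [list(g) for g in clusters]
    centroids = np.vstack([M[idx].mean(axis=0) for idx in members])

    # 6.-7. Assign outliers to the closest centroid and recompute centroids.
    for o in outliers:
        sims = centroids @ M[o] / (
            np.linalg.norm(centroids, axis=1) * np.linalg.norm(M[o]))
        members[int(sims.argmax())].append(o)
    centroids = np.vstack([M[idx].mean(axis=0) for idx in members])

    # 8. The centroids become the data points of the next iteration.
    return centroids, True
```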
Similarly to agglomerative hierarchical clustering, each document is initially considered its own topic. The number of topics is reduced by iteratively collapsing documents together. However, ProSiT and agglomerative hierarchical clustering differ substantially. In the latter, pairs of instances are iteratively joined, creating a tree that is representable by dendrograms and in which each instance belongs to one cluster only. In ProSiT, as shown in Figure 1, we create sets of instances lying within a similarity threshold, where overlap between sets is allowed. Therefore, a given instance can contribute to the centroid computation of more than one cluster, leading to smoother topic representations.
However, there is a problem: iteratively averaging document vectors pushes the resulting centroids towards the mean of means, i.e., the global centroid of the entire data set. To prevent the document representations from collapsing into one point, in ProSiT we set an increasingly higher cosine similarity threshold (CST) at every training iteration iter, according to a hyperbolic function

CST = (iter − α) / iter    (2)

whose slope depends on the α parameter.
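As a quick numeric illustration of Eq. 2, the snippet below prints the threshold over the first few iterations for some example α values (these particular values are not taken from the paper):

```python
# Progressive similarity threshold of Eq. 2 for a few illustrative alpha values.
for alpha in (0.05, 0.1, 0.3):
    thresholds = [(it - alpha) / it for it in range(1, 6)]
    print(alpha, [round(t, 3) for t in thresholds])
# alpha = 0.05 -> [0.95, 0.975, 0.983, 0.988, 0.99]
# The threshold rises hyperbolically towards 1, so later iterations only merge
# centroids that are already very similar, preventing the collapse towards the
# global centroid described above.
```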
Figure 2: Some examples of the α curve.
Since the thresholds are cosine similarity values, the maximal value is 1. We will therefore want an α value close to 0. Figure 2 shows the hyperbolic curve for various values of α.

ProSiT does not require the number of topics/clusters as an input: they emerge from the process. The topics correspond to the clusters identified at each iteration, whose number becomes progressively smaller until the algorithm reaches convergence. Convergence is reached when all the computed centroids are farther from each other (i.e., less similar) than the current threshold.

Since the topics are represented as points in the documents' vector space, we can compute the affinity between any (unseen) document and each topic in terms of distance, similarly to LDA, which measures the topics' presence in each document.
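A minimal sketch of this inference step, assuming dense vectors and using our own function names, could look as follows:

```python
import numpy as np

def topic_affinities(doc_vectors, topic_centroids):
    """Cosine similarity between (possibly unseen) document vectors and the
    topic centroids found in Step 1; higher values mean closer topics."""
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    t = topic_centroids / np.linalg.norm(topic_centroids, axis=1, keepdims=True)
    return d @ t.T  # shape: (n_documents, n_topics)
```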
2.2 Step 2: Selecting topic descriptors

Once ProSiT has identified a set of topics, the second step consists of selecting the documents that best represent each topic and extracting the most representative terms from them. While it is possible to extract the descriptors at every iteration (each corresponding to a different number of topics), for ease of exposition, we show the results for topics ranging between 5 and 25.
In hard clustering methods, identifying the documents that provide the descriptors is straightforward, as every document belongs to a unique topic. In our case, though, we consider the continuous distance between documents and topics, so that documents can be associated with several topics at different distance levels. We are interested in selecting the most representative documents, i.e., those closest to the topic centroids. To achieve this, we use another threshold, β, which defines how close to a topic a document must be for it to be used for the descriptors. β gives the fraction of documents closest to each topic centroid, ranging from 0 to 1 (where 1 means the whole data set).
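For illustration, a hypothetical selection step over the document-topic affinity matrix (as computed in the earlier sketch) could look like this; the function and variable names are ours:

```python
import numpy as np

def representative_docs(affinities, beta):
    """For every topic, keep the beta fraction of documents that are closest
    to its centroid (affinities: n_documents x n_topics similarity matrix)."""
    n_docs, n_topics = affinities.shape
    n_keep = max(1, int(round(beta * n_docs)))
    order = np.argsort(-affinities, axis=0)   # most similar documents first
    return [order[:n_keep, k].tolist() for k in range(n_topics)]
```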
Once we have selected the set of documents that are most representative of each topic, we extract the descriptors using the information gain (IG) value (Forman et al., 2003), computed for each topic. IG measures the entropy of features selected from instances belonging to different classes, and it is usually used for feature selection in binary classification tasks. Here, we propose an original use of IG in a multi-class scenario, where the topics represent the classes. We rank the terms according to their IG (i.e., their probability of belonging to a topic) and select the top n words. This procedure, similarly to the document-topic affinity, can measure the affinity between words and topics.
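A sketch of this descriptor-extraction step is shown below. It approximates the per-topic information gain in a one-vs-rest fashion with scikit-learn's discrete mutual information (for discrete variables, mutual information and information gain coincide); the exact IG formulation used in the paper may differ, and all names are ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def topic_descriptors(term_counts, doc_topic, vocab, top_n=10):
    """Rank candidate descriptor words per topic by information gain.

    term_counts: (n_docs, n_terms) count matrix of the representative documents
    doc_topic:   (n_docs,) topic index assigned to each representative document
    vocab:       list of n_terms strings aligned with the columns of term_counts
    """
    descriptors = {}
    for topic in np.unique(doc_topic):
        y = (doc_topic == topic).astype(int)    # one topic vs. the rest
        ig = mutual_info_classif(term_counts, y, discrete_features=True)
        top = np.argsort(-ig)[:top_n]
        descriptors[int(topic)] = [vocab[i] for i in top]
    return descriptors
```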
3 Experimental Settings

3.1 Data sets

We test ProSiT on four data sets, two with long documents and two with short documents. For long documents, we use the Reuters and Google News data sets, which have previously been used by Sia et al. (2020) and Qiang et al. (2020), respectively; the Reuters data set is available at https://www.nltk.org/book/ch02.html. For short documents, we use Wikipedia abstracts from DBpedia (available at https://wiki.dbpedia.org/downloads-2016-10), the same data set used by Bianchi et al. (2021b), to which we refer as Wiki20K, and a tweet data set used for topic models by Qiang et al. (2020). Wiki20K (Bianchi et al., 2021b) contains Wikipedia abstracts filtered to contain only the 2,000 most frequent words of the vocabulary. Tweet and Google News are standard data sets in the community and were released by Qiang et al. (2020); both data sets have been preprocessed (e.g., stop words have been removed). Table 1 contains descriptive statistics for the data sets. We use a small vocabulary size, which is desirable for most topic-model scenarios, where including extremely low-frequency terms would result in an excessively fine-grained set of topics.

Data set      Docs    Vocab.  Mean words  Mean words (pre-proc.)
Reuters       10,788    949       130.11                   55.02
Google News   10,950  2,000       191.98                   68.00
Wiki20K       20,000  2,000        49.82                   17.44
Tweets2011     2,472  5,098         8.56                    8.56

Table 1: Corpora statistics, with vocabulary size and mean number of words per document before and after pre-processing.
3.2 Metrics

We evaluate the topics with four metrics for coherence and two for distinctiveness. First, we use standard coherence measures, namely CV and Normalized Pointwise Mutual Information (NPMI) (Röder et al., 2015). We also consider the Rank-Biased Overlap (RBO) (Webber et al., 2010), a discrete measure of overlap between sequences.
In addition, we use the inverted RBO (IRBO) score, that is, 1 − RBO (Bianchi et al., 2021a). This score describes how different the topics are from each other on average. Lastly, similarly to the approach of Ding et al. (2018), we use an external word-embedding-based coherence measure (WECO) to compute coherence on an external domain. This metric computes the average pairwise similarity of the words in each topic and averages the results. We use the standard GoogleNews word embeddings commonly used in the literature (Mikolov et al., 2013).
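As a sketch, a WECO-style score can be computed as follows, given the top words of each topic and a mapping from words to external embedding vectors (e.g., a gensim KeyedVectors object for the GoogleNews model); out-of-vocabulary words are simply skipped here, which may differ from the exact handling in the paper:

```python
import numpy as np
from itertools import combinations

def weco(topics, word_vectors):
    """Average pairwise cosine similarity of each topic's descriptor words,
    averaged over all topics."""
    topic_scores = []
    for words in topics:
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        sims = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
                for a, b in combinations(vecs, 2)]
        if sims:
            topic_scores.append(np.mean(sims))
    return float(np.mean(topic_scores))
```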
Concerning distinctiveness, that is, how clearly the topics differentiate from each other, we follow Mimno and Lee (2014) and measure Topic Specificity (TS) and Topic Dissimilarity (TD). The first is the average Kullback-Leibler divergence (Kullback and Leibler, 1951) from each topic's conditional distribution to the corpus distribution; the second is based on the conditional distribution of words across different topics.

While we discuss the outcomes of all metrics, due to space constraints we only show the CV results here and report the full results in the Appendix.
3.3 Baselines

We compare ProSiT with two groups of models, differing by the text representations they require as input: embeddings or sparse count features. ProSiT can use either. All comparison methods require the number of latent topics as an a priori input parameter. We evaluate their performance for inputs of 5, 10, 15, 20, and 25 topics to cover a defined range. However, recall that ProSiT does not take the number of topics as input but instead identifies them automatically. Therefore, we also evaluate the other models on the numbers of topics that ProSiT finds.
In the first group, we consider contextualized topic models (Bianchi et al., 2021a, CTM) and ZeroShot topic models (Bianchi et al., 2021b, ZSTM). They introduce the use of contextual embeddings in neural topic modeling.